Average New Users Per DOW

Started by Robert DiFalcoalmost 11 years ago13 messagesgeneral

robert.difalco@gmail.com

almost 11 years ago

I'm not sure how to create a result where I get the average number of new
users per day of the week. My issues are that days that did not have any
new users will not be factored into the average, giving an overinflated
result.

This is what I started with:

WITH userdays AS
(SELECT u.created::DATE AS created,
to_char(u.created,'Dy') AS d,
COUNT(*) AS total
FROM users u
GROUP BY 1,2),
userdays_avg AS
(SELECT extract('dow'
FROM created) AS nDay,
d AS "Day",
AVG(total) AS "New Users"
FROM userdays
GROUP BY 1,2
ORDER BY 1)
SELECT "Day", "New Users"
FROM userdays_avg
ORDER BY nDay;

But you can see it wont give correct results since (for example) Monday's
with no new users will not be counted in the average as 0.

TIA

Paul Jungwirth

pj@illuminatedcomputing.com

almost 11 years ago

In reply to: Robert DiFalco (#1)

Re: Average New Users Per DOW

I'm not sure how to create a result where I get the average number of
new users per day of the week. My issues are that days that did not
have any new users will not be factored into the average

This is a pretty common problem with time-series queries when there is
sparse data. My go-to solution is to use generate_series---in your case
from 0 to 6---then do a left join from there to your actual data.

Paul

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Robert DiFalco

robert.difalco@gmail.com

almost 11 years ago

In reply to: Paul Jungwirth (#2)

Re: Average New Users Per DOW

Thanks Paul, I guess I'm not sure how a generate_series between 0 to 6
would solve this problem. Wouldn't I have to generate a series based on the
date range (by day) and then group by DOW _after_ that? Can you give me an
example of how I'd do it with a series based on 0 to 6?

On Mon, Jul 6, 2015 at 10:58 AM, Paul Jungwirth <pj@illuminatedcomputing.com

Show quoted text

wrote:

I'm not sure how to create a result where I get the average number of
new users per day of the week. My issues are that days that did not
have any new users will not be factored into the average

This is a pretty common problem with time-series queries when there is
sparse data. My go-to solution is to use generate_series---in your case
from 0 to 6---then do a left join from there to your actual data.

Paul

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

David G. Johnston

david.g.johnston@gmail.com

almost 11 years ago

In reply to: Robert DiFalco (#3)

Re: Average New Users Per DOW

On Mon, Jul 6, 2015 at 2:04 PM, Robert DiFalco <robert.difalco@gmail.com>
wrote:

Wouldn't I have to generate a series based on the date range (by day) and
then group by DOW _after_ that?

You are correct.

WITH userdays (dow, user_count) AS ( existing_query, more or less )
, day_counts (dow, count_of_days) AS ( SELECT
generate_series(user_earliest_created_date, user_most_recent_created_date)
)
SELECT dow, coalesce(user_count, 0) / count_of_days
FROM day_counts
LEFT JOIN userdays USING (dow)
;

David J.

Paul Jungwirth

pj@illuminatedcomputing.com

almost 11 years ago

In reply to: Robert DiFalco (#3)

Re: Average New Users Per DOW

Thanks Paul, I guess I'm not sure how a generate_series between 0 to 6
would solve this problem. Wouldn't I have to generate a series based on
the date range (by day) and then group by DOW _after_ that? Can you give
me an example of how I'd do it with a series based on 0 to 6?

Looks like David Johnston beat me to it! :-) But this is what I had in mind:

SELECT s.d AS dow,
COUNT(u.id) c
FROM generate_series(0, 6) s(d)
LEFT OUTER JOIN users u
ON EXTRACT(dow FROM created) = s.d
GROUP BY dow
ORDER BY dow
;

You can also get human-readable DOW names by creating a 7-row CTE table
and joining to it based on the numeric dow.

Paul

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Robert DiFalco

robert.difalco@gmail.com

almost 11 years ago

In reply to: Paul Jungwirth (#5)

Re: Average New Users Per DOW

Paul, I'm sure I'm missing something but it seems like your approach will
not work. It's because the LEFT OUTER JOIN is on the numeric day of the
week. So if you had this query going over weeks or months of data wouldn't
you have the same issue with the days that had no new users not being
factored into the AVG? I ended up doing something like this, which seems
to work pretty well.

WITH usersByDay AS (
SELECT cDate, COUNT(*) AS total
FROM (
SELECT generate_series(
{CALENDAR_INTERVAL.START}::DATE,
{CALENDAR_INTERVAL.END}::DATE,
interval '1 day')::DATE AS cDate
) AS c
LEFT OUTER JOIN users u ON u.created::DATE = c.cDate
GROUP BY cDate),
avgUsersByDOW AS (
SELECT extract('dow' FROM cDate) AS nDay,
to_char(cDate,'Dy') AS "Day",
ROUND(AVG(total), 2) AS "New Users"
FROM usersByDay
GROUP BY 1, 2
ORDER BY 1)
SELECT "Day", "New Users" FROM avgUsersByDOW ORDER BY nDay

On Mon, Jul 6, 2015 at 11:30 AM, Paul Jungwirth <pj@illuminatedcomputing.com

Show quoted text

wrote:

Thanks Paul, I guess I'm not sure how a generate_series between 0 to 6

would solve this problem. Wouldn't I have to generate a series based on
the date range (by day) and then group by DOW _after_ that? Can you give
me an example of how I'd do it with a series based on 0 to 6?

Looks like David Johnston beat me to it! :-) But this is what I had in
mind:

SELECT s.d AS dow,
COUNT(u.id) c
FROM generate_series(0, 6) s(d)
LEFT OUTER JOIN users u
ON EXTRACT(dow FROM created) = s.d
GROUP BY dow
ORDER BY dow
;

You can also get human-readable DOW names by creating a 7-row CTE table
and joining to it based on the numeric dow.

Paul

David G. Johnston

david.g.johnston@gmail.com

almost 11 years ago

In reply to: Robert DiFalco (#6)

Re: Average New Users Per DOW

Please follow list conventions and either respond inline or bottom-post.

On Mon, Jul 6, 2015 at 3:30 PM, Robert DiFalco <robert.difalco@gmail.com>
wrote:

Paul, I'm sure I'm missing something but it seems like your approach will
not work. It's because the LEFT OUTER JOIN is on the numeric day of the
week. So if you had this query going over weeks or months of data wouldn't
you have the same issue with the days that had no new users not being
factored into the AVG? I ended up doing something like this, which seems
to work pretty well.

WITH usersByDay AS (
SELECT cDate, COUNT(*) AS total
FROM (
SELECT generate_series(
{CALENDAR_INTERVAL.START}::DATE,
{CALENDAR_INTERVAL.END}::DATE,
interval '1 day')::DATE AS cDate
) AS c
LEFT OUTER JOIN users u ON u.created::DATE = c.cDate
GROUP BY cDate),

I am fairly certain this does not give you the correct results.
Specifically, the minimum value for each cDate is going to be 1 since
count(*) counts NULLs. count(u) should probably work.

SELECT dt, count(uid), count(*)
FROM generate_series('2015-01-01'::date, '2015-01-05'::date, '1
day'::interval) gs (dt)
LEFT JOIN (VALUES ('2015-01-01'::date, 1),
('2015-01-01',2),('2015-01-02',3)) users (dt, uid)
USING (dt)
GROUP BY dt
;

David J.

Robert DiFalco

robert.difalco@gmail.com

almost 11 years ago

In reply to: David G. Johnston (#7)

Re: Average New Users Per DOW

I am fairly certain this does not give you the correct results.
Specifically, the minimum value for each cDate is going to be 1 since
count(*) counts NULLs. count(u) should probably work.

Yes you are right, I forgot to change COUNT(*) to COUNT(id), as you

mention COUNT(u.*) will also work. I just couldn't get the idea of
generating a sequence form 0 to 6 to work correctly. The approach I'm using
seems to give the correct results (with COUNT(u.id)).

David G. Johnston

david.g.johnston@gmail.com

almost 11 years ago

In reply to: Robert DiFalco (#8)

Re: Average New Users Per DOW

On Mon, Jul 6, 2015 at 4:40 PM, Robert DiFalco <robert.difalco@gmail.com>
wrote:

I am fairly certain this does not give you the correct results.

Specifically, the minimum value for each cDate is going to be 1 since
count(*) counts NULLs. count(u) should probably work.

Yes you are right, I forgot to change COUNT(*) to COUNT(id), as you

mention COUNT(u.*) will also work. I just couldn't get the idea of
generating a sequence form 0 to 6 to work correctly. The approach I'm using
seems to give the correct results (with COUNT(u.id)).

Correct. generate_series(0,6) won't work since there is no context as
whether it is supposed to cover a single week or multiple years or anything
in between.

Any non-null column can be supplied to the count() function: count ignores
nulls. In this case you want to ignore the placeholder null that you are
creating during the left join. My original suggestion avoided these extra
placeholder values and instead forces you to process the master date range
and the user-by-date pieces separately and then substitute 0 for any master
date where the corresponding user-by-date was missing. If performance were
important it may be worth testing both versions otherwise my guess is this
version is more readable (for you).

David J.

#10

Michael Nolan

htfoot@gmail.com

almost 11 years ago

In reply to: Robert DiFalco (#1)

Re: Average New Users Per DOW

On 7/6/15, Robert DiFalco <robert.difalco@gmail.com> wrote:

I'm not sure how to create a result where I get the average number of new
users per day of the week. My issues are that days that did not have any
new users will not be factored into the average, giving an overinflated
result.

This is what I started with:

WITH userdays AS
(SELECT u.created::DATE AS created,
to_char(u.created,'Dy') AS d,
COUNT(*) AS total
FROM users u
GROUP BY 1,2),
userdays_avg AS
(SELECT extract('dow'
FROM created) AS nDay,
d AS "Day",
AVG(total) AS "New Users"
FROM userdays
GROUP BY 1,2
ORDER BY 1)
SELECT "Day", "New Users"
FROM userdays_avg
ORDER BY nDay;

But you can see it wont give correct results since (for example) Monday's
with no new users will not be counted in the average as 0.

One way to handle this is to union your query with one that has a
generate_series (0,6) for the DOW column and nulls for the other
columns, then treat both that and your original query as a subquery
and do your averages, since nulls are not included in either count()
or average() aggregates:

select dow, count(*), avg(some_column) from (
select extract ('dow' from some_date) as dow, some_number from some_table
union select generate_series(0,6) as dow, null as some_number) as x
group by 1 order by 1

--
Mike Nolan
nolan@tssi.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#11

David G. Johnston

david.g.johnston@gmail.com

almost 11 years ago

In reply to: Michael Nolan (#10)

Re: Average New Users Per DOW

On Mon, Jul 6, 2015 at 6:16 PM, Michael Nolan <htfoot@gmail.com> wrote:

But you can see it wont give correct results since (for example) Monday's
with no new users will not be counted in the average as 0.

One way to handle this is to union your query with one that has a
generate_series (0,6) for the DOW column and nulls for the other
columns, then treat both that and your original query as a subquery
and do your averages, since nulls are not included in either count()
or average() aggregates:

select dow, count(*), avg(some_column) from (
select extract ('dow' from some_date) as dow, some_number from some_table
union select generate_series(0,6) as dow, null as some_number) as x
group by 1 order by 1

I'm not seeing how this is at all useful.

As you said, the average function ignores the null introduced by the union
so the final answer with and without the union is the same.

No matter how you work a "generate_series(0,6)" based query it will never
be able to give a correct answer expect accidentally. Each actual missing
date contributes a ZERO to the numerator and a ONE to the denominator in
the final division that constitutes the mean-average. You must have those
dates.

In a series with four Mondays ( 4, 0, 0, 8 ) the average desired is 3, not
6 (or 4). There is no way to make the denominator (number of Mondays) 4
instead of 3 by using generate_series(0,6).

David J.

#12

Michael Nolan

htfoot@gmail.com

almost 11 years ago

In reply to: David G. Johnston (#11)

Re: Average New Users Per DOW

On Mon, Jul 6, 2015 at 5:50 PM, David G. Johnston <
david.g.johnston@gmail.com> wrote:

On Mon, Jul 6, 2015 at 6:16 PM, Michael Nolan <htfoot@gmail.com> wrote:

But you can see it wont give correct results since (for example)

Monday's

with no new users will not be counted in the average as 0.

One way to handle this is to union your query with one that has a
generate_series (0,6) for the DOW column and nulls for the other
columns, then treat both that and your original query as a subquery
and do your averages, since nulls are not included in either count()
or average() aggregates:

select dow, count(*), avg(some_column) from (
select extract ('dow' from some_date) as dow, some_number from some_table
union select generate_series(0,6) as dow, null as some_number) as x
group by 1 order by 1

I'm not seeing how this is at all useful.

As you said, the average function ignores the null introduced by the union
so the final answer with and without the union is the same.

No matter how you work a "generate_series(0,6)" based query it will never
be able to give a correct answer expect accidentally. Each actual missing
date contributes a ZERO to the numerator and a ONE to the denominator in
the final division that constitutes the mean-average. You must have those
dates.

In a series with four Mondays ( 4, 0, 0, 8 ) the average desired is 3, not
6 (or 4). There is no way to make the denominator (number of Mondays) 4
instead of 3 by using generate_series(0,6).

David J.

Ah, you're right. The problem is that avg() is going to treat missing data
as missing (of course.) It will either be necessary to add in the missing
days as a zero value (but ONLY the missing days, requiring some kind of
'not exists' select, I suppose) or to 'roll your own' average function by
adding in the missing days as I did with a union in my earlier post.

The real problem is the DOW is not the field where the missing data is, it
is in the underlying date field.

I created a test dataset. It has 1 day missing in a two-week period from
June 1st through June 14th (Sunday, June 7th). Here's what the OP's SQL
generates:

Day       New Users
--- ----------------------
Sun     2.0000000000000000
Mon     4.5000000000000000
Tue     2.0000000000000000
Wed     4.5000000000000000
Thu 1.00000000000000000000
Fri     3.0000000000000000
Sat     3.0000000000000000

Here's the SQL to generate the missing day and do the average function by
hand:

select "Day", "New Users" from (
select dow, "Day", sum(total) / count(distinct created) as "New Users"from
(select extract(dow from created) as dow,
to_char(created,'Dy') as "Day", created, created2, total from

(select created, created as created2, count(*) as total from users
group by 1, 2
union
(select generate_series('2015-06-01 00:00'::timestamp,
'2015-06-14'::timestamp,'1 day')::date, null, 0) ) as x) as y
group by 1, 2) as z
order by dow

Day       New Users
--- ----------------------
Sun 1.00000000000000000000
Mon     4.5000000000000000
Tue     2.0000000000000000
Wed     4.5000000000000000
Thu 1.00000000000000000000
Fri     3.0000000000000000
Sat     3.0000000000000000

--
Mike Nolan
nolan@tssi.com

#13

Michael Nolan

htfoot@gmail.com

almost 11 years ago

In reply to: Michael Nolan (#12)

Re: Average New Users Per DOW

Here's a minor refinement that doesn't require knowing the range of dates
in the users table:

(select created, created as created2, count(*) as total from users
group by 1, 2
union
(select generate_series(
(select min(created)::timestamp from users),
(select max(created)::timestamp from users),
'1 day')::date, null, 0) ) as x) as y
group by 1, 2) as z
order by dow

Day       New Users
--- ----------------------
Sun 1.00000000000000000000
Mon     4.5000000000000000
Tue     2.0000000000000000
Wed     4.5000000000000000
Thu 1.00000000000000000000
Fri     3.0000000000000000
Sat     3.0000000000000000

--
Mike Nolan