Distinct on a non-sort column

Started by Cstdenisover 14 years ago7 messagesgeneral
Jump to latest
#1Cstdenis
lists@on-track.ca

I am trying to write a query that selects recent submissions (sorted by
submission_date) but only selects the most recent one for each user_id.

example query: /select distinct on (user_id) * from stories order by
date_submitted desc limit 10;/

However postgres will not allow me to filter out duplicate rows with
distinct unless I sort on that column, which would product useless
results for me. Group by seems to have similiar problems, plus the
additional problem of wanting aggregate functions to be used.

I even tried sorting in a subquery, but it still comes out sorted by
user_id: /select distinct on (user_id) * from stories where sid in
(select sid from stories order by date_submitted desc limit 10);/

How can I work around this limitation to get the results sorted the way
I want, then have the duplicates removed? I can't be the only one
running into this limitation, there must be some workaround.

#2Tair Sabirgaliev
tair.sabirgaliev@bee.kz
In reply to: Cstdenis (#1)
Re: Distinct on a non-sort column

On Sun, Nov 6, 2011 at 12:39 AM, Cstdenis <lists@on-track.ca> wrote:

I am trying to write a query that selects recent submissions (sorted by
submission_date) but only selects the most recent one for each user_id.

example query: select distinct on (user_id) * from stories order by
date_submitted desc limit 10;

However postgres will not allow me to filter out duplicate rows with
distinct unless I sort on that column, which would product useless results
for me. Group by seems to have similiar problems, plus the additional
problem of wanting aggregate functions to be used.

I even tried sorting in a subquery, but it still comes out sorted by
user_id: select distinct on (user_id) * from stories where sid in (select
sid from stories order by date_submitted desc limit 10);

How can I work around this limitation to get the results sorted the way I
want, then have the duplicates removed? I can't be the only one running into
this limitation, there must be some workaround.

assuming date_submitted are unique for user_id:
select * from stories s, (select user_id, max(date_submitted) d from
stories group by user_id) ss where s.user_id = ss.user_id and
s.date_submitted = ss.date_submitted;

--
с уважением,
Таир Сабыргалиев
ТОО "BEE Software"
Республика Казахстан, 010000
г.Астана, ул.Сарайшык 34, ВП-27
Тел.: +7 (7172) 56-89-31
Сот.: +7 (702) 2173359
e-mail: tair.sabirgaliev@bee.kz
Tair Sabirgaliev
"BEE Software" Ltd.
Republic of Kazakhstan, 010000
Astana, Sarayshyk str. 34, sect. 27
Tel.: +7 (7172) 56-89-31
Mob.: +7 (702) 2173359
e-mail: tair.sabirgaliev@bee.kz

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Cstdenis (#1)
Re: Distinct on a non-sort column

Cstdenis <lists@on-track.ca> writes:

I am trying to write a query that selects recent submissions (sorted by
submission_date) but only selects the most recent one for each user_id.

example query: /select distinct on (user_id) * from stories order by
date_submitted desc limit 10;/

However postgres will not allow me to filter out duplicate rows with
distinct unless I sort on that column, which would product useless
results for me.

Do the DISTINCT ON in a sub-query, with an ORDER BY appropriate for that
task, and then re-sort the rows the way you want them presented in the
outer query.

SELECT ... FROM
(SELECT DISTINCT ON ... ORDER BY ...) ss
ORDER BY ...;

regards, tom lane

#4John R Pierce
pierce@hogranch.com
In reply to: Cstdenis (#1)
Re: Distinct on a non-sort column

On 11/05/11 11:39 AM, Cstdenis wrote:

example query: /select distinct on (user_id) * from stories order by
date_submitted desc limit 10;/

select user_id,max(date_submitted) from stories group by date_submitted;

?

--
john r pierce N 37, W 122
santa cruz ca mid-left coast

#5Cstdenis
lists@on-track.ca
In reply to: Tom Lane (#3)
Re: Distinct on a non-sort column

On 11/5/2011 12:49 PM, Tom Lane wrote:

Cstdenis<lists@on-track.ca> writes:

I am trying to write a query that selects recent submissions (sorted by
submission_date) but only selects the most recent one for each user_id.
example query: /select distinct on (user_id) * from stories order by
date_submitted desc limit 10;/
However postgres will not allow me to filter out duplicate rows with
distinct unless I sort on that column, which would product useless
results for me.

Do the DISTINCT ON in a sub-query, with an ORDER BY appropriate for that
task, and then re-sort the rows the way you want them presented in the
outer query.

SELECT ... FROM
(SELECT DISTINCT ON ... ORDER BY ...) ss
ORDER BY ...;

regards, tom lane

If I understand that you are proposing as

select * from
(select distinct on (user_id) * from stories as s order
by user_id) as foo
order by date_submitted desc limit 10;

I think it has the problem of which of the stories by that user is
selected is random rather than the most recent being guaranteed (because
the distinct is done before the sort). Or am I misunderstanding this?

The suggestions by others of using max(date_submitted) may be a good
workaround for this, but I also need to do the same thing sorted by a
calculated score value which I do not think will be sufficiently unique
for Tair's suggestion.

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Cstdenis (#5)
Re: Distinct on a non-sort column

Cstdenis <lists@on-track.ca> writes:

If I understand that you are proposing as

select * from
(select distinct on (user_id) * from stories as s order
by user_id) as foo
order by date_submitted desc limit 10;

No, you always need to sort by *more* columns than are listed in
DISTINCT ON. That's what determines which row is picked in each
DISTINCT group. Read the SELECT reference page's example of how
to use DISTINCT ON.

regards, tom lane

#7Cstdenis
lists@on-track.ca
In reply to: Tom Lane (#6)
Re: Distinct on a non-sort column

On 11/5/2011 4:11 PM, Tom Lane wrote:

Cstdenis<lists@on-track.ca> writes:

If I understand that you are proposing as
select * from
(select distinct on (user_id) * from stories as s order
by user_id) as foo
order by date_submitted desc limit 10;

No, you always need to sort by *more* columns than are listed in
DISTINCT ON. That's what determines which row is picked in each
DISTINCT group. Read the SELECT reference page's example of how
to use DISTINCT ON.

regards, tom lane

Thanks. I was missing the obvious. This seems to give the results I need.

It's a shame the query parser isn't able to simply internally process
the query like that -- doing a second sort pass after the distinct
automatically in the case of the sort being on a different column from
distinct (instead of producing an error).