Retrieving multiple columns from a subquery

Started by Chris Hanksalmost 14 years ago5 messagesgeneral
Jump to latest
#1Chris Hanks
christopher.m.hanks@gmail.com

Hello -

I have two tables:

CREATE TABLE users
(
id serial NOT NULL,
created_at timestamp with time zone NOT NULL,
last_seen_at timestamp with time zone NOT NULL,
-- some other columns...
)

CREATE TABLE emails
(
user_id integer NOT NULL,
address text NOT NULL,
created_at timestamp with time zone NOT NULL,
confirmed_at timestamp with time zone,
-- some other columns...
CONSTRAINT emails_user_id_fkey FOREIGN KEY (user_id)
REFERENCES users (id) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE
)

The gist is that a single user can be related to multiple emails, and some
email addresses are confirmed (they've clicked a link I've sent there, so I
know it's valid) and some aren't.

Routinely, when I'm fetching users from the db I also want to get the best
email address for each user. That is, the email address that they've
confirmed the most recently, or failing that, the one that they created the
most recently. I've been doing this via a subselect:

SELECT *,
(SELECT "address" FROM "emails" WHERE ("user_id" = "id") ORDER BY
"confirmed_at" DESC NULLS LAST, "created_at" DESC LIMIT 1) AS "best_email"
FROM "users"

I like the subquery approach because I can use my ORM to easily append it
to whatever query I'm running against the users table (whether I'm looking
up one user or many), without having to do an explicit join and trim out
the unnecessary rows. Also, in the future I'm planning on adding additional
subqueries to get (for example) each user's current subscription status,
and I'm afraid that the joins will get ungainly. Besides, I find subqueries
much easier to reason about than joins.

My problem is that now I need to get not only the best email's address, but
whether it is confirmed (whether confirmed_at is not null). My first
attempt was to simply repeat the subquery:

SELECT *,
(SELECT "address" FROM "emails" WHERE ("user_id" = "id") ORDER BY
"confirmed_at" DESC NULLS LAST, "created_at" DESC LIMIT 1) AS "best_email",
((SELECT "confirmed_at" FROM "emails" WHERE ("user_id" = "id") ORDER BY
"confirmed_at" DESC NULLS LAST, "created_at" DESC LIMIT 1) IS NOT NULL) AS
"best_email_confirmed"
FROM "users"

I had hoped Postgres would recognize that the two subqueries were
identical, but judging from the explain output from my development database
it's not, and it's simply running the subquery twice instead:

"Seq Scan on users (cost=0.00..333.65 rows=13 width=81)"
" SubPlan 1"
" -> Limit (cost=12.79..12.79 rows=1 width=48)"
" -> Sort (cost=12.79..12.80 rows=5 width=48)"
" Sort Key: public.emails.confirmed_at,
public.emails.created_at"
" -> Bitmap Heap Scan on emails (cost=4.29..12.76 rows=5
width=48)"
" Recheck Cond: (user_id = users.id)"
" -> Bitmap Index Scan on emails_pkey
(cost=0.00..4.29 rows=5 width=0)"
" Index Cond: (user_id = users.id)"
" SubPlan 2"
" -> Limit (cost=12.79..12.79 rows=1 width=16)"
" -> Sort (cost=12.79..12.80 rows=5 width=16)"
" Sort Key: public.emails.confirmed_at,
public.emails.created_at"
" -> Bitmap Heap Scan on emails (cost=4.29..12.76 rows=5
width=16)"
" Recheck Cond: (user_id = users.id)"
" -> Bitmap Index Scan on emails_pkey
(cost=0.00..4.29 rows=5 width=0)"
" Index Cond: (user_id = users.id)"

It would be ideal if I could pull both results from the same subquery,
something like:

SELECT *,
(SELECT "address", "confirmed_at" IS NOT NULL FROM "emails" WHERE
("user_id" = "id") ORDER BY "confirmed_at" DESC NULLS LAST, "created_at"
DESC LIMIT 1) AS ("best_email", "best_email_confirmed")
FROM "users"

But this isn't valid syntax. I tried putting the subquery under a FROM
clause, but it won't work with my "user_id" = "id" condition, and throws
"ERROR: subquery in FROM cannot refer to other relations of same query
level". I think CTEs might be an answer, but I'm stuck on 8.3 for the
foreseeable future, which doesn't support them.

Does anyone have any suggestions?

#2Chris Hanks
christopher.m.hanks@gmail.com
In reply to: Chris Hanks (#1)
Re: Retrieving multiple columns from a subquery

Nothing? Are subqueries just not meant to be used this way?

On Wed, May 9, 2012 at 9:42 AM, Chris Hanks
<christopher.m.hanks@gmail.com>wrote:

Show quoted text

Hello -

I have two tables:

CREATE TABLE users
(
id serial NOT NULL,
created_at timestamp with time zone NOT NULL,
last_seen_at timestamp with time zone NOT NULL,
-- some other columns...
)

CREATE TABLE emails
(
user_id integer NOT NULL,
address text NOT NULL,
created_at timestamp with time zone NOT NULL,
confirmed_at timestamp with time zone,
-- some other columns...
CONSTRAINT emails_user_id_fkey FOREIGN KEY (user_id)
REFERENCES users (id) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE
)

The gist is that a single user can be related to multiple emails, and some
email addresses are confirmed (they've clicked a link I've sent there, so I
know it's valid) and some aren't.

Routinely, when I'm fetching users from the db I also want to get the best
email address for each user. That is, the email address that they've
confirmed the most recently, or failing that, the one that they created the
most recently. I've been doing this via a subselect:

SELECT *,
(SELECT "address" FROM "emails" WHERE ("user_id" = "id") ORDER BY
"confirmed_at" DESC NULLS LAST, "created_at" DESC LIMIT 1) AS "best_email"
FROM "users"

I like the subquery approach because I can use my ORM to easily append it
to whatever query I'm running against the users table (whether I'm looking
up one user or many), without having to do an explicit join and trim out
the unnecessary rows. Also, in the future I'm planning on adding additional
subqueries to get (for example) each user's current subscription status,
and I'm afraid that the joins will get ungainly. Besides, I find subqueries
much easier to reason about than joins.

My problem is that now I need to get not only the best email's address,
but whether it is confirmed (whether confirmed_at is not null). My first
attempt was to simply repeat the subquery:

SELECT *,
(SELECT "address" FROM "emails" WHERE ("user_id" = "id") ORDER BY
"confirmed_at" DESC NULLS LAST, "created_at" DESC LIMIT 1) AS "best_email",
((SELECT "confirmed_at" FROM "emails" WHERE ("user_id" = "id") ORDER BY
"confirmed_at" DESC NULLS LAST, "created_at" DESC LIMIT 1) IS NOT NULL) AS
"best_email_confirmed"
FROM "users"

I had hoped Postgres would recognize that the two subqueries were
identical, but judging from the explain output from my development database
it's not, and it's simply running the subquery twice instead:

"Seq Scan on users (cost=0.00..333.65 rows=13 width=81)"
" SubPlan 1"
" -> Limit (cost=12.79..12.79 rows=1 width=48)"
" -> Sort (cost=12.79..12.80 rows=5 width=48)"
" Sort Key: public.emails.confirmed_at,
public.emails.created_at"
" -> Bitmap Heap Scan on emails (cost=4.29..12.76 rows=5
width=48)"
" Recheck Cond: (user_id = users.id)"
" -> Bitmap Index Scan on emails_pkey
(cost=0.00..4.29 rows=5 width=0)"
" Index Cond: (user_id = users.id)"
" SubPlan 2"
" -> Limit (cost=12.79..12.79 rows=1 width=16)"
" -> Sort (cost=12.79..12.80 rows=5 width=16)"
" Sort Key: public.emails.confirmed_at,
public.emails.created_at"
" -> Bitmap Heap Scan on emails (cost=4.29..12.76 rows=5
width=16)"
" Recheck Cond: (user_id = users.id)"
" -> Bitmap Index Scan on emails_pkey
(cost=0.00..4.29 rows=5 width=0)"
" Index Cond: (user_id = users.id)"

It would be ideal if I could pull both results from the same subquery,
something like:

SELECT *,
(SELECT "address", "confirmed_at" IS NOT NULL FROM "emails" WHERE
("user_id" = "id") ORDER BY "confirmed_at" DESC NULLS LAST, "created_at"
DESC LIMIT 1) AS ("best_email", "best_email_confirmed")
FROM "users"

But this isn't valid syntax. I tried putting the subquery under a FROM
clause, but it won't work with my "user_id" = "id" condition, and throws
"ERROR: subquery in FROM cannot refer to other relations of same query
level". I think CTEs might be an answer, but I'm stuck on 8.3 for the
foreseeable future, which doesn't support them.

Does anyone have any suggestions?

#3Merlin Moncure
mmoncure@gmail.com
In reply to: Chris Hanks (#1)
Re: Retrieving multiple columns from a subquery

On Wed, May 9, 2012 at 11:42 AM, Chris Hanks
<christopher.m.hanks@gmail.com> wrote:

Hello -

I have two tables:

CREATE TABLE users
(
  id serial NOT NULL,
  created_at timestamp with time zone NOT NULL,
  last_seen_at timestamp with time zone NOT NULL,
  -- some other columns...
)

CREATE TABLE emails
(
  user_id integer NOT NULL,
  address text NOT NULL,
  created_at timestamp with time zone NOT NULL,
  confirmed_at timestamp with time zone,
  -- some other columns...
  CONSTRAINT emails_user_id_fkey FOREIGN KEY (user_id)
      REFERENCES users (id) MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE CASCADE
)

The gist is that a single user can be related to multiple emails, and some
email addresses are confirmed (they've clicked a link I've sent there, so I
know it's valid) and some aren't.

Routinely, when I'm fetching users from the db I also want to get the best
email address for each user. That is, the email address that they've
confirmed the most recently, or failing that, the one that they created the
most recently. I've been doing this via a subselect:

SELECT *,
  (SELECT "address" FROM "emails" WHERE ("user_id" = "id") ORDER BY
"confirmed_at" DESC NULLS LAST, "created_at" DESC LIMIT 1) AS "best_email"
FROM "users"

I like the subquery approach because I can use my ORM to easily append it to
whatever query I'm running against the users table (whether I'm looking up
one user or many), without having to do an explicit join and trim out the
unnecessary rows. Also, in the future I'm planning on adding additional
subqueries to get (for example) each user's current subscription status, and
I'm afraid that the joins will get ungainly. Besides, I find subqueries much
easier to reason about than joins.

My problem is that now I need to get not only the best email's address, but
whether it is confirmed (whether confirmed_at is not null). My first attempt
was to simply repeat the subquery:

SELECT *,
  (SELECT "address" FROM "emails" WHERE ("user_id" = "id") ORDER BY
"confirmed_at" DESC NULLS LAST, "created_at" DESC LIMIT 1) AS "best_email",
  ((SELECT "confirmed_at" FROM "emails" WHERE ("user_id" = "id") ORDER BY
"confirmed_at" DESC NULLS LAST, "created_at" DESC LIMIT 1) IS NOT NULL) AS
"best_email_confirmed"
FROM "users"

I had hoped Postgres would recognize that the two subqueries were identical,
but judging from the explain output from my development database it's not,
and it's simply running the subquery twice instead:

"Seq Scan on users  (cost=0.00..333.65 rows=13 width=81)"
"  SubPlan 1"
"    ->  Limit  (cost=12.79..12.79 rows=1 width=48)"
"          ->  Sort  (cost=12.79..12.80 rows=5 width=48)"
"                Sort Key: public.emails.confirmed_at,
public.emails.created_at"
"                ->  Bitmap Heap Scan on emails  (cost=4.29..12.76 rows=5
width=48)"
"                      Recheck Cond: (user_id = users.id)"
"                      ->  Bitmap Index Scan on emails_pkey
 (cost=0.00..4.29 rows=5 width=0)"
"                            Index Cond: (user_id = users.id)"
"  SubPlan 2"
"    ->  Limit  (cost=12.79..12.79 rows=1 width=16)"
"          ->  Sort  (cost=12.79..12.80 rows=5 width=16)"
"                Sort Key: public.emails.confirmed_at,
public.emails.created_at"
"                ->  Bitmap Heap Scan on emails  (cost=4.29..12.76 rows=5
width=16)"
"                      Recheck Cond: (user_id = users.id)"
"                      ->  Bitmap Index Scan on emails_pkey
 (cost=0.00..4.29 rows=5 width=0)"
"                            Index Cond: (user_id = users.id)"

It would be ideal if I could pull both results from the same subquery,
something like:

SELECT *,
  (SELECT "address", "confirmed_at" IS NOT NULL FROM "emails" WHERE
("user_id" = "id") ORDER BY "confirmed_at" DESC NULLS LAST, "created_at"
DESC LIMIT 1) AS ("best_email", "best_email_confirmed")
FROM "users"

But this isn't valid syntax. I tried putting the subquery under a FROM
clause, but it won't work with my "user_id" = "id" condition, and throws
"ERROR:  subquery in FROM cannot refer to other relations of same query
level". I think CTEs might be an answer, but I'm stuck on 8.3 for the
foreseeable future, which doesn't support them.

Does anyone have any suggestions?

try this. sometimes, but not always, it will avoid the extra subplans:

SELECT u.*, (emails).* FROM
(
SELECT *,
(SELECT e FROM "emails" e WHERE ("user_id" = "id") ORDER BY
"confirmed_at" DESC NULLS LAST, "created_at" DESC LIMIT 1) AS emails,
FROM "users" u
) q

merlin

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Chris Hanks (#2)
Re: Retrieving multiple columns from a subquery

Chris Hanks <christopher.m.hanks@gmail.com> writes:

Nothing? Are subqueries just not meant to be used this way?

The SQL standard says not ;-).

You could approximate it like this:

select ..., (select row(x,y,z) from ...), ... from ...;

as long as you don't mind pulling the composite-value output syntax
apart. This avoids the single-output-column syntactic restriction
by cramming all the values into one column.

[ thinks for a bit... ] It seems like you ought to be able to get PG
to pull the composite values apart again, with something like

select ..., (x).*, ... from
(select ..., (select row(x,y,z) from ...) as x, ...
from ... offset 0) ss;

but when I try this I get
ERROR: record type has not been registered
That's a bug, probably, but dunno how hard to fix. In the meantime you
could work around it by casting the row() expression to a named
composite type; which might be a good idea anyway since there's no other
obvious way to control the column names that will be exposed by the
(x).* expansion.

regards, tom lane

#5Chris Hanks
christopher.m.hanks@gmail.com
In reply to: Tom Lane (#4)
Re: Retrieving multiple columns from a subquery

On Mon, May 14, 2012 at 8:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Chris Hanks <christopher.m.hanks@gmail.com> writes:

Nothing? Are subqueries just not meant to be used this way?

The SQL standard says not ;-).

You could approximate it like this:

select ..., (select row(x,y,z) from ...), ... from ...;

as long as you don't mind pulling the composite-value output syntax
apart. This avoids the single-output-column syntactic restriction
by cramming all the values into one column.

[ thinks for a bit... ] It seems like you ought to be able to get PG
to pull the composite values apart again, with something like

select ..., (x).*, ... from
(select ..., (select row(x,y,z) from ...) as x, ...
from ... offset 0) ss;

but when I try this I get
ERROR: record type has not been registered
That's a bug, probably, but dunno how hard to fix. In the meantime you
could work around it by casting the row() expression to a named
composite type; which might be a good idea anyway since there's no other
obvious way to control the column names that will be exposed by the
(x).* expansion.

regards, tom lane

Thanks, I tried playing with the row function a bit. It gave me the idea to
try:

SELECT *, (SELECT ARRAY[address, (confirmed_at is not null)::text]
FROM "emails"
WHERE ("user_id" = "id")
ORDER BY "confirmed_at" DESC NULLS LAST, "created_at" DESC LIMIT 1) as
best_email
FROM "users"

Since my ORM already handles Postgres arrays for me, this winds up being a
bit easier to handle in my app. It's a bit ugly, but it works. I'll keep
the idea of the named composite type around in case I need to revisit this
later, though.

Thanks again for the advice!