newsfeed type query

Started by Jonathan Vanascoalmost 11 years ago14 messagesgeneral
Jump to latest
#1Jonathan Vanasco
postgres@2xlp.com

I'm trying to upgrade some code that powers a newfeed type stream, and hoping someone can offer some insight on better ways to structure some parts of the query

The part that has me stumped right now...

There are several criteria for why something could appear in a stream. for example, here are 2 handling a posting:

* a posting by a friend
* a posting in a group

the general way I've handled this so far has been simple:

select * from posting where author_id in (select friend_id from friends where user_id = ?) or group_id in (select group_id from memberships where user_id = ?);

now i need to pull in the context of the match (friend, group, both), but I can't figure out how to do this cleanly.

1. if i just add 'case' statements to the select to note the origin, those subselects run again. (ie, the same subquery is executed twice)
2. if i structure this as a union (and note the origin with a string), it takes a lot more work to integrate and sort the 2 separate selects ( eg "select id, timestamp, 'by-friend'" unioned with "in-group")

does anyone have ideas on other approaches to structuring this?

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2Melvin Davidson
melvin6925@gmail.com
In reply to: Jonathan Vanasco (#1)
Re: newsfeed type query

Since you very nicely DID NOT provide the pg version, O/S or table
structure(s), which is what you should do REGARDLESS of the
type of question (it's just the smart and polite thing to do when asking
for help) The best I can suggest is:
SELECT
CASE WHEN context = 'friend' THEN p.junka
WHEN context = 'group' THEN p.junkb
WHEN context = 'both' THEN p.junka || ' ' || p.junkb
END
FROM posting p
where p.author_id in (SELECT f.friend_id
FROM friends f
WHERE f.user_id = ?)
OR p.group_id in (SELECT m.group_id
FROM memberships m
WHERE m.user_id = ?);

On Tue, Apr 28, 2015 at 6:26 PM, Jonathan Vanasco <postgres@2xlp.com> wrote:

I'm trying to upgrade some code that powers a newfeed type stream, and
hoping someone can offer some insight on better ways to structure some
parts of the query

The part that has me stumped right now...

There are several criteria for why something could appear in a stream.
for example, here are 2 handling a posting:

* a posting by a friend
* a posting in a group

the general way I've handled this so far has been simple:

select * from posting where author_id in (select friend_id from
friends where user_id = ?) or group_id in (select group_id from memberships
where user_id = ?);

now i need to pull in the context of the match (friend, group, both), but
I can't figure out how to do this cleanly.

1. if i just add 'case' statements to the select to note the origin, those
subselects run again. (ie, the same subquery is executed twice)
2. if i structure this as a union (and note the origin with a string), it
takes a lot more work to integrate and sort the 2 separate selects ( eg
"select id, timestamp, 'by-friend'" unioned with "in-group")

does anyone have ideas on other approaches to structuring this?

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

--
*Melvin Davidson*
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.

#3Jonathan Vanasco
postgres@2xlp.com
In reply to: Melvin Davidson (#2)
Re: newsfeed type query

Sorry, I was trying to ask something very abstract as I have similar situations on multiple groups of queries/tables (and they're all much more complex).

I'm on pg 9.3

The relevant structure is:

posting:
id
timestamp_publish
group_id__in
user_id__author

friends:
user_id__a
user_id__b

memberships:
user_id
group_id
role_id

-- working sql
CREATE TABLE groups(
id SERIAL NOT NULL PRIMARY KEY
);
CREATE TABLE users(
id SERIAL NOT NULL PRIMARY KEY
);
CREATE TABLE friends (
user_id__a INT NOT NULL REFERENCES users( id ),
user_id__b INT NOT NULL REFERENCES users( id )
);
CREATE TABLE memberships (
user_id INT NOT NULL REFERENCES users( id ),
group_id INT NOT NULL REFERENCES groups( id ),
role_id INT NOT NULL
);
CREATE TABLE posting (
id SERIAL NOT NULL,
timestamp_publish timestamp not null,
group_id__in INT NOT NULL REFERENCES groups(id),
user_id__author INT NOT NULL REFERENCES users(id),
is_published BOOL
);

The output that I'm trying to get is:
posting.id
{the context of the select}
posting.timestamp_publish (this may need to get correlated into other queries)

These approaches had bad performance:

-- huge selects / memory
-- it needs to load everything from 2 tables before it limits
EXPLAIN ANALYZE
SELECT id, feed_context FROM (
SELECT id, timestamp_publish, 'in-group' AS feed_context FROM posting
WHERE (
group_id__in IN (SELECT group_id FROM memberships WHERE user_id = 57 AND role_id IN (1,2,3))
AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP AT TIME ZONE 'UTC')
)
UNION
SELECT id, timestamp_publish, 'by-user' AS feed_context FROM posting
WHERE (
user_id__author IN (SELECT user_id__b FROM friends WHERE user_id__a = 57)
AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP AT TIME ZONE 'UTC')
)
) AS feed
ORDER BY timestamp_publish DESC
LIMIT 10
;

-- selects minimized, but repetitive subqueries
SELECT
id,
CASE
WHEN group_id__in IN (SELECT group_id FROM memberships WHERE user_id = 57 AND role_id IN (1,2,3)) THEN True
ELSE NULL
END AS feed_context_group,
CASE
WHEN user_id__author IN (SELECT user_id__b FROM friends WHERE user_id__a = 57) THEN True
ELSE NULL
END AS feed_context_user
FROM posting
WHERE (
group_id__in IN (SELECT group_id FROM memberships WHERE user_id = 57 AND role_id IN (1,2,3))
OR
user_id__author IN (SELECT user_id__b FROM friends WHERE user_id__a = 57)
)
AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP AT TIME ZONE 'UTC')
ORDER BY timestamp_publish DESC
LIMIT 10
;

On Apr 28, 2015, at 6:56 PM, Melvin Davidson wrote:

Show quoted text

Since you very nicely DID NOT provide the pg version, O/S or table structure(s), which is what you should do REGARDLESS of the
type of question (it's just the smart and polite thing to do when asking for help) The best I can suggest is:
SELECT
CASE WHEN context = 'friend' THEN p.junka
WHEN context = 'group' THEN p.junkb
WHEN context = 'both' THEN p.junka || ' ' || p.junkb
END
FROM posting p
where p.author_id in (SELECT f.friend_id
FROM friends f
WHERE f.user_id = ?)
OR p.group_id in (SELECT m.group_id
FROM memberships m
WHERE m.user_id = ?);

#4Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Jonathan Vanasco (#3)
Re: newsfeed type query

On 4/28/15 6:57 PM, Jonathan Vanasco wrote:

The relevant structure is:

posting:
id
timestamp_publish
group_id__in
user_id__author
friends:
user_id__a
user_id__b

memberships:
user_id
group_id
role_id

Try this...

SELECT ...
, f.user_id__b IS NOT NULL AS in_friends
, m.user_id IS NOT NULL AS in_group
FROM posting p
LEFT JOIN friends f ON( f.user_id__b = p.user_id__author )
LEFT JOIN memberships m ON( m.group_id = p.group_id__in )
WHERE is_published AND timestamp_publish ...
AND (
f.user_id__a = 57
OR ( m.user_id = 57 AND m.group_id IN (1,2,3) )
)

I'm not sure how fast it'll be though. I suspect your best bet is to put
the UNION approach inside a set returning function; that way the ugly is
contained in one place.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#5Ladislav Lenart
lenartlad@volny.cz
In reply to: Jonathan Vanasco (#1)
Re: newsfeed type query

Hello.

On 29.4.2015 00:26, Jonathan Vanasco wrote:

I'm trying to upgrade some code that powers a newfeed type stream, and hoping someone can offer some insight on better ways to structure some parts of the query

The part that has me stumped right now...

There are several criteria for why something could appear in a stream. for example, here are 2 handling a posting:

* a posting by a friend
* a posting in a group

the general way I've handled this so far has been simple:

select * from posting where author_id in (select friend_id from friends where user_id = ?) or group_id in (select group_id from memberships where user_id = ?);

now i need to pull in the context of the match (friend, group, both), but I can't figure out how to do this cleanly.

1. if i just add 'case' statements to the select to note the origin, those subselects run again. (ie, the same subquery is executed twice)
2. if i structure this as a union (and note the origin with a string), it takes a lot more work to integrate and sort the 2 separate selects ( eg "select id, timestamp, 'by-friend'" unioned with "in-group")

does anyone have ideas on other approaches to structuring this?

Dunno if this is optimal (most probably not), but it might be of some help
(WARNING: not tested at all):

with
posting_ids as (
select
t.posting_id,
bool_or(t.from_friend) as from_friend,
bool_or(t.grom_group) as from_group
from (
select
posting.id as posting_id,
true as from_friend,
false as from_group
from posting
where posting.author_id in (
select friend_id from friends where user_id = ?
)
union all
select
posting.id as posting_id,
false as from_friend,
true as from_group
from posting
where group_id in (
select group_id from memberships where user_id = ?
)
) t
group by t.posting_id
)
select
posting.*,
posting_ids.from_friend,
posting_ids.from_group,
posting_ids.from_friend or posting_ids.from_group as from_any,
posting_ids.from_friend and posting_ids.from_group as from_both
from
posting
join posting_ids on posting.id = posting_ids.posting_id

Ladislav Lenart

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#6Ladislav Lenart
lenartlad@volny.cz
In reply to: Jonathan Vanasco (#3)
Re: newsfeed type query

Hello.

On 29.4.2015 01:57, Jonathan Vanasco wrote:

Sorry, I was trying to ask something very abstract as I have similar situations
on multiple groups of queries/tables (and they're all much more complex).

I'm on pg 9.3

The relevant structure is:

posting:
id
timestamp_publish
group_id__in
user_id__author

friends:
user_id__a
user_id__b

memberships:
user_id
group_id
role_id

-- working sql
CREATE TABLE groups(
id SERIAL NOT NULL PRIMARY KEY
);
CREATE TABLE users(
id SERIAL NOT NULL PRIMARY KEY
);
CREATE TABLE friends (
user_id__a INT NOT NULL REFERENCES users( id ),
user_id__b INT NOT NULL REFERENCES users( id )
);
CREATE TABLE memberships (
user_id INT NOT NULL REFERENCES users( id ),
group_id INT NOT NULL REFERENCES groups( id ),
role_id INT NOT NULL
);
CREATE TABLE posting (
id SERIAL NOT NULL,
timestamp_publish timestamp not null,
group_id__in INT NOT NULL REFERENCES groups(id),
user_id__author INT NOT NULL REFERENCES users(id),
is_published BOOL
);

The output that I'm trying to get is:
posting.id
{the context of the select}
posting.timestamp_publish (this may need to get correlated into other queries)

These approaches had bad performance:

-- huge selects / memory
-- it needs to load everything from 2 tables before it limits
EXPLAIN ANALYZE
SELECT id, feed_context FROM (
SELECT id, timestamp_publish, 'in-group' AS feed_context FROM posting
WHERE (
group_id__in IN (SELECT group_id FROM memberships WHERE user_id = 57
AND role_id IN (1,2,3))
AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP
AT TIME ZONE 'UTC')
)
UNION
SELECT id, timestamp_publish, 'by-user' AS feed_context FROM posting
WHERE (
user_id__author IN (SELECT user_id__b FROM friends WHERE user_id__a
= 57)
AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP
AT TIME ZONE 'UTC')
)
) AS feed
ORDER BY timestamp_publish DESC
LIMIT 10
;

I think you can propagate ORDER BY and LIMIT also to the subqueries of the
UNION, i.e.:

select...
from (
(
select...
from posting
where... -- friends
order by timestamp_publish desc
limit 10
) union (
(
select...
from posting
where... -- groups
order by timestamp_publish desc
limit 10
)
) as feed
order by timestamp_publish desc
limit 10

That might behave better.

Ladislav Lenart

-- selects minimized, but repetitive subqueries
SELECT
id,
CASE
WHEN group_id__in IN (SELECT group_id FROM memberships WHERE user_id
= 57 AND role_id IN (1,2,3)) THEN True
ELSE NULL
END AS feed_context_group,
CASE
WHEN user_id__author IN (SELECT user_id__b FROM friends WHERE
user_id__a = 57) THEN True
ELSE NULL
END AS feed_context_user
FROM posting
WHERE (
group_id__in IN (SELECT group_id FROM memberships WHERE user_id = 57
AND role_id IN (1,2,3))
OR
user_id__author IN (SELECT user_id__b FROM friends WHERE user_id__a
= 57)
)
AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP AT
TIME ZONE 'UTC')
ORDER BY timestamp_publish DESC
LIMIT 10
;

On Apr 28, 2015, at 6:56 PM, Melvin Davidson wrote:

Since you very nicely DID NOT provide the pg version, O/S or table
structure(s), which is what you should do REGARDLESS of the
type of question (it's just the smart and polite thing to do when asking for
help) The best I can suggest is:
SELECT
CASE WHEN context = 'friend' THEN p.junka
WHEN context = 'group' THEN p.junkb
WHEN context = 'both' THEN p.junka || ' ' || p.junkb
END
FROM posting p
where p.author_id in (SELECT f.friend_id
FROM friends f
WHERE f.user_id = ?)
OR p.group_id in (SELECT m.group_id
FROM memberships m
WHERE m.user_id = ?);

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#7Melvin Davidson
melvin6925@gmail.com
In reply to: Ladislav Lenart (#6)
Re: newsfeed type query

I see others have responded with suggestions to improve query performance,
but one thing I noticed when you gave the data structure is there are no
no primary keys defined for friends or posting, neither are there any
indexes.
Was that an omission? If not, then please note that PostgreSQL is a
_relational_
database and it is critical to have primary keys and additional indexes for
data integrity
and performance.

FYI, defining a foreign key in a table does not automatically generate an
associated
index.

I therefore suggest you do the following to improve performance.

ALTER TABLE posting
ADD CONSTRAINT posting_pk PRIMARY KEY ( id );

ALTER TABLE friends
ADD CONSTRAINT friends_pk PRIMARY KEY ( user_id__a, user_id__b );

ALTER TABLE membership
ADD CONSTRAINT membership_pk PRIMARY KEY ( user_id, group_id );

CREATE INDEX posting_group_id__in_idx
on posting
USING BTREE ( group_id__in );

ANALYZE posting;
ANALYZE friends;
ANALYZE membership;

On Wed, Apr 29, 2015 at 6:54 AM, Ladislav Lenart <lenartlad@volny.cz> wrote:

Hello.

On 29.4.2015 01:57, Jonathan Vanasco wrote:

Sorry, I was trying to ask something very abstract as I have similar

situations

on multiple groups of queries/tables (and they're all much more complex).

I'm on pg 9.3

The relevant structure is:

posting:
id
timestamp_publish
group_id__in
user_id__author

friends:
user_id__a
user_id__b

memberships:
user_id
group_id
role_id

-- working sql
CREATE TABLE groups(
id SERIAL NOT NULL PRIMARY KEY
);
CREATE TABLE users(
id SERIAL NOT NULL PRIMARY KEY
);
CREATE TABLE friends (
user_id__a INT NOT NULL REFERENCES users( id ),
user_id__b INT NOT NULL REFERENCES users( id )
);
CREATE TABLE memberships (
user_id INT NOT NULL REFERENCES users( id ),
group_id INT NOT NULL REFERENCES groups( id ),
role_id INT NOT NULL
);
CREATE TABLE posting (
id SERIAL NOT NULL,
timestamp_publish timestamp not null,
group_id__in INT NOT NULL REFERENCES groups(id),
user_id__author INT NOT NULL REFERENCES users(id),
is_published BOOL
);

The output that I'm trying to get is:
posting.id
{the context of the select}
posting.timestamp_publish (this may need to get correlated into

other queries)

These approaches had bad performance:

-- huge selects / memory
-- it needs to load everything from 2 tables before it limits
EXPLAIN ANALYZE
SELECT id, feed_context FROM (
SELECT id, timestamp_publish, 'in-group' AS feed_context FROM

posting

WHERE (
group_id__in IN (SELECT group_id FROM memberships WHERE

user_id = 57

AND role_id IN (1,2,3))
AND (is_published = True AND timestamp_publish <=

CURRENT_TIMESTAMP

AT TIME ZONE 'UTC')
)
UNION
SELECT id, timestamp_publish, 'by-user' AS feed_context FROM

posting

WHERE (
user_id__author IN (SELECT user_id__b FROM friends WHERE

user_id__a

= 57)
AND (is_published = True AND timestamp_publish <=

CURRENT_TIMESTAMP

AT TIME ZONE 'UTC')
)
) AS feed
ORDER BY timestamp_publish DESC
LIMIT 10
;

I think you can propagate ORDER BY and LIMIT also to the subqueries of the
UNION, i.e.:

select...
from (
(
select...
from posting
where... -- friends
order by timestamp_publish desc
limit 10
) union (
(
select...
from posting
where... -- groups
order by timestamp_publish desc
limit 10
)
) as feed
order by timestamp_publish desc
limit 10

That might behave better.

Ladislav Lenart

-- selects minimized, but repetitive subqueries
SELECT
id,
CASE
WHEN group_id__in IN (SELECT group_id FROM memberships WHERE

user_id

= 57 AND role_id IN (1,2,3)) THEN True
ELSE NULL
END AS feed_context_group,
CASE
WHEN user_id__author IN (SELECT user_id__b FROM friends WHERE
user_id__a = 57) THEN True
ELSE NULL
END AS feed_context_user
FROM posting
WHERE (
group_id__in IN (SELECT group_id FROM memberships WHERE

user_id = 57

AND role_id IN (1,2,3))
OR
user_id__author IN (SELECT user_id__b FROM friends WHERE

user_id__a

= 57)
)
AND (is_published = True AND timestamp_publish <=

CURRENT_TIMESTAMP AT

TIME ZONE 'UTC')
ORDER BY timestamp_publish DESC
LIMIT 10
;

On Apr 28, 2015, at 6:56 PM, Melvin Davidson wrote:

Since you very nicely DID NOT provide the pg version, O/S or table
structure(s), which is what you should do REGARDLESS of the
type of question (it's just the smart and polite thing to do when

asking for

help) The best I can suggest is:
SELECT
CASE WHEN context = 'friend' THEN p.junka
WHEN context = 'group' THEN p.junkb
WHEN context = 'both' THEN p.junka || ' ' || p.junkb
END
FROM posting p
where p.author_id in (SELECT f.friend_id
FROM friends f
WHERE f.user_id = ?)
OR p.group_id in (SELECT m.group_id
FROM memberships m
WHERE m.user_id = ?);

--
*Melvin Davidson*
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.

#8Jonathan Vanasco
postgres@2xlp.com
In reply to: Ladislav Lenart (#6)
Re: newsfeed type query

Thanks all! These point me in much better directions!

Jim Nasby's approach to selecting an expression addressed some things (SELECT f.useraccount_id_b IS NOT NULL AS in_friends)

Ladislav Lenart's usage of the CTE is also of a different format that I've used in the past.

I think i'll be able to patch together some performance improvements now, that will last until the database structure changes.

On Apr 29, 2015, at 6:54 AM, Ladislav Lenart wrote:

I think you can propagate ORDER BY and LIMIT also to the subqueries of the
UNION, i.e.:

It behaves a lot better, but doesn't give me the resultset I need. Older data from one subquery is favored to newer data from another

I use a similar approach on another part of this application -- where the effect on the resultset isn't as pronounced.
On that query there are over 100 million total stream events. Not using an inner limit runs the query in 7 minutes; limiting the inner subquery to 1MM runs in 70 seconds... and limiting to 10k is around 100ms.

On Apr 29, 2015, at 10:16 AM, Melvin Davidson wrote:

I see others have responded with suggestions to improve query performance,
but one thing I noticed when you gave the data structure is there are no
no primary keys defined for friends or posting, neither are there any indexes.
Was that an omission?

This was a quick functional example to illustrate. The real tables are slightly different but do have pkeys ( 'id' is a bigserial, relationship tables (friends, memberships) use a composite key ). They are aggressively indexed and reindexed on various columns for query performance. sometimes we create an extra index that has multiple columns or partial-columns to make make scans index-only.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#9Ladislav Lenart
lenartlad@volny.cz
In reply to: Jonathan Vanasco (#8)
Re: newsfeed type query

Hello.

On 29.4.2015 17:27, Jonathan Vanasco wrote:

Thanks all! These point me in much better directions!

Jim Nasby's approach to selecting an expression addressed some things (SELECT f.useraccount_id_b IS NOT NULL AS in_friends)

Ladislav Lenart's usage of the CTE is also of a different format that I've used in the past.

I think i'll be able to patch together some performance improvements now, that will last until the database structure changes.

On Apr 29, 2015, at 6:54 AM, Ladislav Lenart wrote:

I think you can propagate ORDER BY and LIMIT also to the subqueries of the
UNION, i.e.:

It behaves a lot better, but doesn't give me the resultset I need. Older data from one subquery is favored to newer data from another

Hmm, I don't understand why it should behave like that. Imagine the following
postings (ts is a relative timestamp):

posting ts context
p0 0 friend
p10 10 group
p20 20 friend
p30 30 group
p40 40 friend
p50 50 group
p60 60 friend

and let's say the LIMIT is 2. Then:
* The first subquery (for friends) should return p60 and p40 (in DESC order).
* The second subquery (for groups) should return p50 and p30 (in DESC order).
* The UNION should return p60 and p50.

Could you please explain to me the error(s) in my reasoning?

Thank you,

Ladislav Lenart

I use a similar approach on another part of this application -- where the effect on the resultset isn't as pronounced.
On that query there are over 100 million total stream events. Not using an inner limit runs the query in 7 minutes; limiting the inner subquery to 1MM runs in 70 seconds... and limiting to 10k is around 100ms.

On Apr 29, 2015, at 10:16 AM, Melvin Davidson wrote:

I see others have responded with suggestions to improve query performance,
but one thing I noticed when you gave the data structure is there are no
no primary keys defined for friends or posting, neither are there any indexes.
Was that an omission?

This was a quick functional example to illustrate. The real tables are slightly different but do have pkeys ( 'id' is a bigserial, relationship tables (friends, memberships) use a composite key ). They are aggressively indexed and reindexed on various columns for query performance. sometimes we create an extra index that has multiple columns or partial-columns to make make scans index-only.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#10Jonathan Vanasco
postgres@2xlp.com
In reply to: Ladislav Lenart (#9)
Re: newsfeed type query

On Apr 29, 2015, at 12:25 PM, Ladislav Lenart wrote:

Could you please explain to me the error(s) in my reasoning?

Let me just flip your list in reverse... and add in some elements (marked with a *):

posting ts context
p60 60 friend
p55 55 friend*
p54 54 friend*
p50 50 group
p50 49 group*
p50 49 group*
p40 40 friend
p30 30 group
p20 20 friend
p10 10 group
p0 0 friend

With the 2 limited subqueries, the results would be:
60F, 55F, 50G, 49G

But the "most recent" data is
50F, 55F, 54F, 50G

So we end up showing 49 which is less relevant than 54.

In some situations this isn't much of an issue, but in others it is detrimental.
For example, one of my "feeds" contains a distribution of events according-to-type that is very uneven. While "friend" and "group" might be relatively close in time to one another, "system" or other events may be months old -- and that older content gets pulled in with this style of query.

If you need to paginate the data and select the next 10 overall items, it gets even more complicated.

IIRC, the best mix of performance and "product" that I've found is do something like this:

SELECT * FROM (
SELECT a,b,c FROM table_a ORDER BY TIMESTAMP DESC LIMIT 10000;
UNION
SELECT a,b,c FROM table_b ORDER BY TIMESTAMP DESC LIMIT 10000;
) as unioned
order by unioned TIMESTAMP DESC LIMIT 100 OFFSET 0;

by creating an artificial limit on the inner queries, you can save postgres from doing a lot of intensive i/o work and memory usage (like a lot)
then, joining a few lists and sorting 20k (or even 100k) items is really cheap.
the downside is that you effectively limit the 'relevancy' of the query to whatever the inner limit is (ie, 10000 -- not the combined total of 20000), but that number can be arbitrarily high enough that it is irrelevant while still showing the right amount of content for people.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#11Ladislav Lenart
lenartlad@volny.cz
In reply to: Jonathan Vanasco (#10)
Re: newsfeed type query

On 29.4.2015 18:54, Jonathan Vanasco wrote:

On Apr 29, 2015, at 12:25 PM, Ladislav Lenart wrote:

Could you please explain to me the error(s) in my reasoning?

Let me just flip your list in reverse... and add in some elements (marked with a *):

posting ts context
p60 60 friend
p55 55 friend*
p54 54 friend*
p50 50 group
p50 49 group*
p50 49 group*
p40 40 friend
p30 30 group
p20 20 friend
p10 10 group
p0 0 friend

With the 2 limited subqueries, the results would be:
60F, 55F, 50G, 49G

But the "most recent" data is
50F, 55F, 54F, 50G

So we end up showing 49 which is less relevant than 54.

I would expect the overall query to return only 60F nad 55F as the most recent
data. No? You expect it to return 4 items when the LIMIT is only 2. Remember
that the overall query should be also ordered by ts and limited to 2.

I thought you want most recent items across all contexts and not 2 most recent
items from friends plus two most recent items from groups...

Ladislav Lenart

In some situations this isn't much of an issue, but in others it is detrimental.
For example, one of my "feeds" contains a distribution of events according-to-type that is very uneven. While "friend" and "group" might be relatively close in time to one another, "system" or other events may be months old -- and that older content gets pulled in with this style of query.

If you need to paginate the data and select the next 10 overall items, it gets even more complicated.

IIRC, the best mix of performance and "product" that I've found is do something like this:

SELECT * FROM (
SELECT a,b,c FROM table_a ORDER BY TIMESTAMP DESC LIMIT 10000;
UNION
SELECT a,b,c FROM table_b ORDER BY TIMESTAMP DESC LIMIT 10000;
) as unioned
order by unioned TIMESTAMP DESC LIMIT 100 OFFSET 0;

by creating an artificial limit on the inner queries, you can save postgres from doing a lot of intensive i/o work and memory usage (like a lot)
then, joining a few lists and sorting 20k (or even 100k) items is really cheap.
the downside is that you effectively limit the 'relevancy' of the query to whatever the inner limit is (ie, 10000 -- not the combined total of 20000), but that number can be arbitrarily high enough that it is irrelevant while still showing the right amount of content for people.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#12Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Jonathan Vanasco (#10)
Re: newsfeed type query

On 4/29/15 11:54 AM, Jonathan Vanasco wrote:

IIRC, the best mix of performance and "product" that I've found is do something like this:

SELECT * FROM (
SELECT a,b,c FROM table_a ORDER BY TIMESTAMP DESC LIMIT 10000;
UNION
SELECT a,b,c FROM table_b ORDER BY TIMESTAMP DESC LIMIT 10000;
) as unioned
order by unioned TIMESTAMP DESC LIMIT 100 OFFSET 0;

by creating an artificial limit on the inner queries, you can save postgres from doing a lot of intensive i/o work and memory usage (like a lot)
then, joining a few lists and sorting 20k (or even 100k) items is really cheap.

Only because you're using UNION. Use UNION ALL instead.

Also, you mentioned CTEs. Be aware that those are ALWAYS materialized.
Sometimes that helps performance... sometimes it hurts it horribly. I
stick with embedded subselects unless I need a specific CTE feature.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#13Jonathan Vanasco
postgres@2xlp.com
In reply to: Jim Nasby (#12)
Re: newsfeed type query

On Apr 29, 2015, at 6:50 PM, Jim Nasby wrote:

Only because you're using UNION. Use UNION ALL instead.

The difference between "union" and "union all" was negligible. the problem was in the subselect and the sheer size of the tables, even when we could handle it as an index-only scan.

On Apr 29, 2015, at 1:18 PM, Ladislav Lenart wrote:

I would expect the overall query to return only 60F nad 55F as the most recent
data. No? You expect it to return 4 items when the LIMIT is only 2. Remember
that the overall query should be also ordered by ts and limited to 2.

You're right. total mistake on my part and confusion with that. I got this query confused with the specifics of a similar one.

#14Ladislav Lenart
lenartlad@volny.cz
In reply to: Jonathan Vanasco (#13)
Re: newsfeed type query

On 30.4.2015 19:08, Jonathan Vanasco wrote:

On Apr 29, 2015, at 6:50 PM, Jim Nasby wrote:

Only because you're using UNION. Use UNION ALL instead.

The difference between "union" and "union all" was negligible. the problem was
in the subselect and the sheer size of the tables, even when we could handle it
as an index-only scan.

On Apr 29, 2015, at 1:18 PM, Ladislav Lenart wrote:

I would expect the overall query to return only 60F nad 55F as the most recent
data. No? You expect it to return 4 items when the LIMIT is only 2. Remember
that the overall query should be also ordered by ts and limited to 2.

You're right. total mistake on my part and confusion with that. I got this
query confused with the specifics of a similar one.

OK :-) Have you managed to solve the problem then? I am interested in your final
solution.

Thank you,

Ladislav Lenart

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general