BUG #14020: row_number() over(partition by order by) - weird behavior

Started by boyko yordanovabout 10 years ago5 messagesbugs
Jump to latest
#1boyko yordanov
b.yordanov2@gmail.com

The following bug has been logged on the website:

Bug reference: 14020
Logged by: Boyko
Email address: b.yordanov2@gmail.com
PostgreSQL version: 9.5.0
Operating system: CentOS 6.4
Description:

Hi,

db=# \d+ offers_past_data;
Table "public.offers_past_data"
Column | Type | Modifiers | Storage | Stats target | Description
----------+---------+-----------+---------+--------------+-------------
id | integer | not null | plain | |
feed | integer | not null | plain | |
position | integer | not null | plain | |

db=# \d+ offers_testing;
Table
"public.offers_testing"
Column | Type |
Modifiers | Storage | Stats target | Description
--------------------+-----------------------------+-----------------------------------------------------+----------+--------------+-------------
id | bigint | not null default
nextval('offers_id_seq'::regclass) | plain | |
grossprice | numeric(11,2) |
| main | |
feed | integer | not null
| plain | |
product | integer |
| plain | |

db=# update offers_past_data a set position = b.position from (select id,
feed, row_number() over(partition by product order by grossprice asc) as
position from offers_testing) b where a.id = b.id and a.feed = b.feed and
a.position <> b.position;
UPDATE 0

This should update every row in offers_past_data when its “position”
changes. In the example above no changes were introduced since the last run
so nothing is updated (expected).

db=# select count(*) from offers_testing where product = 2;
count
-------
99
(1 row)

So there are 99 offers for product 2.

Getting a single offer:

db=# select id,grossprice from offers_testing where product = 2 limit 1;
id | grossprice
---------+------------
4127918 | 5000.00
(1 row)

Updating its grossprice:

db=# update offers_testing set grossprice = 20 where id = 4127918;
UPDATE 1

Now when executing the first query again I expect that no more than 99 rows
get updated in offers_past_data since this is the maximum amount of
positions that would be affected by offer 4127918 grossprice change.

db=# update offers_past_data a set position = b.position from (select id,
feed, row_number() over(partition by product order by grossprice asc) as
position from offers_testing) b where a.id = b.id and a.feed = b.feed and
a.position <> b.position;
UPDATE 104

104 rows get updated.

Executing the same query again a few minutes later (no changes meanwhile in
either table):

db=# update offers_past_data a set position = b.position from (select id,
feed, row_number() over(partition by product order by grossprice asc) as
position from offers_testing) b where a.id = b.id and a.feed = b.feed and
a.position <> b.position;
UPDATE 28058

This time it updates 28058 rows.

This is a test environment and nothing reads or writes to these tables.

Is this a bug or am I missing something obvious?

Regards,
Boyko

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#2David G. Johnston
david.g.johnston@gmail.com
In reply to: boyko yordanov (#1)
Re: BUG #14020: row_number() over(partition by order by) - weird behavior

On Mon, Mar 14, 2016 at 1:43 PM, <b.yordanov2@gmail.com> wrote:

db=# update offers_past_data a set position = b.position from (select id,
feed, row_number() over(partition by product order by grossprice asc) as
position from offers_testing) b where a.id = b.id and a.feed = b.feed and
a.position <> b.position;
UPDATE 0

​Updating offers_past_data

This should update every row in offers_past_data when its “position”
changes. In the example above no changes were introduced since the last run
so nothing is updated (expected).

db=# select count(*) from offers_testing where product = 2;
count
-------
99
(1 row)

So there are 99 offers for product 2.

​Counting offers_testing​

Getting a single offer:

db=# select id,grossprice from offers_testing where product = 2 limit 1;
id | grossprice
---------+------------
4127918 | 5000.00
(1 row)

​Counting offers_testing​

Updating its grossprice:

db=# update offers_testing set grossprice = 20 where id = 4127918;
UPDATE 1

​Updating offers_testing​

Now when executing the first query again I expect that no more than 99 rows

get updated in offers_past_data since this is the maximum amount of
positions that would be affected by offer 4127918 grossprice change.

You haven't proven to us that a single row in offers_testing cannot match
more than one row in offers_past_data. Assuming a 1-to-many situation the
update count for offers_past_data can definitely be more than the number of
rows returned by the sub-query.
​​

db=# update offers_past_data a set position = b.position from (select id,
feed, row_number() over(partition by product order by grossprice asc) as
position from offers_testing) b where a.id = b.id and a.feed = b.feed and
a.position <> b.position;
UPDATE 104

104 rows get updated.

Executing the same query again a few minutes later (no changes meanwhile in
either table):

db=# update offers_past_data a set position = b.position from (select id,
feed, row_number() over(partition by product order by grossprice asc) as
position from offers_testing) b where a.id = b.id and a.feed = b.feed and
a.position <> b.position;
UPDATE 28058

This time it updates 28058 rows.

This is a test environment and nothing reads or writes to these tables.

Is this a bug or am I missing something obvious?

​Its likely data related, not a bug.

Using the "UPDATE ... RETURNING *" form should provide good insight.
Specifically, look for all rows having the same (id, feed) pair.

Also, "ORDER BY grossprice" seems inadequate. The potential for duplicates
here - which would then make the assignment of row numbers within the
product partition random - is non-zero and is a quite likely source of your
problem - along with the probable one-to-many relationship between
offers_testing and offers_past_data.

David J.

#3boyko yordanov
b.yordanov2@gmail.com
In reply to: David G. Johnston (#2)
Re: BUG #14020: row_number() over(partition by order by) - weird behavior

Hi and thanks for your time on this.

You haven't proven to us that a single row in offers_testing cannot match more than one row in offers_past_data. Assuming a 1-to-many situation the update count for offers_past_data can definitely be more than the number of rows returned by the sub-query.

It is a one-to-one relationship between the tables as there is a primary key on (id, feed) on both tables (which I missed to point out):

Indexes:
"offers_past_data_id_feed" PRIMARY KEY, btree (id, feed)

Indexes:
"offers_testing_id_feed" PRIMARY KEY, btree (id, feed)

I assume that this guarantees that a single grossprice change in offers_testing where product = 2 translates to up to (count(id,feed) where product = 2) position updates in both offers_testing and offers_past_data.

Adding "returning *" to the questionable query, it seems to update rows that are not related to product 2 (and on my opinion should not have changed positions).

Also, "ORDER BY grossprice" seems inadequate. The potential for duplicates here - which would then make the assignment of row numbers within the product partition random - is non-zero and is a quite likely source of your problem - along with the probable one-to-many relationship between offers_testing and offers_past_data.

Dismissing the one-to-many relationship suggestion as it isn't the case.

Your point on duplicate grossprices is valid, but I believe that if I update a single grossprice, even in the case of duplicate grossprices, this should not translate in more position updates than the rows in the modified product partition. And in offers_testing there are no more than 148 rows per product partition:

db=# select max(partition_count) from (select count(*) over (partition by product) as partition_count from offers_testing) sq;
max
-----
148
(1 row)

And yet the update query updates 28k records for some reason, most of which are outside the modified product partition.

Boyko

--

Boyko

2016-03-15 6:00 GMT+02:00 David G. Johnston <david.g.johnston@gmail.com <mailto:david.g.johnston@gmail.com>>:
On Mon, Mar 14, 2016 at 1:43 PM, <b.yordanov2@gmail.com <mailto:b.yordanov2@gmail.com>> wrote:

db=# update offers_past_data a set position = b.position from (select id,
feed, row_number() over(partition by product order by grossprice asc) as
position from offers_testing) b where a.id <http://a.id/&gt; = b.id <http://b.id/&gt; and a.feed = b.feed and
a.position <> b.position;
UPDATE 0

​Updating offers_past_data

This should update every row in offers_past_data when its “position”
changes. In the example above no changes were introduced since the last run
so nothing is updated (expected).

db=# select count(*) from offers_testing where product = 2;
count
-------
99
(1 row)
So there are 99 offers for product 2.

​Counting offers_testing​

Getting a single offer:

db=# select id,grossprice from offers_testing where product = 2 limit 1;
id | grossprice
---------+------------
4127918 | 5000.00
(1 row)

​Counting offers_testing​

Updating its grossprice:

db=# update offers_testing set grossprice = 20 where id = 4127918;
UPDATE 1

​Updating offers_testing​

Now when executing the first query again I expect that no more than 99 rows
get updated in offers_past_data since this is the maximum amount of
positions that would be affected by offer 4127918 grossprice change.

You haven't proven to us that a single row in offers_testing cannot match more than one row in offers_past_data. Assuming a 1-to-many situation the update count for offers_past_data can definitely be more than the number of rows returned by the sub-query.
​​

db=# update offers_past_data a set position = b.position from (select id,
feed, row_number() over(partition by product order by grossprice asc) as
position from offers_testing) b where a.id <http://a.id/&gt; = b.id <http://b.id/&gt; and a.feed = b.feed and
a.position <> b.position;
UPDATE 104

104 rows get updated.

Executing the same query again a few minutes later (no changes meanwhile in
either table):

db=# update offers_past_data a set position = b.position from (select id,
feed, row_number() over(partition by product order by grossprice asc) as
position from offers_testing) b where a.id <http://a.id/&gt; = b.id <http://b.id/&gt; and a.feed = b.feed and
a.position <> b.position;
UPDATE 28058

This time it updates 28058 rows.

This is a test environment and nothing reads or writes to these tables.

Is this a bug or am I missing something obvious?

​Its likely data related, not a bug.

Using the "UPDATE ... RETURNING *" form should provide good insight. Specifically, look for all rows having the same (id, feed) pair.

Also, "ORDER BY grossprice" seems inadequate. The potential for duplicates here - which would then make the assignment of row numbers within the product partition random - is non-zero and is a quite likely source of your problem - along with the probable one-to-many relationship between offers_testing and offers_past_data.

David J.

#4boyko yordanov
b.yordanov2@gmail.com
In reply to: boyko yordanov (#3)
Re: BUG #14020: row_number() over(partition by order by) - weird behavior

Thinking further on this, I now got your point on the “duplicate grossprices is ordered randomly” suggestion.

What I missed to realize is that the update query updates *every* product partition that has reordered due to duplicate grossprice being ordered randomly, resulting in thousands of updates instead of just < 148 (or < 99 in the case of product = 2 partition).

Is there a way to ensure persistence of “over(order by duplicate_columns)” ordering, except for ordering by a second (or even third) column?

Overall this now makes sense and is obviously not a bug. I apologize for bothering this list (instead of ‘general’) w/ this issue.

Thanks once again!

Regards,
Boyko

Show quoted text

On Mar 15, 2016, at 10:20 AM, Boyko Yordanov <b.yordanov2@gmail.com> wrote:

Hi and thanks for your time on this.

You haven't proven to us that a single row in offers_testing cannot match more than one row in offers_past_data. Assuming a 1-to-many situation the update count for offers_past_data can definitely be more than the number of rows returned by the sub-query.

It is a one-to-one relationship between the tables as there is a primary key on (id, feed) on both tables (which I missed to point out):

Indexes:
"offers_past_data_id_feed" PRIMARY KEY, btree (id, feed)

Indexes:
"offers_testing_id_feed" PRIMARY KEY, btree (id, feed)

I assume that this guarantees that a single grossprice change in offers_testing where product = 2 translates to up to (count(id,feed) where product = 2) position updates in both offers_testing and offers_past_data.

Adding "returning *" to the questionable query, it seems to update rows that are not related to product 2 (and on my opinion should not have changed positions).

Also, "ORDER BY grossprice" seems inadequate. The potential for duplicates here - which would then make the assignment of row numbers within the product partition random - is non-zero and is a quite likely source of your problem - along with the probable one-to-many relationship between offers_testing and offers_past_data.

Dismissing the one-to-many relationship suggestion as it isn't the case.

Your point on duplicate grossprices is valid, but I believe that if I update a single grossprice, even in the case of duplicate grossprices, this should not translate in more position updates than the rows in the modified product partition. And in offers_testing there are no more than 148 rows per product partition:

db=# select max(partition_count) from (select count(*) over (partition by product) as partition_count from offers_testing) sq;
max
-----
148
(1 row)

And yet the update query updates 28k records for some reason, most of which are outside the modified product partition.

Boyko

--

Boyko

2016-03-15 6:00 GMT+02:00 David G. Johnston <david.g.johnston@gmail.com <mailto:david.g.johnston@gmail.com>>:
On Mon, Mar 14, 2016 at 1:43 PM, <b.yordanov2@gmail.com <mailto:b.yordanov2@gmail.com>> wrote:

db=# update offers_past_data a set position = b.position from (select id,
feed, row_number() over(partition by product order by grossprice asc) as
position from offers_testing) b where a.id <http://a.id/&gt; = b.id <http://b.id/&gt; and a.feed = b.feed and
a.position <> b.position;
UPDATE 0

​Updating offers_past_data

This should update every row in offers_past_data when its “position”
changes. In the example above no changes were introduced since the last run
so nothing is updated (expected).

db=# select count(*) from offers_testing where product = 2;
count
-------
99
(1 row)
So there are 99 offers for product 2.

​Counting offers_testing​

Getting a single offer:

db=# select id,grossprice from offers_testing where product = 2 limit 1;
id | grossprice
---------+------------
4127918 | 5000.00
(1 row)

​Counting offers_testing​

Updating its grossprice:

db=# update offers_testing set grossprice = 20 where id = 4127918;
UPDATE 1

​Updating offers_testing​

Now when executing the first query again I expect that no more than 99 rows
get updated in offers_past_data since this is the maximum amount of
positions that would be affected by offer 4127918 grossprice change.

You haven't proven to us that a single row in offers_testing cannot match more than one row in offers_past_data. Assuming a 1-to-many situation the update count for offers_past_data can definitely be more than the number of rows returned by the sub-query.
​​

db=# update offers_past_data a set position = b.position from (select id,
feed, row_number() over(partition by product order by grossprice asc) as
position from offers_testing) b where a.id <http://a.id/&gt; = b.id <http://b.id/&gt; and a.feed = b.feed and
a.position <> b.position;
UPDATE 104

104 rows get updated.

Executing the same query again a few minutes later (no changes meanwhile in
either table):

db=# update offers_past_data a set position = b.position from (select id,
feed, row_number() over(partition by product order by grossprice asc) as
position from offers_testing) b where a.id <http://a.id/&gt; = b.id <http://b.id/&gt; and a.feed = b.feed and
a.position <> b.position;
UPDATE 28058

This time it updates 28058 rows.

This is a test environment and nothing reads or writes to these tables.

Is this a bug or am I missing something obvious?

​Its likely data related, not a bug.

Using the "UPDATE ... RETURNING *" form should provide good insight. Specifically, look for all rows having the same (id, feed) pair.

Also, "ORDER BY grossprice" seems inadequate. The potential for duplicates here - which would then make the assignment of row numbers within the product partition random - is non-zero and is a quite likely source of your problem - along with the probable one-to-many relationship between offers_testing and offers_past_data.

David J.

#5David G. Johnston
david.g.johnston@gmail.com
In reply to: boyko yordanov (#4)
Re: BUG #14020: row_number() over(partition by order by) - weird behavior

On Tue, Mar 15, 2016 at 2:04 AM, Boyko Yordanov <b.yordanov2@gmail.com>
wrote:

Thinking further on this, I now got your point on the “duplicate
grossprices is ordered randomly” suggestion.

What I missed to realize is that the update query updates *every* product
partition that has reordered due to duplicate grossprice being ordered
randomly, resulting in thousands of updates instead of just < 148 (or < 99
in the case of product = 2 partition).

Is there a way to ensure persistence of “over(order by duplicate_columns)”
ordering, except for ordering by a second (or even third) column?

​No, you need to have enough columns for deterministic order.

Dave