Row pattern recognition

Started by Tatsuo Ishiialmost 3 years ago308 messageshackers

t-ishii@sra.co.jp

almost 3 years ago

Attached is a PoC patch to implement "Row pattern recognition" (RPR)
in SQL:2016 (I know SQL:2023 is already out, but I don't have access
to it). Actually SQL:2016 defines RPR in two places[1]https://sqlperformance.com/2019/04/t-sql-queries/row-pattern-recognition-in-sql:

Feature R010, “Row pattern recognition: FROM clause”
Feature R020, “Row pattern recognition: WINDOW clause”

The patch includes partial support for R020 part.

- What is RPR?

RPR provides a way to search series of data using regular expression
patterns. Suppose you have a stock database.

CREATE TABLE stock (
company TEXT,
tdate DATE,
price BIGINT);

You want to find a "V-shaped" pattern: i.e. price goes down for 1 or
more days, then goes up for 1 or more days. If we try to implement the
query in PostgreSQL, it could be quite complex and inefficient.

RPR provides convenient way to implement the query.

SELECT company, tdate, price, rpr(price) OVER w FROM stock
WINDOW w AS (
PARTITION BY company
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
PATTERN (START DOWN+ UP+)
DEFINE
START AS TRUE,
UP AS price > PREV(price),
DOWN AS price < PREV(price)
);

"PATTERN" and "DEFINE" are the key clauses of RPR. DEFINE defines 3
"row pattern variables" namely START, UP and DOWN. They are associated
with logical expressions namely "TRUE", "price > PREV(price)", and
"price < PREV(price)". Note that "PREV" function returns price column
in the previous row. So, UP is true if price is higher than previous
day. On the other hand, DOWN is true if price is lower than previous
day. PATTERN uses the row pattern variables to create a necessary
pattern. In this case, the first row is always match because START is
always true, and second or more rows match with "UP" ('+' is a regular
expression representing one or more), and subsequent rows match with
"DOWN".

Here is the sample output.

company | tdate | price | rpr
----------+------------+-------+------
company1 | 2023-07-01 | 100 |
company1 | 2023-07-02 | 200 | 200 -- 200->150->140->150
company1 | 2023-07-03 | 150 | 150 -- 150->140->150
company1 | 2023-07-04 | 140 |
company1 | 2023-07-05 | 150 | 150 -- 150->90->110->130
company1 | 2023-07-06 | 90 |
company1 | 2023-07-07 | 110 |
company1 | 2023-07-08 | 130 |
company1 | 2023-07-09 | 120 |
company1 | 2023-07-10 | 130 |

rpr shows the first row if all the patterns are satisfied. In the
example above 200, 150, 150 are the cases. Other rows are shown as
NULL. For example, on 2023-07-02 price starts with 200, then goes down
to 150 then 140 but goes up 150 next day.

As far as I know, only Oracle implements RPR (only R010. R020 is not
implemented) among OSS/commercial RDBMSs. There are a few DWH software
having RPR. According to [2]https://link.springer.com/article/10.1007/s13222-022-00404-3 they are Snowflake and MS Stream
Analytics. It seems Trino is another one[3]https://trino.io/docs/current/sql/pattern-recognition-in-window.html -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp.

- Note about the patch

The current implementation is not only a subset of the standard, but
is different from it in some places. The example above is written as
follows according to the standard:

SELECT company, tdate, startprice OVER w FROM stock
WINDOW w AS (
PARTITION BY company
MEASURES
START.price AS startprice
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
PATTERN (START DOWN+ UP+)
DEFINE
START AS TRUE,
UP AS UP.price > PREV(UP.price),
DOWN AS DOWN.price < PREV(DOWN.price)
);

Notice that rpr(price) is written as START.price and startprice in the
standard. MEASURES defines variable names used in the target list used
with "OVER window". As OVER only allows functions in PostgreSQL, I had
to make up a window function "rpr" which performs the row pattern
recognition task. I was not able to find a way to implement
expressions like START.price (START is not a table alias). Any
suggestion is greatly appreciated.

The differences from the standard include:

o MEASURES is not supported
o SUBSET is not supported
o FIRST, LAST, CLASSIFIER are not supported
o PREV/NEXT in the standard accept more complex arguments
o Regular expressions other than "+" are not supported
o Only AFTER MATCH SKIP TO NEXT ROW is supported (if AFTER MATCH is
not specified, AFTER MATCH SKIP TO NEXT ROW is assumed. In the
standard AFTER MATCH SKIP PAST LAST ROW is assumed in this case). I
have a plan to implement AFTER MATCH SKIP PAST LAST ROW though.
o INITIAL or SEEK are not supported ((behaves as if INITIAL is specified)
o Aggregate functions associated with window clause using RPR do not respect RPR

It seems RPR in the standard is quite complex. I think we can start
with a small subset of RPR then we could gradually enhance the
implementation.

Comments and suggestions are welcome.

[1]: https://sqlperformance.com/2019/04/t-sql-queries/row-pattern-recognition-in-sql
[2]: https://link.springer.com/article/10.1007/s13222-022-00404-3
[3]: https://trino.io/docs/current/sql/pattern-recognition-in-window.html -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

Vik Fearing

vik@postgresfriends.org

almost 3 years ago

In reply to: Tatsuo Ishii (#1)

Re: Row pattern recognition

On 6/25/23 14:05, Tatsuo Ishii wrote:

Attached is a PoC patch to implement "Row pattern recognition" (RPR)
in SQL:2016 (I know SQL:2023 is already out, but I don't have access
to it). Actually SQL:2016 defines RPR in two places[1]:

Feature R010, “Row pattern recognition: FROM clause”
Feature R020, “Row pattern recognition: WINDOW clause”

The patch includes partial support for R020 part.

I have been dreaming of and lobbying for someone to pick up this
feature. I will be sure to review it from a standards perspective and
will try my best to help with the technical aspect, but I am not sure to
have the qualifications for that.

THANK YOU!

(I know SQL:2023 is already out, but I don't have access to it)

If you can, try to get ISO/IEC 19075-5 which is a guide to RPR instead
of just its technical specification.

https://www.iso.org/standard/78936.html

- What is RPR?

RPR provides a way to search series of data using regular expression
patterns. Suppose you have a stock database.

CREATE TABLE stock (
company TEXT,
tdate DATE,
price BIGINT);

You want to find a "V-shaped" pattern: i.e. price goes down for 1 or
more days, then goes up for 1 or more days. If we try to implement the
query in PostgreSQL, it could be quite complex and inefficient.

RPR provides convenient way to implement the query.

SELECT company, tdate, price, rpr(price) OVER w FROM stock
WINDOW w AS (
PARTITION BY company
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
PATTERN (START DOWN+ UP+)
DEFINE
START AS TRUE,
UP AS price > PREV(price),
DOWN AS price < PREV(price)
);

"PATTERN" and "DEFINE" are the key clauses of RPR. DEFINE defines 3
"row pattern variables" namely START, UP and DOWN. They are associated
with logical expressions namely "TRUE", "price > PREV(price)", and
"price < PREV(price)". Note that "PREV" function returns price column
in the previous row. So, UP is true if price is higher than previous
day. On the other hand, DOWN is true if price is lower than previous
day. PATTERN uses the row pattern variables to create a necessary
pattern. In this case, the first row is always match because START is
always true, and second or more rows match with "UP" ('+' is a regular
expression representing one or more), and subsequent rows match with
"DOWN".

Here is the sample output.

company | tdate | price | rpr
----------+------------+-------+------
company1 | 2023-07-01 | 100 |
company1 | 2023-07-02 | 200 | 200 -- 200->150->140->150
company1 | 2023-07-03 | 150 | 150 -- 150->140->150
company1 | 2023-07-04 | 140 |
company1 | 2023-07-05 | 150 | 150 -- 150->90->110->130
company1 | 2023-07-06 | 90 |
company1 | 2023-07-07 | 110 |
company1 | 2023-07-08 | 130 |
company1 | 2023-07-09 | 120 |
company1 | 2023-07-10 | 130 |

rpr shows the first row if all the patterns are satisfied. In the
example above 200, 150, 150 are the cases. Other rows are shown as
NULL. For example, on 2023-07-02 price starts with 200, then goes down
to 150 then 140 but goes up 150 next day.

I don't understand this. RPR in a window specification limits the
window to the matched rows, so this looks like your rpr() function is
just the regular first_value() window function that we already have?

As far as I know, only Oracle implements RPR (only R010. R020 is not
implemented) among OSS/commercial RDBMSs. There are a few DWH software
having RPR. According to [2] they are Snowflake and MS Stream
Analytics. It seems Trino is another one[3].

I thought DuckDB had it already, but it looks like I was wrong.

- Note about the patch

The current implementation is not only a subset of the standard, but
is different from it in some places. The example above is written as
follows according to the standard:

SELECT company, tdate, startprice OVER w FROM stock
WINDOW w AS (
PARTITION BY company
MEASURES
START.price AS startprice
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
PATTERN (START DOWN+ UP+)
DEFINE
START AS TRUE,
UP AS UP.price > PREV(UP.price),
DOWN AS DOWN.price < PREV(DOWN.price)
);

Notice that rpr(price) is written as START.price and startprice in the
standard. MEASURES defines variable names used in the target list used
with "OVER window". As OVER only allows functions in PostgreSQL, I had
to make up a window function "rpr" which performs the row pattern
recognition task. I was not able to find a way to implement
expressions like START.price (START is not a table alias). Any
suggestion is greatly appreciated.

As in your example, you cannot have START.price outside of the window
specification; it can only go in the MEASURES clause. Only startprice
is allowed outside and it gets its qualification from the OVER. Using
w.startprice might have been better but that would require window names
to be in the same namespace as range tables.

This currently works in Postgres:

SELECT RANK() OVER w
FROM (VALUES (1)) AS w (x)
WINDOW w AS (ORDER BY w.x);

The differences from the standard include:

o MEASURES is not supported
o FIRST, LAST, CLASSIFIER are not supported
o PREV/NEXT in the standard accept more complex arguments
o INITIAL or SEEK are not supported ((behaves as if INITIAL is specified)

Okay, for now.

o SUBSET is not supported

Is this because you haven't done it yet, or because you ran into
problems trying to do it?

o Regular expressions other than "+" are not supported

This is what I had a hard time imagining how to do while thinking about
it. The grammar is so different here and we allow many more operators
(like "?" which is also the standard parameter symbol). People more
expert than me will have to help here.

o Only AFTER MATCH SKIP TO NEXT ROW is supported (if AFTER MATCH is
not specified, AFTER MATCH SKIP TO NEXT ROW is assumed. In the
standard AFTER MATCH SKIP PAST LAST ROW is assumed in this case). I
have a plan to implement AFTER MATCH SKIP PAST LAST ROW though.

In this case, we should require the user to specify AFTER MATCH SKIP TO
NEXT ROW so that behavior doesn't change when we implement the standard
default. (Your patch might do this already.)

o Aggregate functions associated with window clause using RPR do not respect RPR

I do not understand what this means.

It seems RPR in the standard is quite complex. I think we can start
with a small subset of RPR then we could gradually enhance the
implementation.

I have no problem with that as long as we don't paint ourselves into a
corner.

Comments and suggestions are welcome.

I have not looked at the patch yet, but is the reason for doing R020
before R010 because you haven't done the MEASURES clause yet?

In any case, I will be watching this with a close eye, and I am eager to
help in any way I can.
--
Vik Fearing

Row pattern recognition

Attachments:

Attachments:

Attachments:

Attachments: