MAP syntax for arrays

Started by Ildar Musinalmost 8 years ago16 messageshackers
Jump to latest
#1Ildar Musin
i.musin@postgrespro.ru

Hello hackers,

Recently I was working with sql arrays in postgres and it turned out
that postgres doesn't have such very convinient functional constructions
as map, reduce and filter. Currently to map function over array user has
to make a subquery like:

select u.* from
my_table,
lateral (
select array_agg(lower(elem))
from unnest(arr) as elem
) as u;

Which is not only inconvenient but not very efficient as well (see
'Demo' section below).

When I dug into the code I found that postgres already has the needed
infrastructure for implementing map for arrays; actually array coercing
already works that way (it basically maps cast function).

In the attached patch there is a simple map implementation which
introduces new expression type and syntax:

MAP(<func_name> OVER <array_expression>)

For example:

SELECT MAP(upper OVER array['one', 'two', 'three']::text[]);
?column?
-----------------
{ONE,TWO,THREE}
(1 row)

This is probably not the most useful notation and it would be better to
have syntax for mapping arbitrary expressions over array, not just
function. I'm struggling to come up with a good idea of how it should
look like. It could look something like following:

MAP(<expr> FOR <placeholder> IN <array_expressin>)

For instance:

SELECT MAP(x*2 FOR x IN array[1, 2, 3]::int[]);

Looking forward for community's suggestions!

Demo
----

Here is a small comparison between map and unnest/aggregate ways for
per-element processing of arrays. Given a table with 1K rows which
contains single column of text[] type. Each array contains 5/10/100
elements.

create table my_table (arr text[]);
insert into my_table
select array_agg(md5(random()::text))
from generate_series(1, 1000) as rows,
generate_series(1, 10) as elements
group by rows;

There are two scripts for pgbench. One for 'map' syntax:

select map(upper over arr) from my_table;

And one for unnest/aggregate:

select u.* from my_table,
lateral (
select array_agg(upper(elem))
from unnest(arr) as elem
) as u;

Results are:

elements per array | map (tps) | unnest/aggregate (tps)
--------------------+------------+------------------------
5 | 139.105359 | 74.434010
10 | 74.089743 | 43.622554
100 | 7.693000 | 5.325805

Apparently map is more efficient for small arrays. And as the size of
array increases the difference decreases.

I'll be glad to any input from the community. Thanks!

--
Ildar Musin
i.musin@postgrespro.ru

Attachments:

map_v1.patchtext/x-diff; name=map_v1.patchDownload+332-1
#2Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ildar Musin (#1)
Re: MAP syntax for arrays

On Fri, May 4, 2018 at 6:38 PM, Ildar Musin <i.musin@postgrespro.ru> wrote:

Hello hackers,

Recently I was working with sql arrays in postgres and it turned out
that postgres doesn't have such very convinient functional constructions
as map, reduce and filter. Currently to map function over array user has
to make a subquery like:

select u.* from
my_table,
lateral (
select array_agg(lower(elem))
from unnest(arr) as elem
) as u;

Which is not only inconvenient but not very efficient as well (see
'Demo' section below).

Is there a way we can improve unnest() and array_agg() to match the
performance you have specified by let's say optimizing the cases
specially when those two are used together. Identifying that may be
some work, but will not require introducing new syntax.

When I dug into the code I found that postgres already has the needed
infrastructure for implementing map for arrays; actually array coercing
already works that way (it basically maps cast function).

In the attached patch there is a simple map implementation which
introduces new expression type and syntax:

MAP(<func_name> OVER <array_expression>)

For example:

SELECT MAP(upper OVER array['one', 'two', 'three']::text[]);
?column?
-----------------
{ONE,TWO,THREE}
(1 row)

This is probably not the most useful notation and it would be better to
have syntax for mapping arbitrary expressions over array, not just
function. I'm struggling to come up with a good idea of how it should
look like. It could look something like following:

MAP(<expr> FOR <placeholder> IN <array_expressin>)

For instance:

SELECT MAP(x*2 FOR x IN array[1, 2, 3]::int[]);

Looking forward for community's suggestions!

What if the expression has more than one variable, each mapping to a
different array? What if the arrays have different lengths or worse
different dimensions? This looks like the way SRFs used to work.

Instead of introducing a new syntax, is it possible to detect that
argument to a function is an array of the same type as the argument
and apply MAP automatically? In your example, upper(arr) would detect
that the input is an array of the same type as the scalar argument
type and do array_agg(upper(arr[id1], arr[id2], ...).

elements per array | map (tps) | unnest/aggregate (tps)
--------------------+------------+------------------------
5 | 139.105359 | 74.434010
10 | 74.089743 | 43.622554
100 | 7.693000 | 5.325805

Apparently map is more efficient for small arrays. And as the size of
array increases the difference decreases.

I am afraid that the way difference is diminishing with increase in
the number of elements, unnest, array_agg combination might win for
large number of elements. Have you tried that?
If we try to improve unnest, array_agg combination for small array, we
will get consistent performance without any additional syntax.
Although, I admit that query involving unnest and array_agg is not
very readable.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ashutosh Bapat (#2)
Re: MAP syntax for arrays

Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:

Is there a way we can improve unnest() and array_agg() to match the
performance you have specified by let's say optimizing the cases
specially when those two are used together. Identifying that may be
some work, but will not require introducing new syntax.

+1. The first thing I thought on seeing this proposal was "I wonder
how long it will be before the SQL committee introduces some syntax
that uses the MAP keyword and breaks this".

ISTM the planner could be taught to notice the combination of unnest
and array_agg and produce a special output plan from that.

It is, however, fair to wonder whether this is worth our time to
optimize. I've not noticed a lot of people complaining about
performance of this sort of thing, at least not since we fixed
array_agg to not be O(N^2).

regards, tom lane

#4Ildar Musin
i.musin@postgrespro.ru
In reply to: Tom Lane (#3)
Re: MAP syntax for arrays

Hello Tom, Ashutosh,

On 07.05.2018 18:16, Tom Lane wrote:

Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:

Is there a way we can improve unnest() and array_agg() to match
the performance you have specified by let's say optimizing the
cases specially when those two are used together. Identifying that
may be some work, but will not require introducing new syntax.

+1. The first thing I thought on seeing this proposal was "I wonder
how long it will be before the SQL committee introduces some syntax
that uses the MAP keyword and breaks this".

ISTM the planner could be taught to notice the combination of unnest
and array_agg and produce a special output plan from that.

It is, however, fair to wonder whether this is worth our time to
optimize. I've not noticed a lot of people complaining about
performance of this sort of thing, at least not since we fixed
array_agg to not be O(N^2).

The main point of this patch was about convenience; the performance
thing came out later just as a side effect :) Many users are familiar
with "map/reduce/filter" concept that many languages (not only
functional ones) utilized. And my though was that it would be great to
have those in postgres too.

Apparently there is also a case when unnest/array_agg may not produce
the result we're looking for. I mean multidimensional arrays. E.g.

select array_agg(x * 2)
from unnest(array[[1,2],[3,4],[5,6]]) as x;
array_agg
-----------------
{2,4,6,8,10,12}
(1 row)

select map(x * 2 for x in array[[1,2],[3,4],[5,6]]);
?column?
-----------------------
{{2,4},{6,8},{10,12}}
(1 row)

array_agg produces plain arrays no matter what the input was.

There is a new version of the patch in the attachment which introduces
arbitrary per-element expressions (instead of single function call). So
now user can specify a placeholder representing one element of the array
and use it in the expressions. Like following:

select map (pow(x, 2) - 1 for x in array[1,2,3]);
?column?
---------------
{1,3,7,15,31}
(1 row)

--
Ildar Musin
i.musin@postgrespro.ru

Attachments:

map_v2.patchtext/x-diff; name=map_v2.patchDownload+411-1
#5Ildar Musin
i.musin@postgrespro.ru
In reply to: Ildar Musin (#4)
Re: MAP syntax for arrays

On 08.05.2018 15:49, Ildar Musin wrote:

select map (pow(x, 2) - 1 for x in array[1,2,3]);

Sorry, the example should be:

select map (pow(2, x) - 1 for x in array[1,2,3,4,5]);
?column?
---------------
{1,3,7,15,31}
(1 row)

--
Ildar Musin
i.musin@postgrespro.ru

#6Chapman Flack
chap@anastigmatix.net
In reply to: Ildar Musin (#5)
Re: MAP syntax for arrays

On 05/08/2018 08:57 AM, Ildar Musin wrote:

select map (pow(2, x) - 1 for x in array[1,2,3,4,5]);

I wonder how efficient an implementation would be possible strictly
as a function, without grammar changes?

While PostgreSQL certainly has extensions to and divergences from
standard SQL syntax, some historical and some recent, it seems like
there ought to be a highish bar for adding new ones; or, looking at it
another way, has this feature been proposed to the SQL committee?
It doesn't sound like a bad idea, and supporting new syntax for it
would be an easier call it if it were known to be in an SQL draft
somewhere.

-Chap

#7Chapman Flack
chap@anastigmatix.net
In reply to: Chapman Flack (#6)
Re: MAP syntax for arrays

On 05/08/2018 09:19 AM, Chapman Flack wrote:

While PostgreSQL certainly has extensions to and divergences from
standard SQL syntax, some historical and some recent, it seems like
there ought to be a highish bar for adding new ones; or, looking at it
another way, has this feature been proposed to the SQL committee?
It doesn't sound like a bad idea, and supporting new syntax for it
would be an easier call it if it were known to be in an SQL draft
somewhere.

There seems to have been some work at Databricks adding higher-order
function syntax to their SQL; they've chosen the name 'transform'
for 'map', and also provided 'exists', 'filter', and 'aggregate'.

https://databricks.com/blog/2017/05/24/working-with-nested-data-using-higher-order-functions-in-sql-on-databricks.html

-Chap

#8Peter Eisentraut
peter_e@gmx.net
In reply to: Chapman Flack (#6)
Re: MAP syntax for arrays

On 5/8/18 09:19, Chapman Flack wrote:

On 05/08/2018 08:57 AM, Ildar Musin wrote:

select map (pow(2, x) - 1 for x in array[1,2,3,4,5]);

I wonder how efficient an implementation would be possible strictly
as a function, without grammar changes?

Yeah, you can pass a function to another function (using regprocedure or
just oid), so this should be possible entirely in user space.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#9Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Eisentraut (#8)
Re: MAP syntax for arrays

Peter Eisentraut wrote:

On 5/8/18 09:19, Chapman Flack wrote:

On 05/08/2018 08:57 AM, Ildar Musin wrote:

select map (pow(2, x) - 1 for x in array[1,2,3,4,5]);

I wonder how efficient an implementation would be possible strictly
as a function, without grammar changes?

Yeah, you can pass a function to another function (using regprocedure or
just oid), so this should be possible entirely in user space.

How would you invoke it? It seems you'd be forced to use EXECUTE in a
plpgsql function, or a C function.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10Andreas Karlsson
andreas.karlsson@percona.com
In reply to: Ildar Musin (#4)
Re: MAP syntax for arrays

On 05/08/2018 02:49 PM, Ildar Musin wrote:

The main point of this patch was about convenience; the performance
thing came out later just as a side effect :) Many users are familiar
with "map/reduce/filter" concept that many languages (not only
functional ones) utilized. And my though was that it would be great to
have those in postgres too.

Map is a feature I have quite often wished to have, but I am not sure it
is quite useful enough to be worth adding our own proprietary syntax. It
would be a pain if the SQL committee started using MAP for something.

Andreas

#11Andrew Gierth
andrew@tao11.riddles.org.uk
In reply to: Andreas Karlsson (#10)
Re: MAP syntax for arrays

"Andreas" == Andreas Karlsson <andreas@proxel.se> writes:

Andreas> It would be a pain if the SQL committee started using MAP for
Andreas> something.

They already did - MAP is a non-reserved keyword in sql2016, used at
least with <user-defined ordering definition>. Can't see any obvious
conflict with use in expressions, but I haven't checked all the
references.

--
Andrew (irc:RhodiumToad)

#12Peter Eisentraut
peter_e@gmx.net
In reply to: Alvaro Herrera (#9)
Re: MAP syntax for arrays

On 5/8/18 10:18, Alvaro Herrera wrote:

Peter Eisentraut wrote:

On 5/8/18 09:19, Chapman Flack wrote:

On 05/08/2018 08:57 AM, Ildar Musin wrote:

select map (pow(2, x) - 1 for x in array[1,2,3,4,5]);

I wonder how efficient an implementation would be possible strictly
as a function, without grammar changes?

Yeah, you can pass a function to another function (using regprocedure or
just oid), so this should be possible entirely in user space.

How would you invoke it? It seems you'd be forced to use EXECUTE in a
plpgsql function, or a C function.

Yes, I was thinking about a C function.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#12)
Re: MAP syntax for arrays

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:

On 5/8/18 10:18, Alvaro Herrera wrote:

How would you invoke it? It seems you'd be forced to use EXECUTE in a
plpgsql function, or a C function.

Yes, I was thinking about a C function.

The thing actually implementing MAP would presumably be in C,
so this doesn't seem like a problem technically. But having to
create a function seems like a big usability stumbling block,
probably a big enough one to make the "select array_agg(expression)
from unnest(something)" approach more attractive.

I do see the usability benefit of a dedicated MAP syntax --- I'm
just afraid of getting out in front of the SQL committee on such
things. I doubt that it's enough nicer than the sub-select way
to justify risking future standards-compliance issues.

Realistically, we're talking about this:

select a, b, (select array_agg(x*2) from unnest(arraycol) x)
from ...

versus something on the order of this:

select a, b, map(x*2 over x from arraycol)
from ...

Yeah, it's a bit shorter, but not that much ... and there's a lot
more you can do with the sub-select syntax, eg add a WHERE filter.

regards, tom lane

#14Ildar Musin
i.musin@postgrespro.ru
In reply to: Peter Eisentraut (#8)
Re: MAP syntax for arrays

On 08.05.2018 17:15, Peter Eisentraut wrote:

On 5/8/18 09:19, Chapman Flack wrote:

On 05/08/2018 08:57 AM, Ildar Musin wrote:

select map (pow(2, x) - 1 for x in array[1,2,3,4,5]);

I wonder how efficient an implementation would be possible
strictly as a function, without grammar changes?

Yeah, you can pass a function to another function (using regprocedure
or just oid), so this should be possible entirely in user space.

The problem with this approach is that extension should either have
single map() function with input and output type of anyarray which
cannot be used when user needs to map int[] to text[] for example. Or
the other way there should be a set of map functions for different
intput/output types.

Another thing is that this approach is not as versatile since user need
to create a function before running map, while with the proposed patch
they could run arbitrary expression over the array directly.

--
Ildar Musin
i.musin@postgrespro.ru

#15Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrew Gierth (#11)
Re: MAP syntax for arrays

Andrew Gierth wrote:

"Andreas" == Andreas Karlsson <andreas@proxel.se> writes:

Andreas> It would be a pain if the SQL committee started using MAP for
Andreas> something.

They already did - MAP is a non-reserved keyword in sql2016, used at
least with <user-defined ordering definition>. Can't see any obvious
conflict with use in expressions, but I haven't checked all the
references.

Ah, so in SQL2011 (and 2016) there already is a designator for routines,
which was a thing we were lacking previously, as I recall.

<specific routine designator> ::=
SPECIFIC <routine type> <specific name>
| <routine type> <member name> [ FOR <schema-resolved user-defined type name> ]

<routine type> ::=
ROUTINE
| FUNCTION
| PROCEDURE
| [ INSTANCE | STATIC | CONSTRUCTOR ] METHOD

<member name> ::=
<member name alternatives> [ <data type list> ]

<member name alternatives> ::=
<schema qualified routine name>
| <method name>

<data type list> ::= <left paren> [ <data type> [ { <comma> <data type> }... ] ] <right paren>

[elsewhere]

<specific name> ::=
<schema qualified name>

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#16Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Ildar Musin (#14)
Re: MAP syntax for arrays

On 08/05/18 18:11, Ildar Musin wrote:

On 08.05.2018 17:15, Peter Eisentraut wrote:

On 5/8/18 09:19, Chapman Flack wrote:

On 05/08/2018 08:57 AM, Ildar Musin wrote:

select map (pow(2, x) - 1 for x in array[1,2,3,4,5]);

I wonder how efficient an implementation would be possible
strictly as a function, without grammar changes?

Yeah, you can pass a function to another function (using regprocedure
or just oid), so this should be possible entirely in user space.

The problem with this approach is that extension should either have
single map() function with input and output type of anyarray which
cannot be used when user needs to map int[] to text[] for example. Or
the other way there should be a set of map functions for different
intput/output types.

Another thing is that this approach is not as versatile since user need
to create a function before running map, while with the proposed patch
they could run arbitrary expression over the array directly.

The consensus on this seems to be that we don't want this. Yeah, it's a
handy syntax, but it's not that much better than using a subselect or a
writing a user-defined function. And there's risk of future conflicts
with the SQL standard. I'll mark this as "Rejected" in the commitfest.

- Heikki