crosstab/repivot...any interest?

Started by Merlin Moncurealmost 7 years ago8 messages
#1Merlin Moncure
mmoncure@gmail.com

Hackers,

We have a strong need to make a variant to the crosstab interface so
that data that is pivoted one way would be sent through a crosstab
like function so that it would be pivoted another way. For example,
if you had

row 0: a1, a2, a3, k1, c1, c2, ...
row 1: a1, a2, a3, k2, c1, c2, ...
row 2: a1, a2, a3, k3, c1, c2, ...
...

where 'a' columns are uninteresting attribute columns, 'k' is the
dimension we want to pivot on, and c1->cN would be stacked vertically,
so that we'd end up with,
row 0: a1, a2, a3, c1, k1, k2, ...
row 1: a1, a2, a3, c2, k1, k2, ...
row 2: a1, a2, a3, c3, k1, k2, ...

There are various SQL level approaches to this but they tend to be
imperformant with large datasets so that I think a crosstab-like C
implementation ought to be able to do better (or at least I hope so)
since you have to cross product rows and columns in such a way that
you can get a clean join. Cribbing from tablefunc.c I don't think
this is a terrible challenge to do in hash table style.

Questions on the table:
*) Has anyone done anything like this or know of any current implementations?
*) Would there be any interest in expanding tablefunc along these lines?

thanks in advance,
merlin

#2David Fetter
david@fetter.org
In reply to: Merlin Moncure (#1)
Re: crosstab/repivot...any interest?

On Fri, Jan 25, 2019 at 02:21:55PM -0600, Merlin Moncure wrote:

Hackers,

We have a strong need to make a variant to the crosstab interface so
that data that is pivoted one way would be sent through a crosstab
like function so that it would be pivoted another way. For example,
if you had

row 0: a1, a2, a3, k1, c1, c2, ...
row 1: a1, a2, a3, k2, c1, c2, ...
row 2: a1, a2, a3, k3, c1, c2, ...
...

where 'a' columns are uninteresting attribute columns, 'k' is the
dimension we want to pivot on, and c1->cN would be stacked vertically,
so that we'd end up with,
row 0: a1, a2, a3, c1, k1, k2, ...
row 1: a1, a2, a3, c2, k1, k2, ...
row 2: a1, a2, a3, c3, k1, k2, ...

There are various SQL level approaches to this but they tend to be
imperformant with large datasets so that I think a crosstab-like C
implementation ought to be able to do better (or at least I hope so)
since you have to cross product rows and columns in such a way that
you can get a clean join. Cribbing from tablefunc.c I don't think
this is a terrible challenge to do in hash table style.

Questions on the table:
*) Has anyone done anything like this or know of any current implementations?
*) Would there be any interest in expanding tablefunc along these lines?

There's something in SQL:2016 that I read as crosstabs, or at least as
enabling crosstabs.
https://www.iso.org/standard/69776.html

If we're going to put work into crosstabs, it seems to me that the
"we" needs to be the project as a whole, and the work should be, to
the extent reasonable, toward standard compliance.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#3Merlin Moncure
mmoncure@gmail.com
In reply to: David Fetter (#2)
Re: crosstab/repivot...any interest?

On Fri, Jan 25, 2019 at 3:16 PM David Fetter <david@fetter.org> wrote:

On Fri, Jan 25, 2019 at 02:21:55PM -0600, Merlin Moncure wrote:

Hackers,

We have a strong need to make a variant to the crosstab interface so
that data that is pivoted one way would be sent through a crosstab
like function so that it would be pivoted another way. For example,
if you had

row 0: a1, a2, a3, k1, c1, c2, ...
row 1: a1, a2, a3, k2, c1, c2, ...
row 2: a1, a2, a3, k3, c1, c2, ...
...

where 'a' columns are uninteresting attribute columns, 'k' is the
dimension we want to pivot on, and c1->cN would be stacked vertically,
so that we'd end up with,
row 0: a1, a2, a3, c1, k1, k2, ...
row 1: a1, a2, a3, c2, k1, k2, ...
row 2: a1, a2, a3, c3, k1, k2, ...

There are various SQL level approaches to this but they tend to be
imperformant with large datasets so that I think a crosstab-like C
implementation ought to be able to do better (or at least I hope so)
since you have to cross product rows and columns in such a way that
you can get a clean join. Cribbing from tablefunc.c I don't think
this is a terrible challenge to do in hash table style.

Questions on the table:
*) Has anyone done anything like this or know of any current implementations?
*) Would there be any interest in expanding tablefunc along these lines?

There's something in SQL:2016 that I read as crosstabs, or at least as
enabling crosstabs.
https://www.iso.org/standard/69776.html

If we're going to put work into crosstabs, it seems to me that the
"we" needs to be the project as a whole, and the work should be, to
the extent reasonable, toward standard compliance.

Interesting. Do you see that the spec (it makes my brain hurt) can
handle that kind of repivoting?

merlin

#4David Fetter
david@fetter.org
In reply to: Merlin Moncure (#3)
Re: crosstab/repivot...any interest?

On Fri, Jan 25, 2019 at 04:31:00PM -0600, Merlin Moncure wrote:

On Fri, Jan 25, 2019 at 3:16 PM David Fetter <david@fetter.org> wrote:

On Fri, Jan 25, 2019 at 02:21:55PM -0600, Merlin Moncure wrote:

Hackers,

We have a strong need to make a variant to the crosstab interface so
that data that is pivoted one way would be sent through a crosstab
like function so that it would be pivoted another way. For example,
if you had

row 0: a1, a2, a3, k1, c1, c2, ...
row 1: a1, a2, a3, k2, c1, c2, ...
row 2: a1, a2, a3, k3, c1, c2, ...
...

where 'a' columns are uninteresting attribute columns, 'k' is the
dimension we want to pivot on, and c1->cN would be stacked vertically,
so that we'd end up with,
row 0: a1, a2, a3, c1, k1, k2, ...
row 1: a1, a2, a3, c2, k1, k2, ...
row 2: a1, a2, a3, c3, k1, k2, ...

There are various SQL level approaches to this but they tend to be
imperformant with large datasets so that I think a crosstab-like C
implementation ought to be able to do better (or at least I hope so)
since you have to cross product rows and columns in such a way that
you can get a clean join. Cribbing from tablefunc.c I don't think
this is a terrible challenge to do in hash table style.

Questions on the table:
*) Has anyone done anything like this or know of any current implementations?
*) Would there be any interest in expanding tablefunc along these lines?

There's something in SQL:2016 that I read as crosstabs, or at least as
enabling crosstabs.
https://www.iso.org/standard/69776.html

If we're going to put work into crosstabs, it seems to me that the
"we" needs to be the project as a whole, and the work should be, to
the extent reasonable, toward standard compliance.

Interesting. Do you see that the spec (it makes my brain hurt) can
handle that kind of repivoting?

I believe the constructs can nest and/or refer to each other, so yes.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#5Morris de Oryx
morrisdeoryx@gmail.com
In reply to: David Fetter (#4)
Re: crosstab/repivot...any interest?

Hello, I'm not a C coder and can't help....but I *love* cross-tab/pivot
tables. They're the best, and just fantastic for preparing data to feed
into various analysis tools. The tablefunc module is helpful, but a bit
awkward to use (as I remember it.)

From a user's point of view, I high-performance cross-tab generator would
be just fantastic.

As I understand it, this is what's involved in a pivot:

1. Identify rows that should be *grouped* (consolidated.)
2. Distinguish the value that identifies each *derived* column.
3. Distinguish the value that *identifies* each row-column value.
4. *Collapse* the rows, *build* the columns, and *populate* the 'cells'
with data.

In an ideal world, you would be able to perform different grouping
operations. Such as count, sum, avg, etc.

If there's a way to do this in a system-wide and standards-pointing way, so
much the better.

*Apologies* if I'm violating list etiquette by jumping in here. I've been
lurking on several Postgres lists for a bit and picking up interesting
details every day. If I've been Unintentionally and Cluelessly Off, I'm
find with being told.

On Sat, Jan 26, 2019 at 10:49 AM David Fetter <david@fetter.org> wrote:

Show quoted text

On Fri, Jan 25, 2019 at 04:31:00PM -0600, Merlin Moncure wrote:

On Fri, Jan 25, 2019 at 3:16 PM David Fetter <david@fetter.org> wrote:

On Fri, Jan 25, 2019 at 02:21:55PM -0600, Merlin Moncure wrote:

Hackers,

We have a strong need to make a variant to the crosstab interface so
that data that is pivoted one way would be sent through a crosstab
like function so that it would be pivoted another way. For example,
if you had

row 0: a1, a2, a3, k1, c1, c2, ...
row 1: a1, a2, a3, k2, c1, c2, ...
row 2: a1, a2, a3, k3, c1, c2, ...
...

where 'a' columns are uninteresting attribute columns, 'k' is the
dimension we want to pivot on, and c1->cN would be stacked

vertically,

so that we'd end up with,
row 0: a1, a2, a3, c1, k1, k2, ...
row 1: a1, a2, a3, c2, k1, k2, ...
row 2: a1, a2, a3, c3, k1, k2, ...

There are various SQL level approaches to this but they tend to be
imperformant with large datasets so that I think a crosstab-like C
implementation ought to be able to do better (or at least I hope so)
since you have to cross product rows and columns in such a way that
you can get a clean join. Cribbing from tablefunc.c I don't think
this is a terrible challenge to do in hash table style.

Questions on the table:
*) Has anyone done anything like this or know of any current

implementations?

*) Would there be any interest in expanding tablefunc along these

lines?

There's something in SQL:2016 that I read as crosstabs, or at least as
enabling crosstabs.
https://www.iso.org/standard/69776.html

If we're going to put work into crosstabs, it seems to me that the
"we" needs to be the project as a whole, and the work should be, to
the extent reasonable, toward standard compliance.

Interesting. Do you see that the spec (it makes my brain hurt) can
handle that kind of repivoting?

I believe the constructs can nest and/or refer to each other, so yes.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#6Merlin Moncure
mmoncure@gmail.com
In reply to: Morris de Oryx (#5)

On Fri, Jan 25, 2019 at 9:14 PM Morris de Oryx <morrisdeoryx@gmail.com>
wrote:

Hello, I'm not a C coder and can't help....but I love cross-tab/pivot

tables. They're the best, and just fantastic for preparing data to feed
into various analysis tools. The tablefunc module is helpful, but a bit
awkward to use (as I remember it.)

From a user's point of view, I high-performance cross-tab generator would

be just fantastic.

As I understand it, this is what's involved in a pivot:

1. Identify rows that should be grouped (consolidated.)
2. Distinguish the value that identifies each derived column.
3. Distinguish the value that identifies each row-column value.
4. Collapse the rows, build the columns, and populate the 'cells' with

data.

In an ideal world, you would be able to perform different grouping

operations. Such as count, sum, avg, etc.

If there's a way to do this in a system-wide and standards-pointing way,

so much the better.

Apologies if I'm violating list etiquette by jumping in here. I've been

lurking on several Postgres lists for a bit and picking up interesting
details every day. If I've been Unintentionally and Cluelessly Off, I'm
find with being told.

No worries, sir! Apologies on the late reply. I've made some headway on
this item. Waiting for postgres to implement the SQL standard pivoting
(regardless if it implements the cases I need) is not an option for my
personal case. I can't use the SQL approach either as it's very slow and
imposing some scaling limits that need to work around in the short run.

My strategy is to borrow [steal] from crosstab_hash and make a new version
called repivot which takes an arleady pivoted data set and repivots it
against an identified column. Most of the code can be shared with
tablefunc so ideally this could live as an amendment to that extension.
That might not happen though, so I'm going to package it as a separate
extension (removing the majority of tablefunc that I don't need) and submit
it to this group for consideration.

If we punt, it'll end up as a private extension or live the life of an
Orphan in pgxn. If there's some interest here, we can consider a new
contrib extension (which I personally rate very unlikely) or recasting as
an extra routine to tablefunc. Any way we slice it, huge thanks to Joe
Conway&co for giving us such an awesome function to work with all these
years (not to mention the strategic plr language). SRF crosstab() is still
somewhat baroque, but it still fills a niche that nothing else implements.

The interface I'm looking at is:
SELECT repivot(
query TEXT,
static_attributes INT, /* number of static attributes that are unchanged
around key; we need this in our usages */
attribute_query TEXT); /* query that draws up the pivoted attribute list
*/

The driving query is expected to return 0+ static attributes which are
essentially simply pasted to the output. The next two columns are the key
column and the pivoting column. So if you had three attributes, the input
set would be:

a1, a2, a3, k1, p, v1...vn

Where the coordinates v and p would exchange. I need to get this done
quickly and so am trying to avoid more abstracted designs (maybe multi part
keys should be supported through...this is big limitation of crosstab
albeit with some obnoxious work arounds).

merlin

#7Joe Conway
mail@joeconway.com
In reply to: Merlin Moncure (#6)
Re: crosstab/repivot...any interest?

On 2/25/19 8:34 PM, Merlin Moncure wrote:

No worries, sir! Apologies on the late reply.  I've made some headway on
this item.  Waiting for postgres to implement the SQL standard pivoting
(regardless if it implements the cases I need) is not an option for my
personal case. I can't use the SQL approach either as it's very slow and
imposing some scaling limits that need to work around in the short run.

My strategy is to borrow [steal] from crosstab_hash and make a new
version called repivot which takes an arleady pivoted data set and
repivots it against an identified column.   Most of the code can be
shared with tablefunc so ideally this could live as an amendment to that
extension.  That might not happen though, so I'm going to package it as
a separate extension (removing the majority of tablefunc that I don't
need) and submit it to this group for consideration.

I can't promise there will be consensus to add to tablefunc, but I am
not opposed and will be happy to try to help you make that happen to the
extent I can find the spare cycles.

If we punt, it'll end up as a private extension or live the life of an
Orphan in pgxn.  If there's some interest here, we can consider a new
contrib extension (which I personally rate very unlikely) or recasting
as an extra routine to tablefunc.  Any way we slice it, huge thanks to
Joe Conway&co for giving us such an awesome function to work with all
these years (not to mention the strategic plr language).  SRF crosstab()
is still somewhat baroque, but it still fills a niche that nothing else
implements.

The interface I'm looking at is:
SELECT repivot(
  query TEXT,
  static_attributes INT,  /* number of static attributes that are
unchanged around key; we need this in our usages */
  attribute_query  TEXT); /* query that draws up the pivoted attribute
list */

The driving query is expected to return 0+ static attributes which are
essentially simply pasted to the output. The next two columns are the
key column and the pivoting column.   So if you had three attributes,
the input set would be:

a1, a2, a3, k1, p, v1...vn

Where the coordinates v and p would exchange.  I need to get this done
quickly and so am trying to avoid more abstracted designs (maybe multi
part keys should be supported through...this is big limitation of
crosstab albeit with some obnoxious work arounds).

Perhaps not enough coffee yet, but I am not sure I fully grok the design
here. A fully fleshed out example would be useful.

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development

#8Merlin Moncure
mmoncure@gmail.com
In reply to: Joe Conway (#7)
Re: crosstab/repivot...any interest?

On Tue, Feb 26, 2019 at 8:31 AM Joe Conway <mail@joeconway.com> wrote:

On 2/25/19 8:34 PM, Merlin Moncure wrote:

The interface I'm looking at is:
SELECT repivot(
query TEXT,
static_attributes INT, /* number of static attributes that are
unchanged around key; we need this in our usages */
attribute_query TEXT); /* query that draws up the pivoted attribute
list */

The driving query is expected to return 0+ static attributes which are
essentially simply pasted to the output. The next two columns are the
key column and the pivoting column. So if you had three attributes,
the input set would be:

a1, a2, a3, k1, p, v1...vn

Where the coordinates v and p would exchange. I need to get this done
quickly and so am trying to avoid more abstracted designs (maybe multi
part keys should be supported through...this is big limitation of
crosstab albeit with some obnoxious work arounds).

Perhaps not enough coffee yet, but I am not sure I fully grok the design
here. A fully fleshed out example would be useful.

Thanks for your interest.

A lot of our application data is organized in columnar type formats.
Typically, the analytical reports on the back of the database are
organized withe output columns being a KPI or a unit of time.

KPI Columns:
Country,Make,Model,TimeSlice,Sales,Units
US,Ford,Fusion,Y2018Q1,165MM$,250k
US,Ford,Fusion,Y2018Q2,172MM$,261k
US,Ford,Fusion,Y2018Q3,183MM$,268k
...

Time Columns:
Country,Make,Mode, KPI,Y2018Q1,Y2018Q2,Y2018Q3
US,Ford,Fusion,Y2018Q1,Sales,165MM$,172MM$,183MM$
US,Ford,Fusion,Y2018Q2,Units,250k,261k,268k

SELECT repivot(
<query pulling KPI format>,
1, /* only one static attribute */
2, /* number of key columns, only needed if multi part keys are supported */
<query developing list of time columns>);

In this example supporting multi-part, the repivot column is the 4th,
assumed to be attributes first, keys second, pivot column third, data
block last. Multi column pivots might be considered but I don't need
them and that's a significant expansion in scope, so I'm avoiding that
consideration.

What we need to do is convert from the first format above (which is
how data is materialized on table) to the second format. Just like as
within crosstab, if the key column(s) are ordered into the function we
can exploit that structure for an efficient implementation.

'Ford' and 'Fusion' are key columns; 'US' is uninteresting attribute
column (it is unambiguously represented by 'Ford', 'Fusion') and so
would be simply be copied from input to output set.

Our data is materialized in KPI style (which is pretty typical) but we
have large analytical reports that want it with columns representing
units of time. This is farily common in the industry IMO.

There are various pure SQL approaches to do this but they tend to
force you to build out the X,Y columns and then join it all back
together; this can spiral out of control quite quickly from a
performance standpoint. A crosstab style crawling cursor over ordered
set ought to get the job done very efficiently. tablefunc doesn't
support multi part keys today, and the workaround is that you have so
stuff a key into a formatted string and than parse it out for
downstream joining, some needs of having to do that joining might
hopefully be eliminated by allowing the attribute columns to be copied
through.

merlin