Do people favor having a matrix data type?
These days, I've been working on a data type that represents a matrix in
the mathematical sense. Would people favor having this data type in core,
not only in my extension?
Like oidvector or int2vector, it would be an array type with a few
restrictions:
- 2 dimensional only
- never contains NULL
- element type is real or float
- no custom array lower bounds (always 1)
A mathematical vector is a special case of a matrix: either 1xN or Nx1.
We can define various operators between matrices and scalars, like:
matrix + matrix -> matrix
matrix - scalar -> matrix
matrix * matrix -> matrix
transpose(matrix) -> matrix
...
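For illustration only, the intended semantics of the restricted type could be sketched in plain Python (the Matrix class and its method names are hypothetical, not part of any patch):

```python
class Matrix:
    """Sketch of the proposed restricted array type:
    2-D rectangular only, no NULLs, float elements."""

    def __init__(self, rows):
        assert rows and all(len(r) == len(rows[0]) for r in rows), "2-D rectangular only"
        assert all(v is not None for r in rows for v in r), "NULLs not allowed"
        self.rows = [[float(v) for v in r] for r in rows]

    def __add__(self, other):            # matrix + matrix -> matrix
        return Matrix([[a + b for a, b in zip(r1, r2)]
                       for r1, r2 in zip(self.rows, other.rows)])

    def sub_scalar(self, s):             # matrix - scalar -> matrix
        return Matrix([[v - s for v in r] for r in self.rows])

    def __matmul__(self, other):         # matrix * matrix -> matrix
        cols = list(zip(*other.rows))
        return Matrix([[sum(a * b for a, b in zip(r, c)) for c in cols]
                       for r in self.rows])

    def transpose(self):                 # transpose(matrix) -> matrix
        return Matrix([list(c) for c in zip(*self.rows)])
```

The real operators would of course be C functions over the array representation; this only pins down the intended behavior.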
What are people's thoughts?
If the overall consensus is welcoming, I'll put together a patch.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi
2016-05-25 4:52 GMT+02:00 Kouhei Kaigai <kaigai@ak.jp.nec.com>:
These days, I've been working on a data type that represents a matrix in
the mathematical sense. Would people favor having this data type in core,
not only in my extension?
Like oidvector or int2vector, it would be an array type with a few
restrictions:
- 2 dimensional only
- never contains NULL
- element type is real or float
- no custom array lower bounds (always 1)
A mathematical vector is a special case of a matrix: either 1xN or Nx1.
We can define various operators between matrices and scalars, like:
matrix + matrix -> matrix
matrix - scalar -> matrix
matrix * matrix -> matrix
transpose(matrix) -> matrix
...
What are people's thoughts?
This is specialized for data processing - so it should be placed on PGXN.
Regards
Pavel
On 25 May 2016 at 03:52, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
These days, I've been working on a data type that represents a matrix in
the mathematical sense. Would people favor having this data type in core,
not only in my extension?
If we understood the use case, it would help us decide whether to include
it or not.
Multi-dimensionality of arrays isn't always useful, so this could be good.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
-----Original Message-----
From: Simon Riggs [mailto:simon@2ndQuadrant.com]
Sent: Wednesday, May 25, 2016 4:39 PM
To: Kaigai Kouhei(海外 浩平)
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Do people favor having a matrix data type?
On 25 May 2016 at 03:52, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
These days, I've been working on a data type that represents a matrix in
the mathematical sense. Would people favor having this data type in core,
not only in my extension?
If we understood the use case, it would help us decide whether to include it or not.
Multi-dimensionality of arrays isn't always useful, so this could be good.
As you may expect, my reason for working on a matrix data type is as
groundwork for GPU acceleration, but it is not limited to that.
What I am trying to do is in-database execution of analytic algorithms,
instead of exporting the entire dataset to the client side.
My first target is k-means clustering, often used in data mining.
When we categorize N items with M attributes into k clusters, the master
data can be represented as an NxM matrix; that is equivalent to N vectors in M dimensions.
The cluster centroids are also located in the M-dimensional space, so they
can be represented as a kxM matrix; that is equivalent to k vectors in M dimensions.
The k-means algorithm requires calculating the distance from each item to
every cluster centroid, which produces an Nxk matrix, usually called the
distance matrix. Next, it updates the cluster centroids using the distance
matrix, then repeats the entire process until convergence.
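The loop described above can be sketched in plain Python (a naive illustration with hypothetical names and a naive first-k initialization; the whole point of the proposal is to run the distance-matrix step natively or on a GPU instead):

```python
def kmeans(items, k, iters=10):
    """Plain-Python sketch of the k-means loop described above.

    items: N x M nested lists (N items, M attributes).
    Returns the k x M list of cluster centroids.
    """
    # Naive deterministic initialization: first k items as centroids.
    centroids = [list(items[i]) for i in range(k)]
    for _ in range(iters):
        # Heart of the workload: the N x k (squared-)distance matrix.
        dist = [[sum((x - c) ** 2 for x, c in zip(item, cent))
                 for cent in centroids]
                for item in items]
        # Assign each item to its nearest centroid.
        assign = [row.index(min(row)) for row in dist]
        # Update each centroid as the mean of its members.
        for j in range(k):
            members = [it for it, a in zip(items, assign) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids
```

Every iteration recomputes the full N x k distance matrix, which is exactly the part that dominates runtime for large N.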
The heart of the workload is the calculation of the distance matrix. When I
tried to write the k-means algorithm using SQL + R, its performance was not sufficient (poor):
https://github.com/kaigai/toybox/blob/master/Rstat/pgsql-kmeans.r
If we had native functions to use instead of the complicated SQL
expression, it would make sense for people who are trying in-database analytics.
Also, fortunately, PostgreSQL's 2-D array format is binary compatible with
the BLAS libraries' requirements. That will allow GPUs to process large
matrices at HPC-grade performance.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On Wed, May 25, 2016 at 10:38 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 25 May 2016 at 03:52, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
These days, I've been working on a data type that represents a matrix in
the mathematical sense. Would people favor having this data type in core,
not only in my extension?
If we understood the use case, it would help us decide whether to include it or not.
Multi-dimensionality of arrays isn't always useful, so this could be good.
Many natural language and image processing methods extract feature
vectors and then use some simple distance metric, like dot product, to
calculate vector similarity. For example, we presented a latent
semantic analysis prototype at pgconf.eu 2015 that used real[] to
store the features and a dotproduct(real[], real[]) real function to
do similarity matching. However, using real[] instead of a hypothetical
realvector or realmatrix did not prove to be a huge overhead, so
overall I'm on the fence about the usefulness of a special type. Maybe a
helper function or two to validate the additional restrictions in a
check constraint would be enough.
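The kind of similarity matching Ants describes can be sketched in plain Python (illustrative only; dotproduct mirrors the name of the real[] function from the prototype, and cosine_similarity is an assumed variant, not something the prototype is said to contain):

```python
import math

def dotproduct(a, b):
    """Dot-product similarity of two equal-length feature vectors."""
    assert len(a) == len(b), "feature vectors must have equal length"
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Scale-invariant variant of the same idea."""
    return dotproduct(a, b) / math.sqrt(dotproduct(a, a) * dotproduct(b, b))
```

In the real[] approach, the equal-length and no-NULL checks above are exactly what a helper function in a CHECK constraint would enforce.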
Regards,
Ants Aasma
On Wed, May 25, 2016 at 10:38 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 25 May 2016 at 03:52, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
These days, I've been working on a data type that represents a matrix in
the mathematical sense. Would people favor having this data type in core,
not only in my extension?
If we understood the use case, it would help us decide whether to include it or not.
Multi-dimensionality of arrays isn't always useful, so this could be good.
Many natural language and image processing methods extract feature
vectors and then use some simple distance metric, like dot product, to
calculate vector similarity. For example, we presented a latent
semantic analysis prototype at pgconf.eu 2015 that used real[] to
store the features and a dotproduct(real[], real[]) real function to
do similarity matching. However, using real[] instead of a hypothetical
realvector or realmatrix did not prove to be a huge overhead, so
overall I'm on the fence about the usefulness of a special type. Maybe a
helper function or two to validate the additional restrictions in a
check constraint would be enough.
Implementing the 'matrix' data type as a domain over real[] is one option.
We can define operators on domain types, which would allow us to process a
large amount of calculation in one operation, at native binary speed.
My only concern is that a domain type is not allowed to define type casts.
If we could add casts on domains, we could define transformations from
other array types to matrix.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On Wed, May 25, 2016 at 09:10:02AM +0000, Kouhei Kaigai wrote:
-----Original Message-----
From: Simon Riggs [mailto:simon@2ndQuadrant.com]
Sent: Wednesday, May 25, 2016 4:39 PM
To: Kaigai Kouhei(海外 浩平)
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Do people favor having a matrix data type?
On 25 May 2016 at 03:52, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
These days, I've been working on a data type that represents a matrix in
the mathematical sense. Would people favor having this data type in core,
not only in my extension?
If we understood the use case, it would help us decide whether to include it or not.
Multi-dimensionality of arrays isn't always useful, so this could be good.
As you may expect, my reason for working on a matrix data type is as
groundwork for GPU acceleration, but it is not limited to that.
What I am trying to do is in-database execution of analytic algorithms,
instead of exporting the entire dataset to the client side.
My first target is k-means clustering, often used in data mining.
When we categorize N items with M attributes into k clusters, the master
data can be represented as an NxM matrix; that is equivalent to N vectors in M dimensions.
The cluster centroids are also located in the M-dimensional space, so they
can be represented as a kxM matrix; that is equivalent to k vectors in M dimensions.
The k-means algorithm requires calculating the distance from each item to
every cluster centroid, which produces an Nxk matrix, usually called the
distance matrix. Next, it updates the cluster centroids using the distance
matrix, then repeats the entire process until convergence.
The heart of the workload is the calculation of the distance matrix. When I
tried to write the k-means algorithm using SQL + R, its performance was not sufficient (poor):
https://github.com/kaigai/toybox/blob/master/Rstat/pgsql-kmeans.r
If we had native functions to use instead of the complicated SQL
expression, it would make sense for people who are trying in-database analytics.
Also, fortunately, PostgreSQL's 2-D array format is binary compatible with
the BLAS libraries' requirements. That will allow GPUs to process large
matrices at HPC-grade performance.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
Hi,
Have you looked at the Perl Data Language (PDL) under PL/Perl? It has
pretty nice support for matrix calculations.
Regards,
Ken
Kouhei Kaigai <kaigai@ak.jp.nec.com> writes:
Implementing the 'matrix' data type as a domain over real[] is one option.
We can define operators on domain types, which would allow us to process a
large amount of calculation in one operation, at native binary speed.
Don't go that way, it will cause you nothing but pain. The ambiguous-
operator resolution rules are not friendly at all to operators with domain
arguments; basically only exact matches will be recognized.
regards, tom lane
On 5/25/16 7:46 AM, Kouhei Kaigai wrote:
My only concern is that a domain type is not allowed to define type casts.
If we could add casts on domains, we could define transformations from
other array types to matrix.
I've actually wished for that in the past, as well as casting to
compound types. Having those would make it easier to mock up a real data
type for experimentation.
I strongly encourage you to talk to the MADlib community about
first-class matrix support. They currently emulate matrices via arrays.
I don't know offhand if they support NULLs in their regular matrices.
They also support a sparsematrix "type" that is actually implemented as
a table that contains coordinates and a value for each value in the
matrix. Having that as a type might also be interesting (if you're
sparse enough, that will be cheaper than the current NULL bitmap
implementation).
Related to this, Tom has mentioned in the past that perhaps we should
support abstract use of the [] construct. Currently point finds a way to
make use of [], but I think that's actually coded into the grammar.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On 5/25/16 7:46 AM, Kouhei Kaigai wrote:
My only concern is that a domain type is not allowed to define type casts.
If we could add casts on domains, we could define transformations from
other array types to matrix.
I've actually wished for that in the past, as well as casting to
compound types. Having those would make it easier to mock up a real data
type for experimentation.
I strongly encourage you to talk to the MADlib community about
first-class matrix support. They currently emulate matrices via arrays.
Some MADlib folks contacted me in the past, when I initially made PG-Strom
several years ago; however, there has been no communication with them since then.
https://madlib.incubator.apache.org/docs/v1.8/group__grp__arraysmatrix.html
According to the documentation, their approach is different from what
I'd like to implement. They use a table as if it were a matrix. Because of
that data format, we would have to rearrange the data elements to fit the
requirements of high-performance matrix libraries (like cuBLAS).
I don't know offhand if they support NULLs in their regular matrices.
They also support a sparsematrix "type" that is actually implemented as
a table that contains coordinates and a value for each value in the
matrix. Having that as a type might also be interesting (if you're
sparse enough, that will be cheaper than the current NULL bitmap
implementation).
Sparse matrices! That is an area where the current array format is at a disadvantage.
I have two ideas. HPC folks often split a large matrix into multiple
grids; a grid is typically up to a 1024x1024 matrix, for example.
If a grid consists of all-zero elements, we obviously don't need
to store the individual elements of that grid.
The other idea is compression. If most of the matrix is zero, it is ideal
data for compression, and it is easy to reconstruct it only when calculating.
Related to this, Tom has mentioned in the past that perhaps we should
support abstract use of the [] construct. Currently point finds a way to
make use of [], but I think that's actually coded into the grammar.
Yep, if we treat a 2-D array as a matrix, no special enhancement is needed
to use []. However, I'm inclined to have our own data structure for matrix,
to represent sparse matrices.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On 05/28/2016 07:12 AM, Kouhei Kaigai wrote:
Sparse matrices! That is an area where the current array format is at a disadvantage.
I have two ideas. HPC folks often split a large matrix into multiple
grids; a grid is typically up to a 1024x1024 matrix, for example.
If a grid consists of all-zero elements, we obviously don't need
to store the individual elements of that grid.
The other idea is compression. If most of the matrix is zero, it is ideal
data for compression, and it is easy to reconstruct it only when calculating.
Related to this, Tom has mentioned in the past that perhaps we should
support abstract use of the [] construct. Currently point finds a way to
make use of [], but I think that's actually coded into the grammar.
Yep, if we treat a 2-D array as a matrix, no special enhancement is needed
to use []. However, I'm inclined to have our own data structure for matrix,
to represent sparse matrices.
+1 I'm sure this would be useful for PL/R as well.
Joe
--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
-----Original Message-----
From: Joe Conway [mailto:mail@joeconway.com]
Sent: Sunday, May 29, 2016 1:40 AM
To: Kaigai Kouhei(海外 浩平); Jim Nasby; Ants Aasma; Simon Riggs
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Do people favor having a matrix data type?
On 05/28/2016 07:12 AM, Kouhei Kaigai wrote:
Sparse matrices! That is an area where the current array format is at a disadvantage.
I have two ideas. HPC folks often split a large matrix into multiple
grids; a grid is typically up to a 1024x1024 matrix, for example.
If a grid consists of all-zero elements, we obviously don't need
to store the individual elements of that grid.
The other idea is compression. If most of the matrix is zero, it is ideal
data for compression, and it is easy to reconstruct it only when calculating.
Related to this, Tom has mentioned in the past that perhaps we should
support abstract use of the [] construct. Currently point finds a way to
make use of [], but I think that's actually coded into the grammar.
Yep, if we treat a 2-D array as a matrix, no special enhancement is needed
to use []. However, I'm inclined to have our own data structure for matrix,
to represent sparse matrices.
+1 I'm sure this would be useful for PL/R as well.
Joe
It is a pretty good idea to combine PL/R and PL/CUDA (what I'm now working on)
for advanced analytics. We would be able to off-load the heavy computing
portion to the GPU, and also utilize various R functions inside the database.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On 05/28/2016 03:33 PM, Kouhei Kaigai wrote:
-----Original Message-----
From: Joe Conway [mailto:mail@joeconway.com]
Sent: Sunday, May 29, 2016 1:40 AM
To: Kaigai Kouhei(海外 浩平); Jim Nasby; Ants Aasma; Simon Riggs
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Do people favor having a matrix data type?
On 05/28/2016 07:12 AM, Kouhei Kaigai wrote:
Sparse matrices! That is an area where the current array format is at a disadvantage.
I have two ideas. HPC folks often split a large matrix into multiple
grids; a grid is typically up to a 1024x1024 matrix, for example.
If a grid consists of all-zero elements, we obviously don't need
to store the individual elements of that grid.
The other idea is compression. If most of the matrix is zero, it is ideal
data for compression, and it is easy to reconstruct it only when calculating.
Related to this, Tom has mentioned in the past that perhaps we should
support abstract use of the [] construct. Currently point finds a way to
make use of [], but I think that's actually coded into the grammar.
Yep, if we treat a 2-D array as a matrix, no special enhancement is needed
to use []. However, I'm inclined to have our own data structure for matrix,
to represent sparse matrices.
+1 I'm sure this would be useful for PL/R as well.
Joe
It is a pretty good idea to combine PL/R and PL/CUDA (what I'm now working on)
for advanced analytics. We would be able to off-load the heavy computing
portion to the GPU, and also utilize various R functions inside the database.
Agreed. Perhaps at some point we should discuss closer integration of
some sort, or at least a sample use case.
--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
On 05/28/2016 03:33 PM, Kouhei Kaigai wrote:
-----Original Message-----
From: Joe Conway [mailto:mail@joeconway.com]
Sent: Sunday, May 29, 2016 1:40 AM
To: Kaigai Kouhei(海外 浩平); Jim Nasby; Ants Aasma; Simon Riggs
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Do people favor having a matrix data type?
On 05/28/2016 07:12 AM, Kouhei Kaigai wrote:
Sparse matrices! That is an area where the current array format is at a disadvantage.
I have two ideas. HPC folks often split a large matrix into multiple
grids; a grid is typically up to a 1024x1024 matrix, for example.
If a grid consists of all-zero elements, we obviously don't need
to store the individual elements of that grid.
The other idea is compression. If most of the matrix is zero, it is ideal
data for compression, and it is easy to reconstruct it only when calculating.
Related to this, Tom has mentioned in the past that perhaps we should
support abstract use of the [] construct. Currently point finds a way to
make use of [], but I think that's actually coded into the grammar.
Yep, if we treat a 2-D array as a matrix, no special enhancement is needed
to use []. However, I'm inclined to have our own data structure for matrix,
to represent sparse matrices.
+1 I'm sure this would be useful for PL/R as well.
Joe
It is a pretty good idea to combine PL/R and PL/CUDA (what I'm now working on)
for advanced analytics. We would be able to off-load the heavy computing
portion to the GPU, and also utilize various R functions inside the database.
Agreed. Perhaps at some point we should discuss closer integration of
some sort, or at least a sample use case.
What I'm trying to implement first is k-means clustering on the GPU. Its core
workload is iteration of massive distance calculations. When I ran R's kmeans()
function for a million items with 10 clusters on 40 dimensions, it took about a
thousand seconds. If a GPU version can produce the result matrix more rapidly,
I expect R can then plot the relationship between items and clusters in a
human-friendly way.
For closer integration, it may be valuable if PL/R and PL/CUDA could exchange
the data structure with no serialization/deserialization when PL/R code calls
SQL functions. IIUC, pg.spi.exec("SELECT my_function(...)") is the only
way to call SQL functions inside PL/R scripts.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On 05/29/2016 04:55 PM, Kouhei Kaigai wrote:
For closer integration, it may be valuable if PL/R and PL/CUDA could exchange
the data structure with no serialization/deserialization when PL/R code calls
SQL functions.
I had been thinking about something similar. Maybe PL/R can create an
extension within the R environment that wraps PL/CUDA directly or at the
least provides a way to use a fast-path call. We should probably try to
start out with one common use case to see how it might work and how much
benefit there might be.
IIUC, pg.spi.exec("SELECT my_function(...)") is the only way to call SQL functions inside PL/R scripts.
Correct (currently).
BTW, this is starting to drift off topic I think -- perhaps we should
continue off list?
Thanks,
Joe
--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
On 05/29/2016 04:55 PM, Kouhei Kaigai wrote:
For closer integration, it may be valuable if PL/R and PL/CUDA could exchange
the data structure with no serialization/deserialization when PL/R code calls
SQL functions.
I had been thinking about something similar. Maybe PL/R can create an
extension within the R environment that wraps PL/CUDA directly or at the
least provides a way to use a fast-path call. We should probably try to
start out with one common use case to see how it might work and how much
benefit there might be.
My thought is the second option above. If the SPI interface supported a fast
path like the 'F' protocol message, it could become a natural way for other
PLs to integrate SQL functions as well.
IIUC, pg.spi.exec("SELECT my_function(...)") is the only way to call SQL functions
inside PL/R scripts.
Correct (currently).
BTW, this is starting to drift off topic I think -- perhaps we should
continue off list?
Some elements are of common interest for PostgreSQL (the matrix data type and
a fast-path SPI interface), so I'd like to keep the discussion on the list.
A PoC on a particular use case might be an off-list discussion.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On 31/05/16 12:01, Kouhei Kaigai wrote:
On 05/29/2016 04:55 PM, Kouhei Kaigai wrote:
For closer integration, it may be valuable if PL/R and PL/CUDA could exchange
the data structure with no serialization/deserialization when PL/R code calls
SQL functions.
I had been thinking about something similar. Maybe PL/R can create an
extension within the R environment that wraps PL/CUDA directly or at the
least provides a way to use a fast-path call. We should probably try to
start out with one common use case to see how it might work and how much
benefit there might be.
My thought is the second option above. If the SPI interface supported a fast
path like the 'F' protocol message, it could become a natural way for other
PLs to integrate SQL functions as well.
IIUC, pg.spi.exec("SELECT my_function(...)") is the only way to call SQL functions
inside PL/R scripts.
Correct (currently).
BTW, this is starting to drift off topic I think -- perhaps we should
continue off list?
Some elements are of common interest for PostgreSQL (the matrix data type and
a fast-path SPI interface), so I'd like to keep the discussion on the list.
A PoC on a particular use case might be an off-list discussion.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
Possibly there should be two matrix types in Postgres: the first would
be the default, optimized for small dense matrices; the second would
store large sparse matrices efficiently in memory at the expense of
speed (possibly with one or more parameters relating to how sparse it is
likely to be?) - for appropriate definitions of 'small' and 'large', though
memory savings for the latter type might not kick in unless the matrices
are big enough (so a small sparse matrix might consume more memory than
a nominally larger dense matrix, and a sparse matrix might have to be
sufficiently sparse to make real memory savings at any size).
Probably good to think of two types from the start, even if only the
first is implemented initially.
Cheers,
Gavin
-----Original Message-----
From: Gavin Flower [mailto:GavinFlower@archidevsys.co.nz]
Sent: Tuesday, May 31, 2016 9:47 AM
To: Kaigai Kouhei(海外 浩平); Joe Conway; Jim Nasby; Ants Aasma; Simon Riggs
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Do people favor having a matrix data type?
On 31/05/16 12:01, Kouhei Kaigai wrote:
On 05/29/2016 04:55 PM, Kouhei Kaigai wrote:
For closer integration, it may be valuable if PL/R and PL/CUDA could exchange
the data structure with no serialization/deserialization when PL/R code calls
SQL functions.
I had been thinking about something similar. Maybe PL/R can create an
extension within the R environment that wraps PL/CUDA directly or at the
least provides a way to use a fast-path call. We should probably try to
start out with one common use case to see how it might work and how much
benefit there might be.
My thought is the second option above. If the SPI interface supported a fast
path like the 'F' protocol message, it could become a natural way for other
PLs to integrate SQL functions as well.
IIUC, pg.spi.exec("SELECT my_function(...)") is the only way to call SQL functions
inside PL/R scripts.
Correct (currently).
BTW, this is starting to drift off topic I think -- perhaps we should
continue off list?
Some elements are of common interest for PostgreSQL (the matrix data type and
a fast-path SPI interface), so I'd like to keep the discussion on the list.
A PoC on a particular use case might be an off-list discussion.
Possibly there should be two matrix types in Postgres: the first would
be the default, optimized for small dense matrices; the second would
store large sparse matrices efficiently in memory at the expense of
speed (possibly with one or more parameters relating to how sparse it is
likely to be?) - for appropriate definitions of 'small' and 'large', though
memory savings for the latter type might not kick in unless the matrices
are big enough (so a small sparse matrix might consume more memory than
a nominally larger dense matrix, and a sparse matrix might have to be
sufficiently sparse to make real memory savings at any size).
One idea in my mind is to represent a sparse matrix as a grid of
smaller matrices, omitting the all-zero areas. It looks like an indirect
pointer reference: the header of the matrix type has offset values for
each grid, and if an offset == 0, that grid contains all zeros.
For performance reasons, the location of each element must be deterministic
without walking the data structure. This approach guarantees we can
reach an individual element in two steps.
A flat matrix can be represented as a special case of the sparse matrix:
if the entire matrix consists of a single grid, it is a flat matrix.
We may not need to define two individual data types.
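The grid layout sketched above could look like this in plain Python (the class name and the tiny 4x4 grid size are hypothetical; a real implementation would use offsets into one flat varlena buffer rather than Python lists):

```python
class SparseGridMatrix:
    """Sketch of the grid-of-tiles layout described above.

    The matrix is split into grid x grid tiles; `offsets` plays the role
    of the header: 0 means an all-zero tile (omitted entirely), anything
    else points at the tile's slot in `tiles`. Element access is two
    deterministic steps: (1) look up the tile offset, (2) index inside
    the tile. A matrix covered by a single tile is the flat special case.
    """

    def __init__(self, n, m, grid=4):
        self.n, self.m, self.grid = n, m, grid
        gh = (n + grid - 1) // grid                    # tiles per column
        gw = (m + grid - 1) // grid                    # tiles per row
        self.offsets = [[0] * gw for _ in range(gh)]   # 0 == all-zero tile
        self.tiles = [None]                            # slot 0 is reserved

    def set(self, i, j, v):
        gi, gj = i // self.grid, j // self.grid
        if self.offsets[gi][gj] == 0:                  # materialize the tile
            self.offsets[gi][gj] = len(self.tiles)
            self.tiles.append([[0.0] * self.grid for _ in range(self.grid)])
        self.tiles[self.offsets[gi][gj]][i % self.grid][j % self.grid] = v

    def get(self, i, j):
        off = self.offsets[i // self.grid][j // self.grid]    # step 1
        if off == 0:
            return 0.0                                        # omitted tile
        return self.tiles[off][i % self.grid][j % self.grid]  # step 2
```

Only tiles that hold a nonzero element ever get allocated, and access cost stays constant regardless of how sparse the matrix is.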
Probably good to think of two types from the start, even if only the
first is implemented initially.
I agree.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On 5/30/16 9:05 PM, Kouhei Kaigai wrote:
For performance reasons, the location of each element must be deterministic
without walking the data structure. This approach guarantees we can
reach an individual element in two steps.
Agreed.
On various other points...
Yes, please keep the discussion here, even when it relates only to PL/R.
Whatever is being done for R needs to be done for plpython as well. I've
looked at ways to improve analytics in plpython related to this, and it
looks like I need to take a look at the fast-path function stuff. One of
the things I've pondered for storing ndarrays in Postgres is how to
reduce or eliminate the need to copy data from one memory region to
another. It would be nice if there was a way to take memory that was
allocated by one manager (ie: python's) and transfer ownership of that
memory directly to Postgres without having to copy everything. Obviously
you'd want to go the other way as well. IIRC CPython's memory manager is
the same as palloc in regard to very large allocations basically being
ignored completely, so this should be possible in that case.
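The zero-copy idea can be illustrated with Python's own buffer protocol (this is only an analogy for sharing one allocation between two owners; it is not the actual plpython or palloc mechanics):

```python
# Two handles onto one allocation, no copy: the moral equivalent of
# handing Python-allocated memory to Postgres without memcpy.
import array

buf = array.array('d', [0.0] * 4)   # one allocation of four doubles
view = memoryview(buf)              # a second handle, no data copied

view[2] = 42.0                      # a write through the view...
value = buf[2]                      # ...is visible through the original
```

A real integration would additionally have to transfer *ownership* (who frees the memory), which the buffer protocol deliberately does not do.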
One thing I don't understand is why this type needs to be limited to 1
or 2 dimensions. Isn't the important thing how many individual elements
you can fit into the GPU? So if you can fit a 1024x1024, you could also fit
a 100x100x100, a 32x32x32x32, etc. At low enough values maybe that stops
making sense, but I don't see why there needs to be an artificial limit.
I think what's important for something like kNN is that the storage is
optimized for this, which I think means treating the highest dimension
as if it were a list. I don't know if it then matters whether the lower
dimensions are C style vs FORTRAN style. Other algorithms might want
different storage.
Something else to consider is the 1G toast limit. I'm pretty sure that's
why MADlib stores matrices as a table of vectors. I know for certain
it's a problem they run into, because they've discussed it on their
mailing list.
BTW, take a look at MADlib svec [1]. ISTM that's just a special case of
what you're describing with entire grids being zero (or vice-versa).
There might be some commonality there.
[1]: https://madlib.incubator.apache.org/docs/v1.8/group__grp__svec.html
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532) mobile: 512-569-9461
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
-----Original Message-----
From: Jim Nasby [mailto:Jim.Nasby@BlueTreble.com]
Sent: Wednesday, June 01, 2016 11:32 PM
To: Kaigai Kouhei(海外 浩平); Gavin Flower; Joe Conway; Ants Aasma; Simon Riggs
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Does people favor to have matrix data type?

On 5/30/16 9:05 PM, Kouhei Kaigai wrote:
Due to performance reasons, the location of each element must be
deterministic without walking the data structure. This approach
guarantees we can reach an individual element in 2 steps.

Agreed.
On various other points...
Yes, please keep the discussion here, even when it relates only to PL/R.
Whatever is being done for R needs to be done for plpython as well. I've
looked at ways to improve analytics in plpython related to this, and it
looks like I need to take a look at the fast-path function stuff. One of
the things I've pondered for storing ndarrays in Postgres is how to
reduce or eliminate the need to copy data from one memory region to
another. It would be nice if there was a way to take memory that was
allocated by one manager (ie: python's) and transfer ownership of that
memory directly to Postgres without having to copy everything. Obviously
you'd want to go the other way as well. IIRC CPython's memory manager is
the same as palloc in regard to very large allocations basically being
ignored completely, so this should be possible in that case.

One thing I don't understand is why this type needs to be limited to 1
or 2 dimensions? Isn't the important thing how many individual elements
you can fit into GPU? So if you can fit a 1024x1024, you could also fit
a 100x100x100, a 32x32x32x32, etc. At low enough values maybe that stops
making sense, but I don't see why there needs to be an artificial limit.
Simply, I hadn't considered matrices of more than 2 dimensions.
Is that a valid mathematical concept?
By its nature, a GPU is designed to map threads onto the X-, Y-, and
Z-axes. However, it is not limited to 3 dimensions, because the
programmer can, for example, treat the upper 10 bits of the X-axis as a
4th dimension.
I think what's important for something like kNN is that the storage is
optimized for this, which I think means treating the highest dimension
as if it were a list. I don't know if it then matters whether the lower
dimensions are C style vs FORTRAN style. Other algorithms might want
different storage.
FORTRAN style is preferable, because it allows us to use BLAS libraries
without a data format transformation.
I'm not sure why you prefer a list structure for the highest dimension.
A simple lookup table is sufficient, and suitable for massively parallel
processors.
Something else to consider is the 1G toast limit. I'm pretty sure that's
why MADlib stores matrices as a table of vectors. I know for certain
it's a problem they run into, because they've discussed it on their
mailing list.

BTW, take a look at MADlib svec[1]... ISTM that's just a special case of
what you're describing with entire grids being zero (or vice-versa).
There might be some commonality there.

[1] https://madlib.incubator.apache.org/docs/v1.8/group__grp__svec.html
Once we try to deal with a table as the representation of a matrix, it
goes beyond the scope of a data type in PostgreSQL. Users would likely
have to do special work to handle it, beyond the operator functions that
support matrix data types.
For a matrix larger than the toast limit, my thought is this: a large
matrix that consists of multiple grids can have one toast pointer for
each sub-matrix. Although an individual sub-matrix is limited to 1GB, we
can represent an entire matrix larger than 1GB as a set of grids.
As I wrote in the previous message, the matrix structure's header will
have an offset to each sub-matrix grid. Each sub-matrix will be stored
inline, or as external toast if VARATT_IS_1B_E(ptr).
We probably also have to hack the row deletion code so it does not leak
sub-matrices in the toast table.
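A rough sketch of the two-step addressing this grid scheme implies (the names and the tiny grid size are hypothetical; a real implementation would size each grid near the 1GB toast limit):

```python
GRID = 4  # edge length of one sub-matrix; toy value for illustration

def locate(i, j, grids_per_row=3):
    """Map a matrix element to (grid_id, offset-within-grid) in two steps."""
    gi, gj = i // GRID, j // GRID          # step 1: which grid holds it
    grid_id = gi * grids_per_row + gj      # index into the header's pointer array
    off = (i % GRID) * GRID + (j % GRID)   # step 2: row-major offset in the grid
    return grid_id, off

assert locate(0, 0) == (0, 0)
assert locate(5, 9) == (5, 5)   # grid (1,2) -> id 5; local (1,1) -> offset 5
```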
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers