spectral datasets in postgresql

Started by Glenn Sullivanover 24 years ago4 messagesgeneral

glenn.sullivan@varianinc.com

over 24 years ago

Hi,

Is it possible/feasible/desirable to use postgresql to hold large spectral
datasets. The dataset would consist of a number of attributes and a
spectrum. The spectrum is simply a list of floating point values. However,
there can be from 128 points to 32 million points for one spectrum.

There would then of course be thousands of these spectra.

I don't know how you would save those 32 million points. If you can do
so, I don't know if it is a feasible way to use postgresql.

I welcome your comments.

Glenn

Nick Fankhauser

nickf@ontko.com

over 24 years ago

In reply to: Glenn Sullivan (#1)

Re: spectral datasets in postgresql

Glenn

If I understand correctly, you're asking if it is possible to represent the
one to many relationship between the spectra & the values they consist of.
The answer is yes- that is one of the things relational databases like
PostgreSQL do well, and what you describe is a fairly simple data structure.

-Nick

--------------------------------------------------------------------------
Nick Fankhauser nickf@ontko.com Phone 1.765.935.4283 Fax 1.765.962.9788
Ray Ontko & Co. Software Consulting Services http://www.ontko.com/

Show quoted text

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org]On Behalf Of Glenn Sullivan
Sent: Thursday, November 15, 2001 5:32 PM
To: pgsql-general@postgresql.org
Subject: [GENERAL] spectral datasets in postgresql

Hi,

Is it possible/feasible/desirable to use postgresql to hold large spectral
datasets. The dataset would consist of a number of attributes and a
spectrum. The spectrum is simply a list of floating point
values. However,
there can be from 128 points to 32 million points for one spectrum.

There would then of course be thousands of these spectra.

I don't know how you would save those 32 million points. If you can do
so, I don't know if it is a feasible way to use postgresql.

I welcome your comments.

Glenn

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

Antonio Fiol Bonnín

fiol@w3ping.com

over 24 years ago

In reply to: Nick Fankhauser (#2)

Re: spectral datasets in postgresql

Nick,

I am afraid that Glenn is not as worried about that as he is aabout the fact
that there may be up to 32 million entries per spectrum times 1000+ spectra,
which makes a total of about 3200+ million entries, supposedly on the same
relation.

Maybe I'm off-base, but I am thinking of something like:

CREATE TABLE spectra ( id SERIAL, element VARCHAR(30), ...other data about the
spectra... );
CREATE TABLE data ( id int, value float );

Table spectra would hold 1000+ entries (quite small table, after all).
Table data would hold all the "value"s for all spectra, i.e. 3200+ million
entries, and not even id is unique.

Structure seems fine to me for modest purposes, but not for this one. Of
course, you did not talk about any particular structure, Nick, but I can't
think of another one at the moment.

Of course, what here applies por PostgreSQL seems to me that may also apply for
any other RDBMS I can think of.

Nick, you are certainly a more experienced programmer than I am. Could you
please tell me what I am missing? And maybe give your opinion on the following
proposal? Thank you!!

(After re-reading my e-mail, I have come to the following point: It may be
feasible, as the rows are not very large. I am not sure if the influence of
that is very important or not. In any case, supposing 10byte/row, which is
probably a really low estimate, that makes a table with 32 Terabytes of data.
IIRC, PostgreSQL needs all the data on a particular table to fit on a
filesystem. I guess you will need an enormous partition...)

However, ...

Glenn,

Will you need to search by your "values"? A search like "I would like to know
which spectra have a spectrum "value" between X and Y", I mean...

If you will not need that, you may think of BLOBS to store your spectrum data.
That will avoid you 32 million inserts per spectrum ;-) Even if you issue them
in the same transaction, I guess the performance will be a big issue there.

(Again, re-reading and recalculating, that makes 32 million 32 bits (?)
floating point numbers per BLOB, times 1000. That is, at least, about 1
Terabyte, if I did my calculations properly).

A question has just finally come to my mind: Is compression possible, either on
the original data (transformation to avoid redundant "spectral" information) or
on the float values (gzip compression or something like that)? In that case,
you may save quite a lot of disk space.

Good luck!

Antonio Fiol

P.S. Of course, you need to add any overhead data generated by the RDBMS in
both cases.

Nick Fankhauser wrote:

Show quoted text

Glenn

If I understand correctly, you're asking if it is possible to represent the
one to many relationship between the spectra & the values they consist of.
The answer is yes- that is one of the things relational databases like
PostgreSQL do well, and what you describe is a fairly simple data structure.

-Nick

--------------------------------------------------------------------------
Nick Fankhauser nickf@ontko.com Phone 1.765.935.4283 Fax 1.765.962.9788
Ray Ontko & Co. Software Consulting Services http://www.ontko.com/

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org]On Behalf Of Glenn Sullivan
Sent: Thursday, November 15, 2001 5:32 PM
To: pgsql-general@postgresql.org
Subject: [GENERAL] spectral datasets in postgresql

Hi,

Is it possible/feasible/desirable to use postgresql to hold large spectral
datasets. The dataset would consist of a number of attributes and a
spectrum. The spectrum is simply a list of floating point
values. However,
there can be from 128 points to 32 million points for one spectrum.

There would then of course be thousands of these spectra.

I don't know how you would save those 32 million points. If you can do
so, I don't know if it is a feasible way to use postgresql.

I welcome your comments.

Glenn

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

Nick Fankhauser

nickf@ontko.com

over 24 years ago

In reply to: Antonio Fiol Bonnín (#3)

Re: spectral datasets in postgresql

Antonio-

You're right- I focused on the wide variation of 128->32 million & thought
that was Glenn's question. While I was peripherally aware of the hugeness of
the data, I chocked that up as an equal issue under just about any storage
system and I was indeed thinking of the simple data structure you proposed.
However, as you point out, 32 Billion is big enough to present its own
unique set of problems.

Now that I look at it from this angle, I note that one of the
characteristics of this data is that it is always retrieved in the same
groups, since the float values for one spectrum can belong to no other.
Although I don't know the details, I'd guess that the values are unlikely to
change- maybe Glenn can weigh in with some more information on that.

This being the case, perhaps a better structure would involve storing the
spectra values in BLOBs, and simply having a single table with the other
attributes and OIDs for the BLOBs in them. This would eliminate the need to
scan billions of rows to retrieve values.

However, once we have a structure of this type, it seems almost a waste to
use a relational database.

Glenn- Do the values for a spectrum need to be sorted? that is, do they come
to you out of order & then need to be retrieved in order? Knowing this would
help determine whether storing them in a BLOB with no further organization
is a reasonable approach. Also, do the values need to be changeable, or is
this basically static data?

Antonio- Thanks for setting pointing me at the real issue- I completely
missed it!

-Nick

-----Original Message-----
From: admin@ontko.com [mailto:admin@ontko.com]On Behalf Of Antonio Fiol
Bonnï¿½n
Sent: Friday, November 16, 2001 3:25 AM
To: nickf@ontko.com
Cc: Glenn Sullivan; pgsql-general@postgresql.org
Subject: Re: [GENERAL] spectral datasets in postgresql

Nick,

I am afraid that Glenn is not as worried about that as he is
aabout the fact
that there may be up to 32 million entries per spectrum times
1000+ spectra,
which makes a total of about 3200+ million entries, supposedly on the same
relation.

Maybe I'm off-base, but I am thinking of something like:

CREATE TABLE spectra ( id SERIAL, element VARCHAR(30), ...other
data about the
spectra... );
CREATE TABLE data ( id int, value float );

Table spectra would hold 1000+ entries (quite small table, after all).
Table data would hold all the "value"s for all spectra, i.e. 3200+ million
entries, and not even id is unique.

Structure seems fine to me for modest purposes, but not for this one. Of
course, you did not talk about any particular structure, Nick, but I can't
think of another one at the moment.

Of course, what here applies por PostgreSQL seems to me that may
also apply for
any other RDBMS I can think of.

Nick, you are certainly a more experienced programmer than I am. Could you
please tell me what I am missing? And maybe give your opinion on
the following
proposal? Thank you!!

(After re-reading my e-mail, I have come to the following point: It may be
feasible, as the rows are not very large. I am not sure if the
influence of
that is very important or not. In any case, supposing 10byte/row, which is
probably a really low estimate, that makes a table with 32
Terabytes of data.
IIRC, PostgreSQL needs all the data on a particular table to fit on a
filesystem. I guess you will need an enormous partition...)

However, ...

Glenn,

Will you need to search by your "values"? A search like "I would
like to know
which spectra have a spectrum "value" between X and Y", I mean...

If you will not need that, you may think of BLOBS to store your
spectrum data.
That will avoid you 32 million inserts per spectrum ;-) Even if
you issue them
in the same transaction, I guess the performance will be a big
issue there.

(Again, re-reading and recalculating, that makes 32 million 32 bits (?)
floating point numbers per BLOB, times 1000. That is, at least, about 1
Terabyte, if I did my calculations properly).

A question has just finally come to my mind: Is compression
possible, either on
the original data (transformation to avoid redundant "spectral"
information) or
on the float values (gzip compression or something like that)? In
that case,
you may save quite a lot of disk space.

Good luck!

Antonio Fiol

P.S. Of course, you need to add any overhead data generated by
the RDBMS in
both cases.

Nick Fankhauser wrote:

Glenn

If I understand correctly, you're asking if it is possible to

represent the

one to many relationship between the spectra & the values they

consist of.

The answer is yes- that is one of the things relational databases like
PostgreSQL do well, and what you describe is a fairly simple

data structure.

-Nick

--------------------------------------------------------------------------

Nick Fankhauser nickf@ontko.com Phone 1.765.935.4283 Fax

1.765.962.9788

Ray Ontko & Co. Software Consulting Services

http://www.ontko.com/

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org]On Behalf Of Glenn Sullivan
Sent: Thursday, November 15, 2001 5:32 PM
To: pgsql-general@postgresql.org
Subject: [GENERAL] spectral datasets in postgresql

Hi,

Is it possible/feasible/desirable to use postgresql to hold large

spectral

Show quoted text

datasets. The dataset would consist of a number of attributes and a
spectrum. The spectrum is simply a list of floating point
values. However,
there can be from 128 points to 32 million points for one spectrum.

There would then of course be thousands of these spectra.

I don't know how you would save those 32 million points. If you can do
so, I don't know if it is a feasible way to use postgresql.

I welcome your comments.

Glenn

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org