On using doubles as primary keys
I have some data in the form of a matrix of doubles (~2 million
rows, ~400 columns) that I would like to store in a Pg table,
along with the associated table of metadata (same number of rows,
~30 columns, almost all text). This is large enough to make
working with it from flat files unwieldy.
(The only reason for wanting to transfer this data to a Pg table
is the hope that it will be easier to work with it by using SQL
queries to extract subsets of it, than by, e.g., writing programs
that must scan the entire matrix every time they're run.)
My plan is to add one extra column to the Pg tables to serve as
the primary key joining the data and the metadata tables
together.
One consideration complicating the choice of primary key is
that I want the ability to store chunks of the data table (not
the metadata table), including the PK column, as matrices of
doubles. In its simplest form, this would mean using doubles as
primary keys, but this seems to me a bit weird.
I'm willing to go ahead with this, but I wanted to ask for your
feedback on the whole thing. In particular I'd like to know if
there are standard ways of using doubles as primary keys,
and, conversely, whether there are known pitfalls I should be looking
out for, but I welcome any other words of wisdom you may have on
this topic.
Thanks in advance!
kj
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
First, please ALWAYS include the version and O/S, even with basic questions.
I'm not sure what you mean by doubles. Do you mean bigint data type, or do
you mean use two columns for a primary key? Either way it's pretty simple.
If you mean a bigint, then it's probably best to use the bigserial
data type, which defaults to the next value in its sequence.
If you mean using two columns for a primary key, then you just
specify the columns.
EG: CONSTRAINT table_name_pk PRIMARY KEY (col1, col2)
On Fri, Apr 17, 2015 at 11:34 AM, Kynn Jones <kynnjo@gmail.com> wrote:
<snip>
--
*Melvin Davidson*
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.
On Fri, Apr 17, 2015 at 8:45 AM, Melvin Davidson <melvin6925@gmail.com>
wrote:
<snip>
I take it the OP means "double precision"
http://www.postgresql.org/docs/9.0/static/datatype-numeric.html
I'd be suspicious of choosing a PK whose type definition includes
the word "inexact".
You also say you want to store the data as double precision, but you
never actually explain what the natural key of the data is.
If you are strictly storing serial (big) integers but doing so within a
double precision typed column you may be OK - but I would advise taking
input from someone with more expertise on the properties of IEEE floating
point numbers.
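For what it's worth, the property being relied on here is easy to check: an IEEE 754 double has a 53-bit significand, so every integer up to 2^53 in magnitude is representable exactly, and gaps appear only beyond that. A quick Python sketch (Python floats are IEEE 754 doubles on all common platforms):

```python
# IEEE 754 double precision has a 53-bit significand, so every
# integer with magnitude <= 2**53 is representable exactly.
exact_limit = 2 ** 53

assert float(exact_limit) == exact_limit          # still exact
assert float(exact_limit - 1) == exact_limit - 1  # still exact
# One past the limit collides with its neighbor: information is lost.
assert float(exact_limit + 1) == float(exact_limit)
```

So serial integer keys stored in a double precision column would round-trip exactly until the sequence exceeded 2^53, far beyond 2 million rows.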
David J.
On Fri, Apr 17, 2015 at 11:56 AM, David G. Johnston <
david.g.johnston@gmail.com> wrote:
<snip>
(Not an IEEE floating point expert, but...) I've learned the hard way to
never rely on comparing two floating point numbers for equality -- and
that's what you are doing if you join on them as primary keys. If you must
use the underlying numeric data for joining, I'd recommend you do something
like:
* output the value to a string (rounded to a specific number of digits)
* use the textual representation of the number as your primary key.
Alternatively, if the numbers are very large or very small, you may want to
try converting the binary data into a BINHEX value and use that textual
representation as the primary key.
Either way, I think you'd be saving yourself a lot of headaches using a
representation that is not 'inexact.'
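To make the pitfall concrete, here is a minimal Python sketch of both the problem and the round-to-text workaround (the 12-digit precision is an arbitrary choice for illustration):

```python
# Classic floating-point equality pitfall: these "equal" values differ,
# which is exactly what an equi-join on a double PK would trip over.
a = 0.1 + 0.2
b = 0.3
assert a != b  # a is 0.30000000000000004

# Rounding to a fixed number of digits and keying on the text,
# as suggested above, makes the comparison exact.
key_a = format(a, ".12f")
key_b = format(b, ".12f")
assert key_a == key_b == "0.300000000000"
```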
--Ray
--
Raymond Cote, President
voice: +1.603.924.6079 email: rgacote@AppropriateSolutions.com skype:
ray.cote
On 4/17/15 1:10 PM, Ray Cote wrote:
<snip>
numeric would probably be more efficient, if you can use that instead.
It does suffer from some of the same issues as floating point (there's
been recent discussion of that on -hackers), but I believe it'd be
better than double.
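A rough Python analogy of the difference (Decimal stands in for Postgres numeric here; they are separate implementations, but both do exact decimal arithmetic where binary doubles cannot):

```python
from decimal import Decimal

# Binary doubles cannot represent 0.1, 0.2, or 0.3 exactly...
assert 0.1 + 0.2 != 0.3
# ...but decimal arithmetic (like Postgres numeric) can.
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
```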
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Fri, Apr 17, 2015 at 10:34 AM, Kynn Jones <kynnjo@gmail.com> wrote:
<snip>
I could be wrong, but it seems to me that the records in the two files
have a one to one correspondence with each other. That is, record #1 in
file #1 has data related to record #1 in file #2. And that there should
therefore always be the same number of records in both files. So my first
thought is why not just have one table in PostgreSQL which contains the
data merged from the corresponding records in the two files? IMO, this is a
cleaner data design than trying to ensure that two tables are kept "in
sync" as records are added and deleted. Of course, this could possibly
hurt performance, due to reading a lot of data, if you consistently
only need data from one file and only occasionally read the other.
If you insist on two tables for some reason, then I would have the
load program read a record from each file, incrementing a counter, and
INSERT the data from the files into the tables using the counter as
the primary key. To ease things a bit, assuming the 1:1
correspondence, I would have the PK on the secondary table also be a
foreign key referencing the more active table, with ON DELETE CASCADE,
so that when you DELETE records from the more active table, the
corresponding records in the secondary table will automatically be
deleted. You might even want some sort of CONSTRAINT trigger to do a
DELETE on the primary table if a record is DELETEd from the secondary.
Also, if you really want to separate the data into more than one table,
then why only two? In this case, which I admit that I personally _dislike_,
you might want to do an analysis of what columns you access together
frequently and put those columns into separate tables. Again, the main plus
of this would be memory usage and I/O on the database server side. May
Joe Celko forgive me for introducing physical DB configuration into a
data structure discussion.
<snip>
--
If you sent twitter messages while exploring, are you on a textpedition?
He's about as useful as a wax frying pan.
10 to the 12th power microphones = 1 Megaphone
Maranatha! <><
John McKown
On Apr 17, 2015 8:35 AM, "Kynn Jones" <kynnjo@gmail.com> wrote:
(The only reason for wanting to transfer this data to a Pg table
is the hope that it will be easier to work with it by using SQL
800 million 8-byte numbers (2 million rows x 400 columns, i.e. roughly
6.4 GB) doesn't seem totally unreasonable for python/R/Matlab, if you
have a lot of memory. Are you sure you want it in Postgres? Load the
file once then filter it as you like. If you don't have the memory I
can see how using Postgres to get fewer rows at a time might help.
Fewer columns at a time would help even more if that's possible.
In its simplest form, this would mean using
doubles as primary keys, but this seems to me a bit weird.
I'd avoid that and just include an integer PK with your data. Data
frames in the languages above support that, or just slice off the PK
column before doing your matrix math.
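A tiny illustration of "slice off the PK column" in plain Python, with hypothetical data (in practice you'd do the same thing with a NumPy array or a data frame):

```python
# Hypothetical 3 x (1 + 2) matrix: the first column is an integer PK,
# the remaining columns are the actual doubles.
rows = [
    [1, 0.5, 1.5],
    [2, 2.5, 3.5],
    [3, 4.5, 5.5],
]

pks = [row[0] for row in rows]    # keep the keys aside for later joins
data = [row[1:] for row in rows]  # pure-double matrix for the math

assert pks == [1, 2, 3]
assert data == [[0.5, 1.5], [2.5, 3.5], [4.5, 5.5]]
```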
Also instead of 401 columns per row maybe store all 400 doubles in an array
column? Not sure if that's useful for you but maybe it's worth considering.
Also if you put the metadata in the same table as the doubles, can you
leave off the PKs altogether? Why join if you don't have to? It sounds like
the tables are 1-to-1? Even if some metadata is not, maybe you can finesse
it with hstore/arrays.
Good luck!
Paul