Advice for using integer arrays?

Started by Michael Heaneyover 11 years ago9 messagesgeneral
Jump to latest
#1Michael Heaney
mheaney@jcvi.org

I'm fairly new to Postgres, and have a design issue for which an array
of integers might be a good solution. But I'd like to hear from the
experts before proceeding down this path.

Essentially, I'm trying to model the relationship between a group of
biological samples and their genes. Each sample (there are ~10K of them
now, with more coming) will have about 30,000 genes. Conversely, a
particular gene may be present in almost all samples.

So I've created the following table to handle the many-to-many relationship:

create table sample_gene (id serial, sample_id int, gene_id int);

which looks like this when populated:

sample_id | gene_id
---------------------------------------
1 | 1
1 | 2
...
1 | 30475
2 | 1
2 | 2
...
2 | 29973
3 | 1
etc.

The table now contains hundreds of millions of rows (with many, many
more to come). Join performance between samples and genes is quite
slow, even with indexes on sample_id and gene_id.

So it occurred to me: why not eliminate all the duplicate sample_id
values by storing the gene_id's in an array, like so:

create table sample_gene_array (id serial, sample_id int, gene_id int
[] );

So now the table data looks like this:

sample_id | gene_id []
---------------------------------------
1 | [1:30475]
2 | [1:29973]
etc.

The new table is significantly smaller, and performance (using ANY[] )
is quite good. Nevertheless, I'm uneasy. I come from a Sybase ASE
background, and so have no experience with arrays as datatypes. Is it
okay to store 30K+ gene values in an array in the linking table (or
maybe even in the sample table itself, thus eliminating the linking
table)? Should I unnest the gene_id's first, before using them to join
to the gene table?

TIA for any guidance you can provide. Again, I'm a Postgres neophyte -
but I'm in awe of the power and flexibility of this database, and wish
that I'd started using it sooner.

------
Michael Heaney
JCVI

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2Arthur Silva
arthurprs@gmail.com
In reply to: Michael Heaney (#1)
Re: Advice for using integer arrays?

On Jan 6, 2015 3:12 PM, "Michael Heaney" <mheaney@jcvi.org> wrote:

I'm fairly new to Postgres, and have a design issue for which an array of

integers might be a good solution. But I'd like to hear from the experts
before proceeding down this path.

Essentially, I'm trying to model the relationship between a group of

biological samples and their genes. Each sample (there are ~10K of them
now, with more coming) will have about 30,000 genes. Conversely, a
particular gene may be present in almost all samples.

So I've created the following table to handle the many-to-many

relationship:

create table sample_gene (id serial, sample_id int, gene_id int);

which looks like this when populated:

sample_id | gene_id
---------------------------------------
1 | 1
1 | 2
...
1 | 30475
2 | 1
2 | 2
...
2 | 29973
3 | 1
etc.

The table now contains hundreds of millions of rows (with many, many more

to come). Join performance between samples and genes is quite slow, even
with indexes on sample_id and gene_id.

So it occurred to me: why not eliminate all the duplicate sample_id

values by storing the gene_id's in an array, like so:

create table sample_gene_array (id serial, sample_id int, gene_id int []

);

So now the table data looks like this:

sample_id | gene_id []
---------------------------------------
1 | [1:30475]
2 | [1:29973]
etc.

The new table is significantly smaller, and performance (using ANY[] ) is

quite good. Nevertheless, I'm uneasy. I come from a Sybase ASE
background, and so have no experience with arrays as datatypes. Is it okay
to store 30K+ gene values in an array in the linking table (or maybe even
in the sample table itself, thus eliminating the linking table)? Should I
unnest the gene_id's first, before using them to join to the gene table?

TIA for any guidance you can provide. Again, I'm a Postgres neophyte -

but I'm in awe of the power and flexibility of this database, and wish that
I'd started using it sooner.

------
Michael Heaney
JCVI

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Please provide sample queries so we can understand how you query the data.

#3Rob Sargent
robjsargent@gmail.com
In reply to: Michael Heaney (#1)
Re: Advice for using integer arrays?

On 01/06/2015 10:09 AM, Michael Heaney wrote:

I'm fairly new to Postgres, and have a design issue for which an array
of integers might be a good solution. But I'd like to hear from the
experts before proceeding down this path.

Essentially, I'm trying to model the relationship between a group of
biological samples and their genes. Each sample (there are ~10K of
them now, with more coming) will have about 30,000 genes. Conversely,
a particular gene may be present in almost all samples.

So I've created the following table to handle the many-to-many
relationship:

create table sample_gene (id serial, sample_id int, gene_id int);

which looks like this when populated:

sample_id | gene_id
---------------------------------------
1 | 1
1 | 2
...
1 | 30475
2 | 1
2 | 2
...
2 | 29973
3 | 1
etc.

The table now contains hundreds of millions of rows (with many, many
more to come). Join performance between samples and genes is quite
slow, even with indexes on sample_id and gene_id.

So it occurred to me: why not eliminate all the duplicate sample_id
values by storing the gene_id's in an array, like so:

create table sample_gene_array (id serial, sample_id int, gene_id int
[] );

So now the table data looks like this:

sample_id | gene_id []
---------------------------------------
1 | [1:30475]
2 | [1:29973]
etc.

The new table is significantly smaller, and performance (using ANY[] )
is quite good. Nevertheless, I'm uneasy. I come from a Sybase ASE
background, and so have no experience with arrays as datatypes. Is it
okay to store 30K+ gene values in an array in the linking table (or
maybe even in the sample table itself, thus eliminating the linking
table)? Should I unnest the gene_id's first, before using them to
join to the gene table?

TIA for any guidance you can provide. Again, I'm a Postgres neophyte
- but I'm in awe of the power and flexibility of this database, and
wish that I'd started using it sooner.

------
Michael Heaney
JCVI

Think genotype calls (sample, marker, call) and this just explodes in
terms of number of rows but I'm trying to not let that bother me.
Partion by chromosome, or chromosome arm (even if partition is just
separate tables) and the magnitude becomes manageable. At least as
manageable as multi-gigabyte gvcf files per sample from GATK.

But I'm a little confused: all samples for a given species will have the
same genes (roughly). Are you storing gene variants (sequence or
otherwise) per sample?

#4Jeff Janes
jeff.janes@gmail.com
In reply to: Michael Heaney (#1)
Re: Advice for using integer arrays?

On Tue, Jan 6, 2015 at 9:09 AM, Michael Heaney <mheaney@jcvi.org> wrote:

I'm fairly new to Postgres, and have a design issue for which an array of
integers might be a good solution. But I'd like to hear from the experts
before proceeding down this path.

Essentially, I'm trying to model the relationship between a group of
biological samples and their genes. Each sample (there are ~10K of them
now, with more coming) will have about 30,000 genes. Conversely, a
particular gene may be present in almost all samples.

So I've created the following table to handle the many-to-many
relationship:

create table sample_gene (id serial, sample_id int, gene_id int);

What is the value of having this table at all? It doesn't seem to contain
anything informative, like an allele identifier, a resequence, or a copy
number variation. If you are just trying to record the fact that a gene
was present in that sample, perhaps it would be better to instead record
the genes have been deleted, rather than the ones that have not been
deleted? That would probably be a much smaller list.

create table sample_gene_array (id serial, sample_id int, gene_id int []
);

So now the table data looks like this:

sample_id | gene_id []
---------------------------------------
1 | [1:30475]
2 | [1:29973]
etc.

I'm not familiar with the square bracket and colon as a syntax for
expressing int arrays. Are you taking liberties with the psql output, or
using a different client program? Does that represent the range from 1 to
30475, or the two values 1 and 30475?

Cheers,

Jeff

#5Michael Heaney
mheaney@jcvi.org
In reply to: Jeff Janes (#4)
Re: Advice for using integer arrays?

On 1/6/2015 2:19 PM, Jeff Janes wrote:

On Tue, Jan 6, 2015 at 9:09 AM, Michael Heaney <mheaney@jcvi.org
<mailto:mheaney@jcvi.org>> wrote:

I'm fairly new to Postgres, and have a design issue for which an
array of integers might be a good solution. But I'd like to hear
from the experts before proceeding down this path.

Essentially, I'm trying to model the relationship between a group
of biological samples and their genes. Each sample (there are ~10K
of them now, with more coming) will have about 30,000 genes.
Conversely, a particular gene may be present in almost all samples.

So I've created the following table to handle the many-to-many
relationship:

create table sample_gene (id serial, sample_id int, gene_id int);

What is the value of having this table at all? It doesn't seem to
contain anything informative, like an allele identifier, a resequence,
or a copy number variation. If you are just trying to record the fact
that a gene was present in that sample, perhaps it would be better to
instead record the genes have been deleted, rather than the ones that
have not been deleted? That would probably be a much smaller list.

I suppose there could be a gene table which would contain data about
each gene_id.
But I'm an IT guy, not a biologist, and my sample_gene table doesn't
actually
exist. I'm more concerned with how to deal with many-to-many relationships
when each parent could have tens of thousands of children. Collapsing all
the children into an array for each parent looked intriguing - but maybe
it's
not a good idea. I just don't know, so I thought I'd ask you guys.

create table sample_gene_array (id serial, sample_id int, gene_id
int [] );

So now the table data looks like this:

sample_id | gene_id []
---------------------------------------
1 | [1:30475]
2 | [1:29973]
etc.

I'm not familiar with the square bracket and colon as a syntax for
expressing int arrays. Are you taking liberties with the psql output,
or using a different client program? Does that represent the range
from 1 to 30475, or the two values 1 and 30475?

Yes, it's shorthand for the full range of values from 1 to 30475. Wasn't
sure how
to represent it, and almost went with (1,2,3...30475). Apologies...

Michael Heaney
JCVI

#6Paul Jungwirth
pj@illuminatedcomputing.com
In reply to: Michael Heaney (#5)
Re: Advice for using integer arrays?

Hi Michael,

I can't comment on the domain-specific stuff, but I recently used numeric
arrays for a project and it worked well. In my case we had one million
simulation results (floats) per scenario, so rather than reading one
million separate rows to compute a histogram, we stored everything in one
row per scenario. Ideally one million floats is 8 megabytes, which is big
but still shouldn't require more than 100ms to read from disk and feed into
a simple computation. Here are some functions I wrote to make it easier &
faster to compute stats from numeric arrays:

https://github.com/pjungwir/aggs_for_arrays/

One caveat is that you will lose a lot of benefits by leaving the
relational model, e.g. being able to filter your integers by other
criteria. In our case all we had were floats, but imagine if you had
metadata attached to each one like the time the simulation was run, who ran
it, etc. Then you'd want to stick with something richer than just an array
of numbers.

But in that case maybe parallel arrays is acceptable. It would be a bit
like a column-store inside of Postgres. :-) I've been meaning to add a
function to that Github repo to filter an array given a same-size array of
booleans, so you can do filtering like in R or Pandas, but I haven't found
a nice way in Postgres to express e.g. `filter_array(simulation_results,
simulation_run_times > now() - interval '1 day')`.

Good luck!
Paul

On Tue, Jan 6, 2015 at 1:18 PM, Michael Heaney <mheaney@jcvi.org> wrote:

On 1/6/2015 2:19 PM, Jeff Janes wrote:

On Tue, Jan 6, 2015 at 9:09 AM, Michael Heaney <mheaney@jcvi.org> wrote:

I'm fairly new to Postgres, and have a design issue for which an array of
integers might be a good solution. But I'd like to hear from the experts
before proceeding down this path.

Essentially, I'm trying to model the relationship between a group of
biological samples and their genes. Each sample (there are ~10K of them
now, with more coming) will have about 30,000 genes. Conversely, a
particular gene may be present in almost all samples.

So I've created the following table to handle the many-to-many
relationship:

create table sample_gene (id serial, sample_id int, gene_id int);

What is the value of having this table at all? It doesn't seem to
contain anything informative, like an allele identifier, a resequence, or a
copy number variation. If you are just trying to record the fact that a
gene was present in that sample, perhaps it would be better to instead
record the genes have been deleted, rather than the ones that have not been
deleted? That would probably be a much smaller list.

I suppose there could be a gene table which would contain data about each
gene_id.
But I'm an IT guy, not a biologist, and my sample_gene table doesn't
actually
exist. I'm more concerned with how to deal with many-to-many relationships
when each parent could have tens of thousands of children. Collapsing all
the children into an array for each parent looked intriguing - but maybe
it's
not a good idea. I just don't know, so I thought I'd ask you guys.

create table sample_gene_array (id serial, sample_id int, gene_id int []
);

So now the table data looks like this:

sample_id | gene_id []
---------------------------------------
1 | [1:30475]
2 | [1:29973]
etc.

I'm not familiar with the square bracket and colon as a syntax for
expressing int arrays. Are you taking liberties with the psql output, or
using a different client program? Does that represent the range from 1 to
30475, or the two values 1 and 30475?

Yes, it's shorthand for the full range of values from 1 to 30475. Wasn't
sure how
to represent it, and almost went with (1,2,3...30475). Apologies...

Michael Heaney
JCVI

--
_________________________________
Pulchritudo splendor veritatis.

#7Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Michael Heaney (#5)
Re: Advice for using integer arrays?

On 01/06/2015 01:18 PM, Michael Heaney wrote:

On 1/6/2015 2:19 PM, Jeff Janes wrote:

On Tue, Jan 6, 2015 at 9:09 AM, Michael Heaney <mheaney@jcvi.org
<mailto:mheaney@jcvi.org>> wrote:

I'm fairly new to Postgres, and have a design issue for which an
array of integers might be a good solution. But I'd like to hear
from the experts before proceeding down this path.

Essentially, I'm trying to model the relationship between a group
of biological samples and their genes. Each sample (there are ~10K
of them now, with more coming) will have about 30,000 genes.
Conversely, a particular gene may be present in almost all samples.

So I've created the following table to handle the many-to-many
relationship:

create table sample_gene (id serial, sample_id int, gene_id int);

What is the value of having this table at all? It doesn't seem to
contain anything informative, like an allele identifier, a resequence,
or a copy number variation. If you are just trying to record the fact
that a gene was present in that sample, perhaps it would be better to
instead record the genes have been deleted, rather than the ones that
have not been deleted? That would probably be a much smaller list.

I suppose there could be a gene table which would contain data about
each gene_id.
But I'm an IT guy, not a biologist, and my sample_gene table doesn't
actually
exist.

Alright, now I am confused. In your original post you say sample_gene
does exist and you joined it against genes(I assume a gene table) which
is when you encountered slow performance. Did I miss something:)?

Per a previous suggestion, it might be good to show the queries you are
using or plan to use. The issue may be in the query not the layout. Also
running the query with EXPLAIN ANALYZE would be helpful:

http://www.postgresql.org/docs/9.3/interactive/sql-explain.html

I'm more concerned with how to deal with many-to-many relationships

when each parent could have tens of thousands of children. Collapsing all
the children into an array for each parent looked intriguing - but maybe
it's
not a good idea. I just don't know, so I thought I'd ask you guys.

create table sample_gene_array (id serial, sample_id int, gene_id
int [] );

So now the table data looks like this:

sample_id | gene_id []
---------------------------------------
1 | [1:30475]
2 | [1:29973]
etc.

I'm not familiar with the square bracket and colon as a syntax for
expressing int arrays. Are you taking liberties with the psql output,
or using a different client program? Does that represent the range
from 1 to 30475, or the two values 1 and 30475?

Yes, it's shorthand for the full range of values from 1 to 30475. Wasn't
sure how
to represent it,

Use Python? Looks like a 1 based slice notation.

and almost went with (1,2,3...30475). Apologies...

Michael Heaney
JCVI

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#8Martijn van Oosterhout
kleptog@svana.org
In reply to: Michael Heaney (#1)
Re: Advice for using integer arrays?

On Tue, Jan 06, 2015 at 12:09:56PM -0500, Michael Heaney wrote:

I'm fairly new to Postgres, and have a design issue for which an
array of integers might be a good solution. But I'd like to hear
from the experts before proceeding down this path.

The biggest consideration is if you are ever intending to use the
values in a join condition. Arrays a efficient space-wise and you also
have good indexing strategies with GIN indexes. You will need to
reframe your queries in terms of ([x] subset-of field) but that's
relatively straightforward.

What doesn't work or is fiddely:

- foreign keys

- selecting part of the list

- reordering or otherwise manipulating the list.

basically, if conceptually the list is a single object which you're
really only going to want to access as a whole, but still want good
indexing, then arrays are for you.

BTW, looking at your example, you might be more interested in ranges,
see for example:
http://www.postgresql.org/docs/9.2/static/rangetypes.html

Conceptually they are a bit different and there isn't support for
multi-ranges AFAIK but they might be more appropriate.

Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.

-- Arthur Schopenhauer

#9Thomas Kellerer
spam_eater@gmx.net
In reply to: Martijn van Oosterhout (#8)
Re: Advice for using integer arrays?

BTW, looking at your example, you might be more interested in ranges,
see for example:
http://www.postgresql.org/docs/9.2/static/rangetypes.html

Conceptually they are a bit different and there isn't support for
multi-ranges AFAIK

You could have an array of ranges

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general