Managing two sets of data in one database

Started by Jonathan Bartlettabout 14 years ago8 messagesgeneral
Jump to latest
#1Jonathan Bartlett
jonathan.l.bartlett@gmail.com

I have a database which contains two primary sets of data:

1) A large (~150GB) dataset. This data set is mainly static. It is
updated, but not by the users (it is updated by our company, which provides
the data to users). There are some deletions, but it is safe to consider
this an "add-only" database, where only new records are created.
2) A small (~10MB but growing) dataset. This is the user's data. It
includes many bookmarks (i.e. foreign keys) into data set #1. However, I
am not explicitly using any referential integrity system.

Also, many queries cross the datasets together.

Now, my issue is that right now when we do updates to the dataset, we have
to make them to the live database. I would prefer to manage data releases
the way we manage software releases - have a staging area, test the data,
and then deploy it to the users. However, I am not sure the best approach
for this. If there weren't lots of crossover queries, I could just shove
them in separate databases, and then swap out dataset #1 when we have a new
release.

Does anyone have any ideas?

Thanks,

Jon

#2John R Pierce
pierce@hogranch.com
In reply to: Jonathan Bartlett (#1)
Re: Managing two sets of data in one database

On 03/29/12 9:43 AM, Jonathan Bartlett wrote:

1) A large (~150GB) dataset. This data set is mainly static. It is
updated, but not by the users (it is updated by our company, which
provides the data to users). There are some deletions, but it is safe
to consider this an "add-only" database, where only new records are
created.
2) A small (~10MB but growing) dataset. This is the user's data. It
includes many bookmarks (i.e. foreign keys) into data set #1.
However, I am not explicitly using any referential integrity system.

by 'dataset' do you mean table, aka relation ?

by 'not using any referential integrity', do you mean, you're NOT using
foreign keys ('REFERENCES table(field)' in your table declaration ?

Also, many queries cross the datasets together.

by 'cross', do you mean JOIN ?

Now, my issue is that right now when we do updates to the dataset, we
have to make them to the live database. I would prefer to manage data
releases the way we manage software releases - have a staging area,
test the data, and then deploy it to the users. However, I am not
sure the best approach for this. If there weren't lots of crossover
queries, I could just shove them in separate databases, and then swap
out dataset #1 when we have a new release.

you can't JOIN data across relations(tables) in different databases.

--
john r pierce N 37, W 122
santa cruz ca mid-left coast

#3Jonathan Bartlett
jonathan@newmedio.com
In reply to: John R Pierce (#2)
Re: Managing two sets of data in one database

by 'dataset' do you mean table, aka relation ?'

It's a group of tables.

by 'not using any referential integrity', do you mean, you're NOT using
foreign keys ('REFERENCES table(field)' in your table declaration ?

Correct.

Also, many queries cross the datasets together.

by 'cross', do you mean JOIN ?

There are joins, subselects, and similar queries which cross the datasets.

Now, my issue is that right now when we do updates to the dataset, we

have to make them to the live database. I would prefer to manage data
releases the way we manage software releases - have a staging area, test
the data, and then deploy it to the users. However, I am not sure the best
approach for this. If there weren't lots of crossover queries, I could
just shove them in separate databases, and then swap out dataset #1 when we
have a new release.

you can't JOIN data across relations(tables) in different databases.

Right. That's the reason I asked on the list. I didn't know if there is a
good way of managing this sort of data. If I could just have two different
databases, I would have done that a while ago. I didn't know if someone
had a similar situation and what kind of solution they used for it. Right
now, both datasets are in the same database. But that means I can't do
releases of the static dataset, and instead, when the company updates the
database, we have to make the updates directly on the live database. I'm
trying to avoid that and do releases, and I am seeing if anyone knows of a
good approach given the constraints.

Jon

#4Tom Molesworth
tom@audioboundary.com
In reply to: Jonathan Bartlett (#3)
Re: Managing two sets of data in one database

Hi Jonathan,

On 29/03/12 19:01, Jonathan Bartlett wrote:

Now, my issue is that right now when we do updates to the
dataset, we have to make them to the live database. I would
prefer to manage data releases the way we manage software
releases - have a staging area, test the data, and then deploy
it to the users. However, I am not sure the best approach for
this. If there weren't lots of crossover queries, I could
just shove them in separate databases, and then swap out
dataset #1 when we have a new release.

you can't JOIN data across relations(tables) in different databases.

Right. That's the reason I asked on the list. I didn't know if there
is a good way of managing this sort of data. If I could just have two
different databases, I would have done that a while ago. I didn't
know if someone had a similar situation and what kind of solution they
used for it. Right now, both datasets are in the same database. But
that means I can't do releases of the static dataset, and instead,
when the company updates the database, we have to make the updates
directly on the live database. I'm trying to avoid that and do
releases, and I am seeing if anyone knows of a good approach given the
constraints.

Have you considered using views in the queries instead of hitting the
base tables directly? You could then load the releases into a different
schema (so instead of select * from mytable, you have a view which does
select * from release_20110329.mytable, for example) or use different
table names for each release (live_*, test_*, beta_* maybe). Switching
between releases should be fast (and atomic), but everything would still
be within the same database so you'd be able to get to all the data you
need.

cheers,

Tom

#5Jonathan Bartlett
jonathan@newmedio.com
In reply to: Tom Molesworth (#4)
Re: Managing two sets of data in one database

Tom -

Your suggestion gives me an idea, and I'd like your opinion since I haven't
done much with schemas.

(1) Separate the datasets into different schemas
(2) Use different schema names for different static data releases
(3) For the *company*, we can use a schema search path that includes the
next schema
(4) For the *users*, they can use a schema search path that includes the
released schema

Then, I wouldn't have to modify any code, except to set the schema search
path based on who was connecting.

Does this sound reasonable?

Jon

On Thu, Mar 29, 2012 at 1:26 PM, Tom Molesworth <tom@audioboundary.com>wrote:

Show quoted text

Hi Jonathan,

On 29/03/12 19:01, Jonathan Bartlett wrote:

Now, my issue is that right now when we do updates to the dataset, we

have to make them to the live database. I would prefer to manage data
releases the way we manage software releases - have a staging area, test
the data, and then deploy it to the users. However, I am not sure the best
approach for this. If there weren't lots of crossover queries, I could
just shove them in separate databases, and then swap out dataset #1 when we
have a new release.

you can't JOIN data across relations(tables) in different databases.

Right. That's the reason I asked on the list. I didn't know if there
is a good way of managing this sort of data. If I could just have two
different databases, I would have done that a while ago. I didn't know if
someone had a similar situation and what kind of solution they used for it.
Right now, both datasets are in the same database. But that means I can't
do releases of the static dataset, and instead, when the company updates
the database, we have to make the updates directly on the live database.
I'm trying to avoid that and do releases, and I am seeing if anyone knows
of a good approach given the constraints.

Have you considered using views in the queries instead of hitting the base
tables directly? You could then load the releases into a different schema
(so instead of select * from mytable, you have a view which does select *
from release_20110329.mytable, for example) or use different table names
for each release (live_*, test_*, beta_* maybe). Switching between releases
should be fast (and atomic), but everything would still be within the same
database so you'd be able to get to all the data you need.

cheers,

Tom

#6Scott Marlowe
scott.marlowe@gmail.com
In reply to: Jonathan Bartlett (#5)
Re: Managing two sets of data in one database

On Thu, Mar 29, 2012 at 2:39 PM, Jonathan Bartlett
<jonathan@newmedio.com> wrote:

Tom -

Your suggestion gives me an idea, and I'd like your opinion since I haven't
done much with schemas.

(1) Separate the datasets into different schemas
(2) Use different schema names for different static data releases
(3) For the *company*, we can use a schema search path that includes the
next schema
(4) For the *users*, they can use a schema search path that includes the
released schema

Then, I wouldn't have to modify any code, except to set the schema search
path based on who was connecting.

Does this sound reasonable?

It's pretty much what I was going to suggest.

#7Thomas Kellerer
spam_eater@gmx.net
In reply to: Jonathan Bartlett (#5)
Re: Managing two sets of data in one database

Jonathan Bartlett wrote on 29.03.2012 22:39:

Your suggestion gives me an idea, and I'd like your opinion since I haven't done much with schemas.

(1) Separate the datasets into different schemas
(2) Use different schema names for different static data releases
(3) For the *company*, we can use a schema search path that includes the next schema
(4) For the *users*, they can use a schema search path that includes the released schema

Then, I wouldn't have to modify any code, except to set the schema search path based on who was connecting.

If you create different Postgres users, you can set the search_path per user, so you don't have to remember doing that while connecting.

#8John R Pierce
pierce@hogranch.com
In reply to: Thomas Kellerer (#7)
Re: Managing two sets of data in one database

On 03/29/12 2:16 PM, Thomas Kellerer wrote:

Jonathan Bartlett wrote on 29.03.2012 22:39:

Your suggestion gives me an idea, and I'd like your opinion since I
haven't done much with schemas.

(1) Separate the datasets into different schemas
(2) Use different schema names for different static data releases
(3) For the *company*, we can use a schema search path that includes
the next schema
(4) For the *users*, they can use a schema search path that includes
the released schema

Then, I wouldn't have to modify any code, except to set the schema
search path based on who was connecting.

If you create different Postgres users, you can set the search_path
per user, so you don't have to remember doing that while connecting.

heck, the default search_path is $USER,"public"

--
john r pierce N 37, W 122
santa cruz ca mid-left coast