Schema boggle...
Hi All,
I work with Marc Fournier, for those who don't know, and have been
working extensively with the schemas feature for Hub.Org's new Account
Management package. Each client's data is stored in a separate schema
set aside just for them (though they will never have direct access to
it, it helps us "balance" out the data). Each schema has the same set of
tables, in which different data is stored, depending on the client.
I've run into two issues and I've been reading over the docs in
relation to schemas in the hopes that I could find a solution to my
problems. Here are the details of the issues:
1. The first issue I'm running into is that if I have several schemas
with the following names:
public
test_001
test_002
test_003
the test_* schemas have the same tables listed in their namespaces,
such that:
Schema test_001 contains:
test_contact;
test_domains;
test_accounts;
...
Schema test_002 contains:
test_contact;
test_domains;
test_accounts;
...
etc.
As you can see this is a nice, clean way to break down some datasets.
But, if I do:
set search_path to public, test_001, test_002;
I only get access to the tables in test_001 and public, the tables in
test_002 are not listed, and thus I do not see them on the screen while
doing a "\d".
2. Issue 2 is more a SQL issue than anything, but I'd like to be able
to do something to the effect of:
SELECT * FROM test_*.test_domains ORDER BY domain_name;
or better yet:
SELECT * from public.domain_summary ds LEFT JOIN test_*.test_domains td
ON (td.domain_id = ds.id);
This would mean I can run one query to get all of the information I
need, rather than having to run a query for each client that I have to
list domains for. If I've got 10,000 clients, that's going to take a ton
of time to get the results I'm looking for...
Does anyone know if any of these issues have been addressed in 7.4? I
see some references to schema based changes, but no details on what
those changes were. If not, are there any plans to do something like
that?
--
Chris Bowlby <excalibur@hub.org>
Hub.Org Networking Services
Chris Bowlby wrote:
<snip>
You can't use a wildcard like that, but you can create a view like
CREATE VIEW public.all_test_domains AS
SELECT * FROM test_001.test_domains
UNION ALL
SELECT * FROM test_002.test_domains;
You have to change that view as customers (resp. schemas) come and go.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
On Wed, 5 Nov 2003, Jan Wieck wrote:
Chris Bowlby wrote:
<snip>
You can't use a wildcard like that, but you can create a view like

CREATE VIEW public.all_test_domains AS
SELECT * FROM test_001.test_domains
UNION ALL
SELECT * FROM test_002.test_domains;

You have to change that view as customers (resp. schemas) come and go.
Oh, just thought of it ... I know there used to be a limitation on this, but
I swore that it was removed ... how about a function? Could you do
something like:
for i in `SELECT nspname FROM pg_namespace WHERE nspname LIKE 'test_%'`
... I know functions returning multiple rows was a limitation in the past, but
didn't that *just* get corrected, or am I mis-remembering?
Chris Bowlby <excalibur@hub.org> writes:
As you can see this is a nice, clean way to break down some datasets.
But, if I do:
set search_path to public, test_001, test_002;
I only get access to the tables in test_001 and public, the tables in
test_002 are not listed, and thus I do not see them on the screen while
doing a "\d".
Well, sure. They are masked by the identically named tables in
test_001. How else would you expect it to work?
2. Issue 2 is more a SQL issue than anything, but I'd like to be able
to do something to the effect of:
SELECT * FROM test_*.test_domains ORDER BY domain_name;
I can't see trying to support this at the SQL level. You might consider
making a plpgsql function that constructs a UNION ALL query listing all
the proper schemas and then EXECUTEs it on-the-fly.
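A rough, untested sketch of such a function, assuming the schema/table
layout from the original message (the function name is made up, and
set-returning PL/pgSQL functions need 7.3 or later):

```sql
CREATE FUNCTION all_test_domains() RETURNS SETOF test_001.test_domains AS '
DECLARE
    ns RECORD;
    q  text := '''';
    r  test_001.test_domains%ROWTYPE;
BEGIN
    -- build one UNION ALL branch per matching schema
    FOR ns IN SELECT nspname FROM pg_namespace
              WHERE nspname ~ ''^test_[0-9]+$'' ORDER BY nspname LOOP
        IF q <> '''' THEN
            q := q || '' UNION ALL '';
        END IF;
        q := q || ''SELECT * FROM ''
                || quote_ident(ns.nspname) || ''.test_domains'';
    END LOOP;
    -- run the constructed query and hand the rows back
    FOR r IN EXECUTE q LOOP
        RETURN NEXT r;
    END LOOP;
    RETURN;
END;
' LANGUAGE plpgsql;

-- usage:
-- SELECT * FROM all_test_domains() ORDER BY domain_name;
```

Unlike the static view, this picks up new test_* schemas automatically,
at the cost of re-planning the query on every call.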
Another possible solution is to make all the individual tables be
inheritance children of a master table. Then selecting from the master
would work (and would be equivalent to the UNION performance-wise, I
think).
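For illustration, the inheritance variant might look like this (the
parent table name and column list are guesses based on the queries
quoted above):

```sql
-- a master table in public holding the shared definition
CREATE TABLE public.test_domains_all (
    domain_id   integer,
    domain_name text
);

-- each client's table inherits the parent's columns
CREATE TABLE test_001.test_domains ()
    INHERITS (public.test_domains_all);
CREATE TABLE test_002.test_domains ()
    INHERITS (public.test_domains_all);

-- a SELECT on the parent scans every child as well
SELECT * FROM public.test_domains_all ORDER BY domain_name;
```

New client schemas then join the "all clients" query simply by
inheriting, with no view to maintain.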
regards, tom lane
Chris,
I work with Marc Fournier, for those who don't know, and have been
working extensively with the schemas feature for Hub.Org's new Account
Management package. Each client's data is stored in a seperate schema
set asside just for them (though they will never have direct access to
it, it helps us "balance" out the data. Each schema has the same set of
tables, with which different data is stored, depending on the client.
While this is an interesting use of Schema, it is not, IMHO, a good way to
partition client data. Far better to have monolithic tables with a
"client_id" column and then construct *views* which are available in each
schema only to that client. Otherwise, as you've discovered, top-level
management becomes a royal pain.
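A small sketch of what that could look like, with hypothetical table and
column names (one shared table, one view per client schema):

```sql
-- one monolithic table keyed by client
CREATE TABLE public.client_domains (
    client_id   integer,
    domain_id   integer,
    domain_name text
);

-- client 1 sees only their own rows through their schema's view
CREATE VIEW test_001.test_domains AS
    SELECT domain_id, domain_name
    FROM public.client_domains
    WHERE client_id = 1;
```

Cross-client reporting then stays a single query against
public.client_domains.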
I'd be happy to discuss this further on PGSQL-SQL, which is really the
appropriate mailing list.
--
-Josh Berkus
Aglio Database Solutions
San Francisco
On Wed, 5 Nov 2003, Tom Lane wrote:
Chris Bowlby <excalibur@hub.org> writes:
As you can see this is a nice, clean way to break down some datasets.
But, if I do:
set search_path to public, test_001, test_002;
I only get access to the tables in test_001 and public, the tables in
test_002 are not listed, and thus I do not see them on the screen while
doing a "\d".
Well, sure. They are masked by the identically named tables in
test_001. How else would you expect it to work?
List of relations
Schema | Name | Type | Owner
----------+-----------------------+----------+-----------
public | categories | table | 186_pgsql
public | categories_rec_id_seq | sequence | 186_pgsql
test_001 | table1 | table | 186_pgsql
test_002 | table1 | table | 186_pgsql
the uniqueness, I would have thought, would have been schema.name, not
just name ...
"Marc G. Fournier" <scrappy@postgresql.org> writes:
How else would you expect it to work?
List of relations
Schema | Name | Type | Owner
----------+-----------------------+----------+-----------
public | categories | table | 186_pgsql
public | categories_rec_id_seq | sequence | 186_pgsql
test_001 | table1 | table | 186_pgsql
test_002 | table1 | table | 186_pgsql
You can get something like that if you say "\d *.*". If you say "\d *"
or equivalently just "\d", then what you see is only the tables that you
could reference with unqualified names. Which does not include the
tables in test_002, because they're hidden by the ones in test_001.
regards, tom lane
Actually, the use of schemas was my idea, to speed up some dreadfully
slow queries dealing with traffic stats from a table that was growing
painfully monolithic ... the idea is/was that it would be easier to
backup/remove all data pertaining to a specific client if they decided to
close their account ...
On Wed, 5 Nov 2003, Josh Berkus wrote:
<snip>
Hi Marc,
I was actually leaning towards schemas as a method to partition the data
out when I was beginning to plan AMS 2. Your suggestion of schemas
confirmed this as a good approach for what we were looking for, so I started
to code the initial layout for the schema in April/May, but at the time I
did not have the full picture in my head on dealing with some sections of
the database....
To address Josh's comments though.
Monolithic tables have their uses, I don't and won't contest that. However,
we've been using a monolithic structure in our current AMS distribution and
in a lot of areas it has been quite a pain to try and bring some uniformity
into the structure. First off, I have an ever-changing platform, and one of
the issues I had was that development under the monolithic structure
became problematic and very difficult to merge new elements into. Granted,
the SQL schema was not all that great and a number of areas could have used
improvement, but overall it was more trouble than it was worth.
By breaking out our client content into a schema structure, I've
simplified the management of my data immensely. I can easily delete
old/unused data without having to worry about updating scripts that have
been written to clean out structures. I can give clients greater control
over their own data without worrying about some "security" issue that I
might have missed in my programming, so there's an added level of safety
in an already tight structure. I've "objectified" out the content into an
easily changeable/"update friendly" system of tables, views and functions.
I've reduced a fair amount of overhead by limiting the number of tables. For
instance, with our original monolithic dataset, we have approx 90+ tables
handling all of the content that we use on a regular basis. With this new
structure I've reduced that down to a total of 30 tables, 8 of which are
repeated across the schemas. The method we are using the schemas in also
allows me to work in some assumptions that were tricky to code under the
monolithic structure; with the schema system, the simplicity of the table set
allowed me to simulate those assumptions without having to
re-write code each time I have an update, thus giving me a highly dynamic
dataset. That, and I'm more confident working on a 3D level versus 2D; the
schema levels introduce the third dimension into my structure and simplify
visualizing how I want things to work.
Within that third dimensional structure, it made sense to be able to code
out an SQL statement that would have searched the schema set for matching
patterns to that given in the SQL query, similar to the way one can search
for a pattern on data in a column. But Tom has given me an idea that will
allow me to work out how to search multiple schemas within a dynamic
plpgsql function that figures out all my patterned schemas and executes the
resulting query as he suggested, I just need to learn plpgsql programming
so that I can work out all of the details (Tom, I might bug you about that
once in a while).
At 09:45 PM 11/5/03 -0400, Marc G. Fournier wrote:
<snip>
Chris,
Monolithic tables have their uses, I don't and won't contest that, however
we've been using a monolithic structure in our current AMS distribution and
<snip>
dataset. That and I'm more confident working on a 3D level, versus 2D, the
schema levels introduce the third dimension into my structure and simplify
visualizing how I want things to work.
TMI! TMI! Ok, I believe that it works for you. You've made a tradeoff,
though, and some operations that were simple in the old monolithic structure
will now be more difficult -- however worthwhile the overall tradeoff may be.
resulting query as he suggested, I just need to learn plpgsql programming
so that I can work out all of the details (Tom, I might bug you about that
once in a while).
Just go ahead and post your PL/pgSQL questions to the list. There are people
on this list who've written thousands of lines of PL/pgSQL ...
--
Josh Berkus
Aglio Database Solutions
San Francisco
"Marc G. Fournier" <scrappy@postgresql.org> writes:
Actually, the use of schemas was my idea, to speed up some dreadfully
slow queries dealing with traffic stats from a table that was growing
painfully monolithic ... the idea is/was that it would be easier to
backup/remove all data pertaining to a specific client if they decided to
close their account ...
I would add an <aol> me too </aol> to the comment that this is a horrible idea
and will be an enormous pain. You're denormalizing your data in a way that
will cause you great pain.
What you're trying to implement is called "partitioned tables". And you're
right, bulk loading and deleting is one of the big advantages of partitioned
tables. But Postgres doesn't support partitioned tables, so trying to roll
your own is hard.
Using schemas seems like a bad impedance match here too, as you've found out
with the search_path. Inherited tables are a closer match, but still, a lot of
things won't be automatic, and primary keys may be a problem.
But what most people use as the closest approximation to partitioned tables in
Postgres is partial indexes. They let you keep your indexes to a reasonable
size but still accelerate the poor-selectivity client_id column in parallel.
But you still would be storing all the records in a single table and would
have to do some big vacuuming whenever you delete a whole client. vacuum full
may be your friend.
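As a hedged sketch (the traffic_stats table and its columns are invented
for illustration), one partial index per large client looks like:

```sql
-- index only client 42's rows, kept small and hot
CREATE INDEX traffic_client_42_idx
    ON traffic_stats (logged_at)
    WHERE client_id = 42;

-- the planner can use it for queries constrained the same way:
SELECT * FROM traffic_stats
WHERE client_id = 42
  AND logged_at >= '2003-10-01';
```

Each client's index stays small, but deleting a client still means
deleting rows from the shared table rather than dropping a schema.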
I don't understand why the queries should be any slower dealing with the
normalized data in a single table versus the denormalized data in multiple
tables. The one big exception is any query doing "where client_id = ?", where
the latter allows the use of a sequential scan instead of an index scan.
--
greg