Re: Access restriction

Started by Tom Lane · over 25 years ago · 31 messages · general
#1 Tom Lane
tgl@sss.pgh.pa.us

Jeff Davis <jdavis@wasabimg.com> writes:

I am trying to set up a database system where each user has his/her own
database without access outside of it of any kind.

I think you want "sameuser" in pg_hba.conf.

regards, tom lane

#2 GH
grasshacker@over-yonder.net
In reply to: Tom Lane (#1)

On Wed, Nov 29, 2000 at 12:12:00AM -0500, some SMTP stream spewed forth:

Jeff Davis <jdavis@wasabimg.com> writes:

I am trying to set up a database system where each user has his/her own
database without access outside of it of any kind.

I think you want "sameuser" in pg_hba.conf.

Would you mind elaborating on that a bit for me?
i.e. Where should it go in pg_hba.conf?

I don't think I saw this in the docs...

Thanks.

gh


#3 Gordan Bobic
gordan@freeuk.com
In reply to: Tom Lane (#1)
Database cluster?

Hi!

I apologize if this is documented elsewhere, but I have looked and failed
to find it.

Setup:

I have a database that contains around 50 tables, ranging in size from
several hundred megabytes to several gigabytes, and from several tens
of thousands to several tens of millions of records.

These tables grow with a daily data intake by between 100 and 500 MB per
day.

What I am considering:

I am considering splitting the database into tables residing on separate
machines, and connecting them on one master node.

The question I have is:

1) How can I do this using PostgreSQL? Is there a standard way of attaching
tables from external databases which reside on different machines?

2) Does the query optimizer understand this sort of setup and make
appropriate adjustments to its operation? E.g. select subsets from each
table separately from each remote table, so that selects run on the remote
machine, and then do a multi-table join based on that. This is probably a
rather poor example, as there are much more efficient ways to minimize the
amount of work to be done for this sort of thing, but I am only trying to
illustrate the question.

3) What sort of performance increase can I expect from distributing the
database in this manner? What operations would benefit most?

4) Can I set up a "master node" database which connects all the external
tables, and presents them as if they were local to clients? I would like to
keep using the "cluster" in the same way I use the current system, by
making one machine handle all external requests, and taking care of what
goes where and on which "slave node".

If these questions are answered elsewhere, please point me in the right
direction.

Regards.

Gordan

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: GH (#2)

GH <grasshacker@over-yonder.net> writes:

I think you want "sameuser" in pg_hba.conf.

Would you mind elaborating on that a bit for me?
i.e. Where should it go in pg_hba.conf?
I don't think I saw this in the docs...

It's right in pg_hba.conf:

# Format:
#
# host DBNAME IP_ADDRESS ADDRESS_MASK AUTHTYPE [AUTH_ARGUMENT]
#
# DBNAME is the name of a PostgreSQL database, or "all" to indicate all
# databases, or "sameuser" to restrict a user's access to a database with
# the same name as the user.

One thing this doesn't make too clear is that the "restriction" really
means that this record *matches* if the username and the database name
are the same. Else the postmaster moves on, looking for another
matching record. A possibly useful example:

localhost sameuser ident
localhost all passwd passfile

This would mean that a user would get let into his own database on
the basis of IDENT checking, and would get let into other databases
on the basis of password checking using passwords in $PGDATA/passfile.
Since you'd control the contents of this last file, this would allow
you to grant access to all databases to only selected users.
You want some sort of escape hatch like that, at least for the DBA ---
else he couldn't get into any database but his own, either. Superusers
aren't super as far as the postmaster is concerned... they have to pass
the authentication tests the same as mere mortals.

regards, tom lane

#5 Peter Eisentraut
peter_e@gmx.net
In reply to: Gordan Bobic (#3)
Re: Database cluster?

Gordan Bobic writes:

I am considering splitting the database into tables residing on separate
machines, and connect them on one master node.

The question I have is:

1) How can I do this using PostgreSQL?

You can't.

--
Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/

#6 Steve Wolfe
steve@iboats.com
In reply to: Peter Eisentraut (#5)
Re: Database cluster?

I am considering splitting the database into tables residing on separate
machines, and connect them on one master node.

The question I have is:

1) How can I do this using PostgreSQL?

You can't.

I'll jump in with a bit more info. Splitting tables across multiple
machines would do nothing more than make the entire system run at a snail's
pace. Yes, it would slow it down immensely, because you just couldn't move
data between machines quickly enough.

Why? Well, whenever you join two tables that are on different machines,
the tables have to go across whatever sort of connection you have between
the machines. Even if you use gigabit ethernet, you are still running at a
mere fraction of the bandwidth of the computer's internal bus - and at
orders of magnitude greater latency. You'd have lots of CPU's sitting
around, doing absolutely nothing, waiting for data to come across the wire.

There are alternatives, such as IP-over-SCSI. That reduces the latency
of ethernet quite a bit, and gives you much more bandwidth (say, up to 160
megabytes/second). However, that's still a pittance compared to the main
system bus inside your computer.

That's one of the greatest hurdles to distributed computing. That's why
the applications that are best adapted to distributed computing are those
that don't require much data over the wire - which certainly doesn't apply
to databases. : )
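To put rough numbers on the bandwidth gap described above, here is a
back-of-envelope sketch; the link speeds are ballpark figures of the era,
assumed purely for illustration:

```python
# Rough transfer times for 1 GB of intermediate join data over various
# links. All rates are order-of-magnitude assumptions, not measurements.
links_mb_per_s = {
    "10 Mb ethernet": 1.25,        # 10 Mbit/s ~= 1.25 MB/s
    "gigabit ethernet": 125.0,     # 1 Gbit/s ~= 125 MB/s
    "Ultra160 SCSI (burst)": 160.0,
    "PC100 memory bus": 800.0,     # ~100 MHz x 64 bits, theoretical peak
}

payload_mb = 1024  # 1 GB shipped between nodes during a join

for name, rate in links_mb_per_s.items():
    print(f"{name:22s} {payload_mb / rate:8.1f} s")
```

Even at gigabit speeds the wire adds seconds per gigabyte shipped, which
is the core of the objection.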

steve

#7 Gordan Bobic
gordan@freeuk.com
In reply to: Peter Eisentraut (#5)
Re: Database cluster?

I am considering splitting the database into tables residing on
separate machines, and connect them on one master node.

The question I have is:

1) How can I do this using PostgreSQL?

You can't.

I'll jump in with a bit more info. Splitting tables across multiple
machines would do nothing more than make the entire system run at a
snail's pace. Yes, it would slow it down immensely, because you just
couldn't move data between machines quickly enough.

I don't believe that is the case. In my case, queries typically return
comparatively small amounts of data. Around 100 records at most. The amount
of data that needs to be transferred is comparatively small, and even over
10 Mb ethernet, it would take at most about a second to transfer. This is a
much smaller delay than the query time itself, which can take 10 seconds or
more. Remember that I said there are tables with over 30M records? Doing
multi-table joins on things like this takes a long time...

So, splitting the data in such a way that one table is queried, and then
tables joined from it are queried in parallel, would cause a significant
speed-up.

For example, say we have tables T1, T2 and T3.

T1 has fields F1.1, F1.2 and F1.3. T2 has F2.1 and T3 has F3.1 (plus,
probably, lots of other fields).

Say I want to do
SELECT *
FROM T1, T2, T3
WHERE F1.1 = F2.1 AND F1.2 = F3.1 AND F1.3 = 'somedata';

Then F1.3 could be searched for 'somedata'. When the records are found,
this could be cross-matched remotely, in parallel for F1.1=F2.1 and
F1.2=F3.1, on different machines.

This means that depending on the type, configuration and the usage of the
database, a potentially massive improvement in performance could be
achieved, especially on multi-table joins which span lots of BIG tables.
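The parallel cross-match described above can be sketched in a few lines.
This is only an illustration: sqlite3 in-memory databases stand in for
the remote PostgreSQL nodes, and all table and column names are invented:

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def make_node(table, column, values):
    """Build an in-memory database playing the role of one remote node."""
    db = sqlite3.connect(":memory:", check_same_thread=False)
    db.execute(f"CREATE TABLE {table} ({column} INTEGER)")
    db.executemany(f"INSERT INTO {table} VALUES (?)", [(v,) for v in values])
    return db

t2_node = make_node("t2", "f2_1", [1, 2, 3])
t3_node = make_node("t3", "f3_1", [20, 30, 40])

def cross_match(node, table, column, keys):
    """Ask one node which of the candidate keys it actually holds."""
    marks = ",".join("?" * len(keys))
    rows = node.execute(
        f"SELECT {column} FROM {table} WHERE {column} IN ({marks})", keys)
    return {r[0] for r in rows}

# Hits from the local probe of T1 for F1.3 = 'somedata': (f1_1, f1_2) pairs.
t1_hits = [(1, 20), (2, 25), (9, 30)]

# Cross-match against T2 and T3 in parallel, one thread per remote node.
with ThreadPoolExecutor() as pool:
    f2 = pool.submit(cross_match, t2_node, "t2", "f2_1", [a for a, _ in t1_hits])
    f3 = pool.submit(cross_match, t3_node, "t3", "f3_1", [b for _, b in t1_hits])
    ok2, ok3 = f2.result(), f3.result()

# Keep only the T1 rows whose keys matched on both remote nodes.
result = [h for h in t1_hits if h[0] in ok2 and h[1] in ok3]
print(result)  # [(1, 20)]
```

With real nodes you would open one connection per machine instead; only
the candidate keys and the matches cross the wire, not whole tables.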

Somebody mentioned the fact that postgres uses IPC for communicating
between processes. I think there are tools for clustering (I am not sure if
Mosix supports transparently allowing IPC across nodes) which can work
around that.

Why? Well, whenever you join two tables that are on different machines,
the tables have to go across whatever sort of connection you have between
the machines. Even if you use gigabit ethernet, you are still running at a
mere fraction of the bandwidth of the computer's internal bus - and at
orders of magnitude greater latency. You'd have lots of CPU's sitting
around, doing absolutely nothing, waiting for data to come across the wire.

Gigabit ethernet has around the same bandwidth as the PCI bus. I suppose it
all depends on what machine you have running this. This would be true in
the case that the database server is a nice big Alpha with several CPUs.

There are alternatives, such as IP-over-SCSI. That reduces the latency
of ethernet quite a bit, and gives you much more bandwidth (say, up to 160
megabytes/second). However, that's still a pittance compared to the main
system bus inside your computer.

But SCSI is still 160 MB/s burst (not sustained, unless you're using very
expensive arrays). And gigabit ethernet is around 125 MB/s, albeit with
greater latency.

That's one of the greatest hurdles to distributed computing. That's why
the applications that are best adapted to distributed computing are those
that don't require much data over the wire - which certainly doesn't apply
to databases. : )

I think it depends on whether the amount of data is the problem, or
fitting it together.

Somebody please explain to me further why I am wrong in all this?

Regards.

Gordan

#8 Alistair Hopkins
alistair@berthengron.co.uk
In reply to: Gordan Bobic (#7)
RE: Database cluster?

http://www.openlinksw.com/virtuoso/

would let you do this sort of thing.

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org]On Behalf Of Gordan Bobic
Sent: Thursday, November 30, 2000 9:02 AM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Database cluster?

Show quoted text

#9 Nathan Barnett
nbarnett@cellularphones.com
In reply to: Alistair Hopkins (#8)
RE: Database cluster?

This approach is known as shared-nothing clustering, which is employed by
IBM's DB2 and Microsoft SQL Server 2000. Either of these products will
work in the manner that you are looking for.

---------------------------------------------
Nathan Barnett

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org]On Behalf Of Gordan Bobic
Sent: Thursday, November 30, 2000 4:02 AM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Database cluster?

Show quoted text

#10 Gordan Bobic
gordan@freeuk.com
In reply to: Nathan Barnett (#9)
Re: Database cluster?

Thanks.

I have just had another thought. If all the tables are split across several
computers, this would help as well.

For example, if we have 100 records and 2 database servers, each server
could have 50 of those 100 records on it. When a selection is required,
each server would look through its much smaller database, and report back
the "hits". This would, effectively, provide a near linear speedup in the
query time, while introducing only a minor network overhead (or a major
one, depending on how much data is transferred).

Some extra logic could then be implemented for related tables that would
allow the most closely related records from the different tables to be
"clustered" (as in kind of remotely similar to the CLUSTER command) on the
same server, for faster response time and minimized network usage
requirements. The "vacuum" or "cluster" features could be used overnight to
re-optimize the distribution of records across the servers.

In all this, a "master" node could be used for coordinating the whole
operation. We could ask the master node to do a query, and it would
automatically, knowing what slaves it has, fire off that query on them.
Each slave would then in parallel, execute a query, and return a subset of
the data we were looking for. This data would then be joined into one
recordset before it is returned to the client that requested it.
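A toy version of this scatter/gather flow, with plain Python lists
standing in for the slave databases (the data and the "query" are
invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

# 100 records split round-robin across two "slave" nodes.
records = [{"id": i, "colour": "red" if i % 7 == 0 else "blue"}
           for i in range(100)]
slaves = [records[0::2], records[1::2]]

def run_on_slave(rows, predicate):
    """Each slave scans only its own half of the data."""
    return [r for r in rows if predicate(r)]

def master_query(predicate):
    """The master fans the query out, then merges the partial result sets."""
    with ThreadPoolExecutor() as pool:
        parts = pool.map(run_on_slave, slaves, [predicate] * len(slaves))
    merged = [r for part in parts for r in part]
    return sorted(merged, key=lambda r: r["id"])  # one combined recordset

hits = master_query(lambda r: r["colour"] == "red")
print([r["id"] for r in hits])  # [0, 7, 14, ..., 98]
```

Each slave's scan runs in parallel and touches half the rows, which is
where the near-linear speedup would come from.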

As far as I can see, as long as the amounts of data shifted aren't huge
enough to cause problems with network congestion, and the query time
dominates the data transfer time over the network, this should provide a
rather scalable system. I understand that the form of database clustering
I am mentioning here is fairly rudimentary and unsophisticated, but it
would certainly be a very useful feature.

Are there any plans to implement this sort of functionality in PostgreSQL?
Or is this a lot more complicated than it seems...

Regards.

Gordan

----- Original Message -----
From: "Nathan Barnett" <nbarnett@cellularphones.com>
To: "'Gordan Bobic'" <gordan@freeuk.com>; <pgsql-general@postgresql.org>
Sent: Thursday, November 30, 2000 2:34 PM
Subject: RE: [GENERAL] Database cluster?

Show quoted text

#11 Doug Semig
dougslist@semig.com
In reply to: Gordan Bobic (#10)
Re: Database cluster?

You're almost describing a Teradata DBM.

What an amazing machine! Last I heard about 6 years ago, though, AT&T was
rewriting it as an NT app instead of running on proprietary hardware. The
proprietary hardware was essentially a cluster of 80486 computers (at the
time).

What they had done was implemented a pyramid structure of 80486 computers.
The lowest level of computers had hard disks and stored the data. Two of
the lowest level computers would "report" to a single higher up computer.
Two of these higher up computers would "report" to yet another single
higher up computer until there was only one higher up computer to report to.

The thing that impressed me the most about this architecture was that
sorting was practically built in. So all the intermediary computers had to
do was merge the sorted result sets from their lower-level computers. Blazing!
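The intermediary's job here is just a k-way merge of already-sorted
streams. In Python terms (the result sets are invented):

```python
import heapq

# Sorted result sets as two leaf nodes might return them.
node_a = [3, 9, 14, 27]
node_b = [1, 9, 30]

# heapq.merge combines k sorted inputs in O(n log k) without re-sorting.
merged = list(heapq.merge(node_a, node_b))
print(merged)  # [1, 3, 9, 9, 14, 27, 30]
```

That merge is cheap and streaming, which is why the pyramid could keep
every level busy.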

And data was stored on a couple of leaf-level computers for redundancy.

I miss that old beast. But I certainly cannot afford the multimillion
dollars required to get one for myself. We lovingly called the one we
worked with the "Pteradactyl," which is the old name for that bird-like
dinosaur (evidently there's a new word for the bird-like dinosaur, the
pteronodon or something?).

Doug

At 02:44 PM 11/30/00 -0000, Gordan Bobic wrote:

Show quoted text


#12 Gordan Bobic
gordan@freeuk.com
In reply to: Nathan Barnett (#9)
Re: Database cluster?

You're almost describing a Teradata DBM.

I knew someone must have thought of it before. ;-)

[snip]

The thing that impacted me the most about this architecture was that
sorting was practically built in. So all the intermediary computers had
to do was merge the sorted result sets from its lower level computers.
Blazing!

They effectively implemented a binary tree in hardware. One hell of an
indexing mechanism. :-)

I miss that old beast. But I certainly cannot afford the multimillion
dollars required to get one for myself.

I suppose it would depend on how many computers you want to have in this
cluster. The main reason why clusters are getting popular recently (albeit
not yet for databases, or so it would seem) is because it is cheaper than
anything else with similar performance.

The main question remains - are there any plans to implement something
similar to this with PostgreSQL? I would volunteer to help with some
coding, if a "group" was formed to work on this "clustering" module.

Regards.

Gordan

#13 Doug Semig
dougslist@semig.com
In reply to: Gordan Bobic (#12)
Re: Database cluster?

I actually analyzed it once. I came to the conclusion that to do it right
it would be easier to make an almost entirely new db but use the same
external interfaces as PostgreSQL. To do a kludge of it, one might just
implement a tier that sits between the user and a bunch of standard
PostgreSQL backends.

It'd make a neat companion project, though. Like PG/Enterprise or
PG/Warehouse or something.

Doug

At 04:02 PM 11/30/00 -0000, Gordan Bobic wrote:

Show quoted text


#14 Alain Toussaint
nailed@videotron.ca
In reply to: Gordan Bobic (#7)
Re: Database cluster?

Somebody mentioned the fact that postgres uses IPC for communicating
between processes. I think there are tools for clustering (I am not sure if
Mosix supports transparently allowing IPC across nodes) which can work
around that.

one of those tools is distributed ipc <http://wallybox.cei.net/dipc/> but
it only works with Linux, AFAIK; the software there is just a patch to the
Linux kernel and a daemon.

Alain

#15 Peter Korsgaard
jacmet@control.auc.dk
In reply to: Doug Semig (#13)
Re: Database cluster?

On Thu, 30 Nov 2000, Doug Semig wrote:

I actually analyzed it once. I came to the conclusion that to do it right
it would be easier to make an almost entirely new db but use the same
external interfaces as PostgreSQL. To do a kludge of it, one might just
implement a tier that sits between the user and a bunch of standard
PostgreSQL backends.

It'd make a neat companion project, though. Like PG/Enterprise or
PG/Warehouse or something.

I'm currently developing a simple version of such a system as a
university project. It is a fairly simple approach with a proxy or a
distributor in front of a bunch of standard PostgreSQL database servers.

The proxy monitors and forwards the requests from the clients to the
database servers. If it is a read-only request, the query is forwarded to
the database server currently experiencing the lowest load/most free
memory; otherwise it is sent to all database servers.

This approach obviously only performs well in systems with a high ratio of
read-only queries, such as search engines and so on.
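The routing rule Peter describes might look like the sketch below. The
Backend stubs and the load counter are assumptions for illustration; a
real proxy would hold one connection per server, monitor actual load,
and parse SQL properly:

```python
import re

class Backend:
    """Stub for one PostgreSQL server behind the proxy."""
    def __init__(self, name):
        self.name, self.load = name, 0
    def execute(self, sql):
        self.load += 1                  # crude stand-in for load monitoring
        return f"{self.name}: {sql}"

backends = [Backend("pg1"), Backend("pg2"), Backend("pg3")]

def is_read_only(sql):
    # Naive classification; real SQL would need a proper parser.
    return re.match(r"\s*select\b", sql, re.IGNORECASE) is not None

def route(sql):
    if is_read_only(sql):
        target = min(backends, key=lambda b: b.load)    # least-loaded server
        return [target.execute(sql)]
    return [b.execute(sql) for b in backends]           # writes go everywhere

route("SELECT * FROM users")                  # served by a single backend
replies = route("UPDATE users SET active = 1")
print(len(replies))  # 3: the write hit every backend
```

Broadcasting every write is what keeps the replicas in step, and is also
why the scheme only pays off for read-heavy workloads.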

--
Bye, Peter Korsgaard

#16 Steve Wolfe
steve@iboats.com
In reply to: Peter Korsgaard (#15)
Re: Database cluster?

The proxy monitors and forwards the requests from the clients to the
database servers. If a request is read-only, the query is forwarded to
the database server currently experiencing the lowest load/most free
memory; otherwise it is sent to all database servers.

This approach obviously only performs well in systems with a high ratio of
read-only queries, such as search engines and so on.

The tough part is synchronization, should one of the machines drop out of
the cluster and need to be re-added without bringing the others down. In
order to get around that, each query needs to be logged on the master node
with a timestamp, so that the failed node can "catch up" in real time. That
brings up other considerations as well....
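A minimal sketch of such a catch-up log (Python; the class and method names
are invented, and a logical sequence number stands in for the timestamp to
keep the example deterministic):

```python
import itertools

class WriteLog:
    """Hypothetical master-side log of write queries, so that a node which
    dropped out of the cluster can replay everything it missed."""

    def __init__(self):
        self._seq = itertools.count(1)
        self.entries = []            # list of (sequence number, sql)

    def record(self, sql):
        """Log a write and return its sequence number."""
        seq = next(self._seq)
        self.entries.append((seq, sql))
        return seq

    def replay_after(self, last_seen):
        """Writes a rejoining node missed: everything after `last_seen`."""
        return [sql for seq, sql in self.entries if seq > last_seen]

log = WriteLog()
last_seen = log.record("INSERT INTO t VALUES (1)")  # node fails after this
log.record("INSERT INTO t VALUES (2)")
print(log.replay_after(last_seen))  # -> ['INSERT INTO t VALUES (2)']
```

The remaining considerations are exactly the hard ones: trimming the log,
and making sure the rejoining node sees no new writes while it replays.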

steve

#17Peter Korsgaard
jacmet@control.auc.dk
In reply to: Steve Wolfe (#16)
Re: Database cluster?

On Thu, 30 Nov 2000, Steve Wolfe wrote:

The tough part is synchronization, should one of the machines drop out of
the cluster and need to be re-added without bringing the others down. In
order to get around that, each query needs to be logged on the master node
with a timestamp, so that the failed node can "catch up" in real time. That
brings up other considerations as well....

Exactly! That's also why I have decided not to implement that feature, as
the deadline is in 3 weeks ;) If a database server fails, it is simply
discarded from the cluster and can only be reconnected by taking the
system offline and doing the synchronization manually.

--
Bye, Peter Korsgaard

#18Valter Mazzola
txian@hotmail.com
In reply to: Peter Korsgaard (#17)
Re: Database cluster?

I've successfully patched Linux kernel 2.2.17 with DIPC and modified
PostgreSQL's source (src/backend/storage/ipc/ipc.c) to create distributed
shared memory and semaphores.

The strategy is then to start one postgres that creates the shm and sem on
ONE machine, then start the other postgres processes on the other machines
in the cluster so that they create NO shared structures (there is a
command-line flag to do this).

Then you can connect to any of the postgres processes in your cluster, for
example round-robin.

Another issue is the data files; GFS seems promising. But PostgreSQL uses
fcntl, and GFS (globalfilesystem.org) doesn't support it yet.
A distributed filesystem with locking etc. is required. Ideas?

Another issue is that DIPC doesn't have a failover mechanism.

This is a shared-everything approach. It's not the best, but it's probably
the quickest (if crude) solution to implement, with only a few (4-5) lines
of modification to the PostgreSQL sources.

This system can give a sort of single-system-image, useful to distribute
other software beyond postgresql.

Also Mariposa (http://s2k-ftp.cs.berkeley.edu:8000/mariposa/) seems
interesting, but it's not maintained and it targets an old PostgreSQL
version.

hoping for clustering...
valter mazzola.


#19Tim Kientzle
kientzle@acm.org
In reply to: Valter Mazzola (#18)
Re: Database cluster?

I am considering splitting the database into
tables residing on separate machines, and connect
them on one master node.

Splitting tables across multiple machines would do
nothing more than make the entire system run at a
snail's pace . . . because you just couldn't move
data between machines quickly enough.

In my case, queries typically return ... [a]round 100
records at most. [E]ven over 10 Mb ethernet, it would
take at most about a second to transfer. This is a
much smaller delay than the query time itself, which
can take 10 seconds or more.

So, splitting the data in such a way that one table is
queried, and then tables joined from it are queried in
parallel, would cause a significant speed-up.

Then do exactly that: run separate PostgreSQL databases on
multiple machines and build a data abstraction layer that
does the join manually. If there really are only small
numbers of rows, then it's just as fast to transfer the
data to your application machine as to transfer it all
to a common PostgreSQL machine and then to your application.
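A rough sketch of that manual-join layer (Python; two in-memory SQLite
databases stand in for the separate PostgreSQL servers, and all table and
column names are invented for illustration):

```python
import sqlite3

# Two in-memory SQLite databases play the role of two PostgreSQL servers,
# each holding one of the tables to be joined.
orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER)")
orders_db.executemany("INSERT INTO orders VALUES (?, ?)",
                      [(1, 10), (2, 11)])

customers_db = sqlite3.connect(":memory:")
customers_db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
customers_db.executemany("INSERT INTO customers VALUES (?, ?)",
                         [(10, "Alice"), (11, "Bob")])

def manual_join():
    """Query each server separately, then join the small result sets
    in the application layer."""
    orders = orders_db.execute(
        "SELECT id, customer_id FROM orders").fetchall()
    ids = [cid for _, cid in orders]
    placeholders = ",".join("?" * len(ids))
    names = dict(customers_db.execute(
        f"SELECT id, name FROM customers WHERE id IN ({placeholders})",
        ids).fetchall())
    return [(oid, names[cid]) for oid, cid in orders]

print(manual_join())  # -> [(1, 'Alice'), (2, 'Bob')]
```

With result sets of around 100 rows, as described above, the second query's
IN list stays small enough that this stays cheap.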

If you really need application transparency, then things get
a bit uglier; it shouldn't be too hard to build your own
middleware layer that lets you treat the data storage as
a single entity.

- Tim

#20Gordan Bobic
gordan@freeuk.com
In reply to: Nathan Barnett (#9)
Re: Database cluster?

I actually analyzed it once. I came to the conclusion that to do it right
it would be easier to make an almost entirely new db but use the same
external interfaces as PostgreSQL.

I admit that I am not really too up-to-date on database theory, but I am a
bit surprised at that...

To do a kludge of it, one might just implement a tier that sits between the
user and a bunch of standard PostgreSQL backends.

That is precisely what I was thinking about. There would have to be a
"master" node that controls what goes where and distributes the load. This
"shouldn't" be too difficult (although I am not totally sure what I mean by
that). The nasty bit would probably be hacking the optimizer, the SQL
command "CLUSTER", and VACUUM to take account of and efficiently use all
the extra room for improving the performance.

Automating a "near-optimal" distribution of tables across machines could be
a bit of a difficult problem from a theory side, but it ought to be
possible. There are several options here.

One could just put one table on each server, which is unlikely to be all
that beneficial, although in a multi-table join, you'd want to search the
smallest tables first.

Then, there's the option of just splitting each table across multiple
machines. There is also the possibility of having some records overlap
between machines, if the on-line optimizer decides that would be useful
for performance, with the syncing sorted out somehow.

Or, one could set up an even more sophisticated system where only the
tables and data that would benefit from being together would be on the same
server, so there could be a section of two tables on one server, the rest
of those two tables and a section of another table on another server, etc.
Basically, make both the table and record allocations completely dynamic
between the servers.

I am not sure how useful each of these splits would be, but it is certainly
something well worth exploring theoretically before the actual
implementation, because I reserve the right to be wrong in thinking that
any of these methods would produce an actual improvement in performance.
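The first option, placing whole tables on servers, could at least be
sketched as a greedy, size-balanced assignment (Python; the table names
and sizes are invented for illustration, and real placement would also
have to weigh which tables are joined together):

```python
import heapq

def place_tables(table_sizes, n_servers):
    """Assign whole tables to servers greedily, largest table first,
    always onto the currently least-loaded server."""
    heap = [(0, s) for s in range(n_servers)]  # (bytes assigned, server)
    heapq.heapify(heap)
    placement = {}
    for table, size in sorted(table_sizes.items(),
                              key=lambda kv: kv[1], reverse=True):
        used, server = heapq.heappop(heap)
        placement[table] = server
        heapq.heappush(heap, (used + size, server))
    return placement

# Hypothetical table sizes in MB.
sizes = {"intake": 5000, "archive": 3000, "lookup": 200, "log": 2900}
print(place_tables(sizes, 2))
# -> {'intake': 0, 'archive': 1, 'log': 1, 'lookup': 0}
```

Even this naive version keeps the two servers within a few hundred MB of
each other; the interesting (and hard) part is making the assignment
dynamic, as discussed above.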

And, of course, there would be the bit of getting the optimizer and partial
replication to work properly across servers, which may not be an easy task.

It'd make a neat companion project, though. Like PG/Enterprise or
PG/Warehouse or something.

I agree. It would be really neat. Something like Mosix, but for databases.
And it just sounds like something that would be really useful for large
databases, especially as we start reaching steep part of the
price/performance curve for database servers.

Regards.

Gordan

At 04:02 PM 11/30/00 -0000, Gordan Bobic wrote:

You're almost describing a Teradata DBM.

I knew someone must have thought of it before. ;-)

[snip]

The thing that impacted me the most about this architecture was that
sorting was practically built in. So all the intermediary computers had to
do was merge the sorted result sets from their lower-level computers.
Blazing!

They effectively implemented a binary tree in hardware. One hell of an
indexing mechanism. :-)

I miss that old beast. But I certainly cannot afford the multimillion
dollars required to get one for myself.

I suppose it would depend on how many computers you want to have in this
cluster. The main reason why clusters are getting popular recently (albeit
not yet for databases, or so it would seem) is that they are cheaper than
anything else with similar performance.

The main question remains - are there any plans to implement something
similar to this with PostgreSQL? I would volunteer to help with some
coding, if a "group" was formed to work on this "clustering" module.

Regards.

Gordan

#21Gordan Bobic
gordan@freeuk.com
In reply to: Peter Korsgaard (#15)
#22Peter Korsgaard
jacmet@control.auc.dk
In reply to: Gordan Bobic (#21)
#23Gordan Bobic
gordan@freeuk.com
In reply to: Peter Korsgaard (#15)
#24Gordan Bobic
gordan@freeuk.com
In reply to: Peter Korsgaard (#17)
#25Gordan Bobic
gordan@freeuk.com
In reply to: Valter Mazzola (#18)
#26Gordan Bobic
gordan@freeuk.com
In reply to: Tim Kientzle (#19)
#27Gordan Bobic
gordan@freeuk.com
In reply to: Peter Korsgaard (#22)
#28Dave Smith
dave@candata.com
In reply to: Valter Mazzola (#18)
#29Valter Mazzola
txian@hotmail.com
In reply to: Dave Smith (#28)
#30Gordan Bobic
gordan@freeuk.com
In reply to: Valter Mazzola (#29)
#31Alain Toussaint
nailed@videotron.ca
In reply to: Gordan Bobic (#25)