postgresql clustering

Started by Rafik Salama | over 20 years ago | 19 messages | hackers

rafikamir@gmail.com

over 20 years ago

Dear Sirs

I know that PostgreSQL can be configured for high availability in a
clustered environment using pgcluster. For my master's degree I am currently
studying clustering with MPI, OpenMP, PVM and other packages, and I have to
do a project, so I was thinking of using this opportunity to start
implementing clustering for PostgreSQL with one of the above packages.

What do you think?

Thanks

Rafik Salama

Systems Architect

CIT Global

CIT Building, Free Zone

Nasr City,

P.O.Box 11816, Cairo, Egypt

Tel : +202 271 8794 (ext. 115)

Fax : +202 2748335

Cell: +2010 5410035

http://www.citglobal.com

David Fetter

david@fetter.org

over 20 years ago

In reply to: Rafik Salama (#1)

Re: postgresql clustering

On Wed, Sep 21, 2005 at 08:01:08PM +0300, Rafik Salama wrote:

Dear Sirs

I know that PostgreSQL can be configured for high availability
over a clustered environment using pgcluster,

Do you have a case study showing this?

I am currently studying in my masters the clustering using MPI and
OpenMP, PVM and others packages and I have to do a project, so I was
thinking to use this opportunity to start implementing the
clustering over postgresql using any of the above packages.

What do you think?

Let a thousand schools of thought contend. Let a hundred flowers
bloom.

Cheers,
D
--
David Fetter david@fetter.org http://fetter.org/
phone: +1 510 893 6100 mobile: +1 415 235 3778

Remember to vote!

Aly Dharshi

aly.dharshi@telus.net

over 20 years ago

In reply to: David Fetter (#2)

Re: postgresql clustering

I think it's a great idea to give it a shot. Maybe you can present a
proposal to the list describing how you wish to go about it. There may be
experts on the list who can give you some input and direction.

Aly.

--
Aly Dharshi
aly.dharshi@telus.net

"A good speech is like a good dress
that's short enough to be interesting
and long enough to cover the subject"

Rafik Salama

rafikamir@gmail.com

over 20 years ago

In reply to: David Fetter (#2)

Re: postgresql clustering

No, I do not have a case study; I only read that it was possible. What I am
suggesting is that, if there is no cluster implementation that gives high
availability for the database, I will start this project using message
passing. The university already has a cluster of 19 Intel Xeon machines,
which you can see at this URL:
http://www.cs.aucegypt.edu/~cluster

Anyway, I was just asking so as not to reinvent the wheel in case something
like that already exists. Since it does not, I will give it a try. At the
end of the day it is open source, I can do anything, and if it happens to
work, who knows!

Thanks

Rafik Salama

Jonah H. Harris

jonah.harris@gmail.com

over 20 years ago

In reply to: Rafik Salama (#4)

Re: postgresql clustering

In the past couple of years I've worked on several personal and business
projects to cluster PostgreSQL and InnoDB (without MySQL). I've tested
shared-nothing, shared-memory, and shared-disk models. IMHO, shared-disk is
the only viable option for performance and/or large production business
environments. Shared-memory and shared-nothing architectures are fine for
high availability, but from a business-case standpoint they are an expensive
way to gain performance. I'd be happy to share any of my clustering
knowledge with you offline. Have fun!

--
Respectfully,

Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/

Daniel Duvall

the.liberal.media@gmail.com

over 20 years ago

In reply to: Jonah H. Harris (#5)

Re: postgresql clustering

Jonah,

I stumbled on this discussion in one of my recurring searches for an
open-source database app capable of true clustering (failover, load
balancing, etc) that I can pair with my PHP application. A search
that, sadly, most often ends in disappointment -- there's tons and tons
of database marketing BS out there.

Part of my frustration is due to my lack of a real understanding of the
models you mentioned in your comment. I've been searching for
meaningful text and comparisons of the different clustering models, but
have yet to find anything that truly breaks them down well (and in depth).

Could you perhaps point me -- and anyone else that happens upon this
post with the same frustrations -- in the right direction?

I've looked at PostgreSQL and EnterpriseDB, but I can't find anything
definitive as far as clustering capabilities. What kinds of projects
are there for clustering PgSQL, and are any of them mature enough for
commercial apps?

Best,
Dan

Gaetano Mendola

mendola@bigfoot.com

over 20 years ago

In reply to: Daniel Duvall (#6)

Re: postgresql clustering

Daniel Duvall wrote:

I've looked at PostgreSQL and EnterpriseDB, but I can't find anything
definitive as far as clustering capabilities. What kinds of projects
are there for clustering PgSQL, and are any of them mature enough for
commercial apps?

As you well know, "clustering" means everything and nothing at the same time.
We have a commercial failover cluster provided by Red Hat, with Postgres
running on it. Postgres is installed on both nodes and the data is stored
on a SAN; only one instance of Postgres runs at a time, on one of the two
nodes. In the last 2 years we had one failure, and the service relocation
worked as expected.

Also consider that applications need to behave well: they should "try" to
close the current connection and then retry opening a new one for a while.
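
A minimal sketch of the client behaviour described above: the helper first tries to close the stale connection, then retries opening a new one for a bounded time. The function name, the `connect` and `close_old` callables, and the attempt and delay values are hypothetical placeholders, not part of any PostgreSQL driver.

```python
import time


def connect_with_retry(connect, close_old=None, attempts=10, delay=1.0,
                       sleep=time.sleep):
    """Drop the stale connection, then keep trying to open a new one.

    `connect` and `close_old` are caller-supplied callables (hypothetical);
    in a real application they would wrap the database driver in use.
    """
    if close_old is not None:
        try:
            close_old()  # "try" to close: the failed node may already be gone
        except Exception:
            pass  # a failed close is expected during a failover

    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except Exception as exc:  # assumption: narrow this to the driver's error type
            last_error = exc
            if attempt < attempts - 1:
                sleep(delay)  # give the service time to relocate
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

The retry window (`attempts * delay`) would need to be at least as long as the time the cluster takes to relocate the service.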

Regards
Gaetano Mendola

Joshua D. Drake

jd@commandprompt.com

over 20 years ago

In reply to: Gaetano Mendola (#7)

Re: postgresql clustering

Gaetano Mendola wrote:

Daniel Duvall wrote:

I've looked at PostgreSQL and EnterpriseDB, but I can't find anything
definitive as far as clustering capabilities. What kinds of projects
are there for clustering PgSQL, and are any of them mature enough for
commercial apps?

Are you looking for clustering or replication? There are two very
popular replication solutions: Slony-I and Mammoth Replicator.

Slony-I is an external replication solution; Mammoth Replicator is a
complete PostgreSQL + Replication solution.

Sincerely,

Joshua D. Drake

--
Your PostgreSQL solutions company - Command Prompt, Inc. 1.800.492.2240
PostgreSQL Replication, Consulting, Custom Programming, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/

Daniel Duvall

the.liberal.media@gmail.com

over 20 years ago

In reply to: Gaetano Mendola (#7)

Re: postgresql clustering

While "clustering" in some circles may be an open-ended buzzword --
mainly the commercial DB marketing crowd -- there are concepts beneath
the bull that are even inherent in the name. However, I understand
your point.

From what I've researched, the concepts and practices seem to fall
under one of two abstract categorizations: fail-over (ok...
high-availability), and parallel execution (high-performance... sure).
While some consider the implementation of only one of these to qualify
a cluster, others seem to demand that a "true" cluster must
implement both.

What I'm really after is a DB setup that does fail-over and parallel
execution. Your setup sounds like it would gracefully handle the
former, but cannot achieve the latter. Perhaps I'm simply asking too
much of a free software setup.

Thanks for your response.

#10

Tino Wildenhain

tino@wildenhain.de

over 20 years ago

In reply to: Daniel Duvall (#9)

Re: postgresql clustering

Daniel Duvall wrote:

While "clustering" in some circles may be an open-ended buzzword --
mainly the commercial DB marketing crowd -- there are concepts beneath
the bull that are even inherent in the name. However, I understand
your point.

From what I've researched, the concepts and practices seem to fall
under one of two abstract categorizations: fail-over (ok...
high-availability), and parallel execution (high-performance... sure).

Well, I don't know why many people believe parallel execution
automatically means high performance. Actually, most of the time
the performance is much worse this way.
If your dataset remains static and you only do read-only
requests, you get higher performance through load balancing.
If, however, you make changes to the data, each change has to
be propagated to all nodes, which in fact costs performance.
How much depends heavily on the link speed between the nodes.
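
A rough back-of-the-envelope model of this trade-off, as a sketch only. It assumes every node has the same capacity, each read is served by a single node, and each write must be applied on every node; link latency is ignored entirely, so real numbers would be worse. The function name and parameters are hypothetical.

```python
def cluster_throughput(nodes, node_capacity, read_fraction):
    """Estimate client operations per second for a fully replicated cluster.

    A read costs one unit of work on one node; a write costs one unit of
    work on every node, because the change is propagated everywhere.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    # Average work units that one client operation consumes across the cluster.
    cost_per_op = read_fraction + (1.0 - read_fraction) * nodes
    return nodes * node_capacity / cost_per_op


if __name__ == "__main__":
    for fraction in (1.0, 0.9, 0.5, 0.0):
        print(fraction, cluster_throughput(4, 1000, fraction))
```

With a read-only workload the model scales linearly with the number of nodes; with a write-only workload the whole cluster is no faster than a single node.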

While some consider the implementation of only one of these to qualify
a cluster, others seem to demand that a "true" cluster must
implement both.

What I'm really after is a DB setup that does fail-over and parallel
execution. Your setup sounds like it would gracefully handle the
former, but cannot achieve the latter. Perhaps I'm simply asking too
much of a free software setup.

Commercial vendors aren't much better here - they just don't tell you :-)
There are pgpool and SQLRelay, for example, if you want to parallelize
requests. You can combine them with the various replication mechanisms
also available for PG and get what you want - and, most important,
get what's possible. Nobody can trick the math :-)

Greets
Tino

#11

Jonah H. Harris

jonah.harris@gmail.com

over 20 years ago

In reply to: Tino Wildenhain (#10)

Re: postgresql clustering

On 9/29/05, Tino Wildenhain <tino@wildenhain.de> wrote:

Well, I don't know why many people believe parallel execution
automatically means high performance. Actually, most of the time
the performance is much worse this way.
If your dataset remains static and you only do read-only
requests, you get higher performance through load balancing.
If, however, you make changes to the data, each change has to
be propagated to all nodes, which in fact costs performance.
How much depends heavily on the link speed between the nodes.

I think you should clarify that the type of clustering you're discussing is
the "shared-nothing" model, which is most prevalent in open-source
databases. Shared-disk and shared-memory clustered systems do not have the
"propagation" issue but do have others (a distributed lock manager, etc.).
Don't make blind statements. If you want more information about "real-world"
clustering, read the research for DB2 (Mainframe) and Oracle RAC.

--
Respectfully,

Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/

#12

Luke Lonergan

llonergan@greenplum.com

over 20 years ago

In reply to: Jonah H. Harris (#11)

Re: postgresql clustering

Daniel,

From what I've researched, the concepts and practices seem to fall
under one of two abstract categorizations: fail-over (ok...
high-availability), and parallel execution (high-performance... sure).
While some consider the implementation of only one of these to qualify
a cluster, others seem to demand that a "true" cluster must
implement both.

If you want to get a high degree of parallelism, tens or hundreds of machines are required. At that size, you must have fault tolerance to make the system usable.

What I'm really after is a DB setup that does fail-over and parallel
execution. Your setup sounds like it would gracefully handle the
former, but cannot achieve the latter. Perhaps I'm simply asking too
much of a free software setup.

We've spent the last 3 years developing a parallel database that does both and I can tell you that it takes a huge development effort to get it right for the general audience. Bizgres MPP is capable of handling ANSI SQL, is ACID compliant and scales to tens of terabytes, but it's not free (sorry about that). It is tons cheaper than Oracle or Teradata though, and it's based on Postgres.

- Luke

#13

Gaetano Mendola

mendola@bigfoot.com

over 20 years ago

In reply to: Daniel Duvall (#9)

Re: postgresql clustering

Also consider PITR and some work I did last year:
http://archives.postgresql.org/pgsql-admin/2005-06/msg00013.php

With PITR you can have one or more remote machines that
continuously replay the log from the main server, and if the main server
crashes, the "mirrors" can come out of their replay and go "on line".
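
A minimal sketch of how such a log-replaying mirror could be driven, not the script from the linked post. It assumes the standby's `restore_command` (which PostgreSQL invokes with `%f` for the segment name and `%p` for the destination path) calls a helper that blocks until the requested WAL segment arrives, and that a non-zero exit ends recovery so the mirror goes on line. The function name, the trigger-file convention, and the directory layout are hypothetical, and the sketch ignores partially copied segments.

```python
import os
import shutil
import time


def restore_segment(archive_dir, segment, dest, trigger_file, poll=1.0,
                    sleep=time.sleep):
    """Wait for `segment` to appear in `archive_dir`, then copy it to `dest`.

    Returns 0 once the segment has been copied (recovery continues), or 1 if
    the hypothetical trigger file shows up first (recovery should end).
    """
    source = os.path.join(archive_dir, segment)
    while True:
        # Check the archive first so every shipped segment is replayed
        # before the mirror is allowed to come on line.
        if os.path.exists(source):
            shutil.copy(source, dest)
            return 0
        if os.path.exists(trigger_file):
            return 1
        sleep(poll)
```

Failing over would then amount to creating the trigger file on the mirror once the main server is known to be down.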

At that time it was not possible to connect to a "replaying" engine
to perform (at least) queries; I don't know if this changed in 8.1.

BTW, did anyone take that idea further? If not, I'd like to rewrite
that stuff in C (I do prefer C++).

Regards
Gaetano Mendola

#14

Tino Wildenhain

tino@wildenhain.de

over 20 years ago

In reply to: Jonah H. Harris (#11)

Re: postgresql clustering

Jonah H. Harris wrote:

On 9/29/05, Tino Wildenhain <tino@wildenhain.de> wrote:

Well, I don't know why many people believe parallel execution
automatically means high performance. Actually, most of the time
the performance is much worse this way.
If your dataset remains static and you only do read-only
requests, you get higher performance through load balancing.
If, however, you make changes to the data, each change has to
be propagated to all nodes, which in fact costs performance.
How much depends heavily on the link speed between the nodes.

I think you should clarify that the type of clustering you're discussing
is the "shared-nothing" model, which is most prevalent in open-source
databases. Shared-disk and shared-memory clustered systems do not have
the "propagation" issue but do have others (a distributed lock manager,
etc.). Don't make blind statements. If you want more information about
"real-world" clustering, read the research for DB2 (Mainframe) and
Oracle RAC.

No, that's not a blind statement ;) It does not matter how the
information is technically shared - shared memory must be
copied or accessed over network links if you have more than
one independent system. Locks are information too, so the
same constraints apply.

So no matter how you label the problem, the basic constraints
(read: communication and synchronisation overhead) will remain.

Custom solutions can circumvent some of the problems if you
can shift the problem area (e.g. have some read-only areas,
some seldom-written areas, and some data that is written often,
seldom read and not immediately propagated).

#15

Daniel Duvall

the.liberal.media@gmail.com

over 20 years ago

In reply to: Luke Lonergan (#12)

Re: postgresql clustering

Thanks for your reply Luke.

Bizgres looks like a very promising project. I'll be sure to follow
it.

Thanks to everyone for their comments. I'm starting to understand the
truth behind the hype and where these performance gains and hits stem
from.

-Dan

#16

Daniel Duvall

the.liberal.media@gmail.com

over 20 years ago

In reply to: Jonah H. Harris (#11)

Re: postgresql clustering

What about clustered filesystems? At first blush I would think the
overhead of something like GFS might kill performance. Could one
potentially achieve a fail-over config using multiple nodes with GFS,
each having their own instance of PostgreSQL (but only one running at
any given moment)?

Best,
Dan

#17

Trent Shipley

tshipley@deru.com

over 20 years ago

In reply to: Daniel Duvall (#9)

Fwd: Re: postgresql clustering

What is the relationship between database support for clustering and grid
computing and support for distributed databases?

Two-phase COMMIT is coming in 8.1. What effect will this have in promoting
FOSS grid support or distribution solutions for PostgreSQL?
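
For readers unfamiliar with the protocol, here is a minimal in-memory sketch of two-phase commit. It is an illustration only: the `Participant` class and `two_phase_commit` function are hypothetical stand-ins, not a PostgreSQL client. In PostgreSQL 8.1 the corresponding SQL commands are `PREPARE TRANSACTION`, `COMMIT PREPARED` and `ROLLBACK PREPARED`; the coordinator logic still has to live outside the database.

```python
class Participant:
    """Stand-in for one database node (hypothetical, in-memory only)."""

    def __init__(self, name, can_prepare=True):
        self.name = name
        self.can_prepare = can_prepare
        self.state = "idle"

    def prepare(self, gid):
        # Phase 1: promise that transaction `gid` can be committed later.
        # Roughly what PREPARE TRANSACTION 'gid' asks of a real node.
        if not self.can_prepare:
            self.state = "aborted"
            return False
        self.state = "prepared"
        return True

    def commit(self, gid):
        # Phase 2, success path: comparable to COMMIT PREPARED 'gid'.
        self.state = "committed"

    def rollback(self, gid):
        # Phase 2, failure path: comparable to ROLLBACK PREPARED 'gid'.
        self.state = "aborted"


def two_phase_commit(gid, participants):
    """Commit on all participants, or on none of them."""
    prepared = []
    for node in participants:
        if node.prepare(gid):
            prepared.append(node)
        else:
            # One refusal is enough: undo everything prepared so far.
            for done in prepared:
                done.rollback(gid)
            return False
    for node in prepared:
        node.commit(gid)
    return True
```

The sketch leaves out the hard part of any real deployment, namely what happens when the coordinator itself fails between the two phases.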

#18

Luke Lonergan

llonergan@greenplum.com

over 20 years ago

In reply to: Daniel Duvall (#16)

Re: postgresql clustering

Dan,

On 9/29/05 3:23 PM, "Daniel Duvall" <the.liberal.media@gmail.com> wrote:

What about clustered filesystems? At first blush I would think the
overhead of something like GFS might kill performance. Could one
potentially achieve a fail-over config using multiple nodes with GFS,
each having their own instance of PostgreSQL (but only one running at
any given moment)?

Interestingly - my friend Matt O'Keefe built GFS at UMN, I was one of his
first customers/sponsors of the research in 1998 when I implemented an
8-node shared disk cluster on Alpha Linux using GFS and Fibre Channel.

Again - it depends on what you're doing - if it's OLTP, you will spend too
much time in lock management for disk access, and things like Oracle RAC's
Cache Fusion become critical to reduce the number of times you have to hit
disks. For warehousing/sequential scans, this kind of clustering is
irrelevant.

- Luke

#19

Hans-Jürgen Schönig

postgres@cybertec.at

over 20 years ago

In reply to: Luke Lonergan (#18)

Re: postgresql clustering

Luke Lonergan wrote:

Dan,

On 9/29/05 3:23 PM, "Daniel Duvall" <the.liberal.media@gmail.com> wrote:

What about clustered filesystems? At first blush I would think the
overhead of something like GFS might kill performance. Could one
potentially achieve a fail-over config using multiple nodes with GFS,
each having their own instance of PostgreSQL (but only one running at
any given moment)?

Interestingly - my friend Matt O'Keefe built GFS at UMN, I was one of his
first customers/sponsors of the research in 1998 when I implemented an
8-node shared disk cluster on Alpha Linux using GFS and Fibre Channel.

Again - it depends on what you're doing - if it's OLTP, you will spend too
much time in lock management for disk access, and things like Oracle RAC's
Cache Fusion become critical to reduce the number of times you have to hit
disks.

Hitting the disk is really bad. However, we have seen that consulting
the network for small portions of data (e.g. locks) is even more
critical. You will see that the CPU on all nodes is running at 1% or so
while the nodes wait for data to be exchanged over the network (latency) -
this is the real problem.
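
A small arithmetic sketch of why latency, not CPU, becomes the limit. It assumes a single session that performs its network round trips strictly one after another; the function names and the example figures (10 lock round trips of 0.2 ms each, 20 microseconds of CPU work) are hypothetical, chosen only to illustrate the effect.

```python
def max_tx_per_second(round_trips, latency_s, cpu_s):
    """Upper bound on transactions per second for one serial session."""
    per_tx = round_trips * latency_s + cpu_s
    if per_tx <= 0:
        raise ValueError("a transaction must take some time")
    return 1.0 / per_tx


def cpu_utilisation(round_trips, latency_s, cpu_s):
    """Fraction of each transaction's wall-clock time spent on the CPU."""
    per_tx = round_trips * latency_s + cpu_s
    if per_tx <= 0:
        raise ValueError("a transaction must take some time")
    return cpu_s / per_tx


if __name__ == "__main__":
    # Hypothetical figures: 10 round trips at 0.2 ms, 20 microseconds of CPU.
    print(max_tx_per_second(10, 0.0002, 0.00002))
    print(cpu_utilisation(10, 0.0002, 0.00002))
```

With these figures the CPU is busy for roughly one percent of the time, which matches the kind of idle nodes described above.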

I don't know what Oracle is doing in detail, but they have a real problem
when losing a node inside the cluster (syncing again is really time
consuming).

For warehousing/sequential scans, this kind of clustering is
irrelevant.

I suggest looking at Teradata - they do really nice query partitioning on
so-called AMPs (we'd simply call them nodes). It is really nice for really
ugly warehousing queries (ugly in terms of the amount of data).

Hans

--
Cybertec Geschwinde & Schönig GmbH
Schöngrabern 134; A-2020 Hollabrunn
Tel: +43/1/205 10 35 / 340
www.postgresql.at, www.cybertec.at