Parallel Optimizer
I asked a while ago in this group about the possibility of implementing a
parallel planner in a multithreaded way, and the replies were that the
proposed approach couldn't be implemented because Postgres is not
thread-safe. With the new Background Worker Processes feature, would such
an implementation be possible? If so, can you see potential problems with
this approach, for example, that a bgworker can't access some planner core
functions like make_join_rel, can't access them in parallel, and so on?
I want to start research toward a parallel planner in Postgres. I
succeeded in the DBMS H2, but my first choice is still Postgres, and
any help is welcome.
Regards,
Fred
"Fred&Dani&Pandora&Aquiles" <fred@nti.ufop.br> writes:
I asked a while ago in this group about the possibility of implementing a
parallel planner in a multithreaded way, and the replies were that the
proposed approach couldn't be implemented because Postgres is not
thread-safe. With the new Background Worker Processes feature, would such
an implementation be possible?
I don't think that bgworkers as currently implemented make this any more
practical than it was before. The communication overhead with a
separate process would swamp any benefit in most cases.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Jun 7, 2013 at 2:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Fred&Dani&Pandora&Aquiles" <fred@nti.ufop.br> writes:
I asked a while ago in this group about the possibility of implementing a
parallel planner in a multithreaded way, and the replies were that the
proposed approach couldn't be implemented because Postgres is not
thread-safe. With the new Background Worker Processes feature, would such
an implementation be possible?

I don't think that bgworkers as currently implemented make this any more
practical than it was before. The communication overhead with a
separate process would swamp any benefit in most cases.
I agree this can't be done yet, but I don't agree with that reasoning.
I would articulate it this way: we don't have parallel execution,
therefore how could we meaningfully do parallel optimization?
I'm baffled by your statement that the communication overhead would be
too high. What IPC mechanism are you presuming, and why would it be
any more expensive in PostgreSQL than in any other database (a number
of which do have parallel query execution)?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Fri, Jun 7, 2013 at 2:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I don't think that bgworkers as currently implemented make this any more
practical than it was before. The communication overhead with a
separate process would swamp any benefit in most cases.
I'm baffled by your statement that the communication overhead would be
too high. What IPC mechanism are you presuming, and why would it be
any more expensive in PostgreSQL than in any other database (a number
of which do have parallel query execution)?
Well, right at the moment we don't have *any* IPC mechanism that would
work, so the cost is infinity. But the real issues here are the same
as we touched on in the recent round of discussions about parallelism:
you'd have to export snapshots, locks, etc to another process before it
could start taking over any planning work for you, and once you did have
all the context shared, there would still be a tremendous amount of
two-way communication needed, because the parallelizable units of work
are not very large. There's too much work yet to be done before this is
remotely practical.
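Tom's point, that small parallelizable units let per-unit IPC overhead dominate, can be illustrated with a toy model. All figures below are invented for illustration; none are PostgreSQL measurements, and the naive one-IPC-round-per-unit scheme is an assumption, not how any real worker protocol is built:

```python
# Toy model: compare serial time against a naive scheme in which every
# unit of work farmed out to a background worker costs one round of IPC.
# Unit sizes and the 100 us IPC cost are illustrative assumptions.

def speedup(unit_work_us, ipc_overhead_us, n_units, n_workers):
    serial = unit_work_us * n_units
    parallel = (n_units / n_workers) * (unit_work_us + ipc_overhead_us)
    return serial / parallel

# Tiny units, e.g. one join-order candidate (~10 us each):
small = speedup(unit_work_us=10, ipc_overhead_us=100,
                n_units=10_000, n_workers=4)

# Big units (~100 ms each), as in parallel execution of large scans:
large = speedup(unit_work_us=100_000, ipc_overhead_us=100,
                n_units=100, n_workers=4)

print(f"small units: {small:.2f}x speedup")  # under 1x: slower than serial
print(f"large units: {large:.2f}x speedup")  # near the ideal 4x
```

Under these assumptions the same four workers make small-unit work slower than serial while large-unit work gets nearly the full 4x, which is the asymmetry at issue in this subthread.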
As for other databases, I suspect that ones that have parallel execution
are probably doing it with a thread model not a process model.
regards, tom lane
On Fri, Jun 7, 2013 at 3:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Fri, Jun 7, 2013 at 2:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I don't think that bgworkers as currently implemented make this any more
practical than it was before. The communication overhead with a
separate process would swamp any benefit in most cases.

I'm baffled by your statement that the communication overhead would be
too high. What IPC mechanism are you presuming, and why would it be
any more expensive in PostgreSQL than in any other database (a number
of which do have parallel query execution)?

Well, right at the moment we don't have *any* IPC mechanism that would
work, so the cost is infinity. But the real issues here are the same
as we touched on in the recent round of discussions about parallelism:
you'd have to export snapshots, locks, etc to another process before it
could start taking over any planning work for you, and once you did have
all the context shared, there would still be a tremendous amount of
two-way communication needed, because the parallelizable units of work
are not very large. There's too much work yet to be done before this is
remotely practical.
Well, I'm not as pessimistic as all that, but I agree there's a good
deal of work to be done. I don't really see why the units of
parallelizable work can't be large. Indeed, I'd argue that if they're
not, we've missed the boat.
As for other databases, I suspect that ones that have parallel execution
are probably doing it with a thread model not a process model.
Yeah, maybe. Maybe I'm a hopeless optimist - though I'm rarely
accused of that - I actually think that our process model is going to
work out fairly well for us here. It's true that there is a bunch of
state that needs to be copied around or shared to make this work. But
it's also true that with a thread model, we'd have to explicitly
arrange NOT to share all the things we DIDN'T want shared. Honestly,
that sounds harder.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 7 June 2013 20:23, Tom Lane <tgl@sss.pgh.pa.us> wrote:
As for other databases, I suspect that ones that have parallel execution
are probably doing it with a thread model not a process model.
Separate processes are more common because they cover the general case
where query execution is spread across multiple nodes. Threads don't
work across nodes, and parallel queries predate (working) threading
models.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
I asked a while ago in this group about the possibility of implementing a
parallel planner in a multithreaded way, and the replies were that the
proposed approach couldn't be implemented because Postgres is not
thread-safe. With the new Background Worker Processes feature, would such
an implementation be possible?
Well, there are versions of genetic algorithms that use the concept of
islands, in which populations evolve in parallel on different islands,
with interaction between the islands and so on. I'm working on an
algorithm based on multiagent systems. At the moment, I mean in H2, the
agents are threads; there are a few locks related to the agents'
solutions, and a few locks for the best current solution in the
environment where the agents are 'running'. The agents can exchange
messages purposefully. The environment is shared by all the agents, and
they use it to get information from other agents (their current solution,
for example), try to update the best current solution, and so on.
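As a rough illustration of the island/environment scheme described above, here is a hypothetical Python sketch. This is not code from H2 or PostgreSQL; the toy objective, the mutation policy, and the 10% migration probability are all invented for the example:

```python
# Island-model sketch: each island evolves its own population in a thread
# and publishes its best solution to a shared, lock-protected environment,
# occasionally importing the global best (inter-island interaction).
import random
import threading

class Environment:
    """Shared 'environment': holds the best (fitness, solution) behind a lock."""
    def __init__(self):
        self._lock = threading.Lock()
        self.best = None

    def try_update(self, fit, solution):
        with self._lock:
            if self.best is None or fit < self.best[0]:
                self.best = (fit, list(solution))

def fitness(solution):
    return sum(solution)  # toy objective: minimize the sum of the genes

def island(env, generations=200, pop_size=20, genes=8, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 9) for _ in range(genes)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        env.try_update(fitness(pop[0]), pop[0])   # publish this island's best
        # Mutation-only evolution: children are perturbed copies of the elite half.
        for i in range(pop_size // 2, pop_size):
            child = list(pop[i - pop_size // 2])
            j = rng.randrange(genes)
            child[j] = max(0, child[j] - rng.randint(0, 2))
            pop[i] = child
        if env.best is not None and rng.random() < 0.1:
            pop[-1] = list(env.best[1])           # immigration from environment

env = Environment()
islands = [threading.Thread(target=island, args=(env,), kwargs={"seed": s})
           for s in range(4)]
for t in islands:
    t.start()
for t in islands:
    t.join()
print(env.best[0])  # small (near 0) for this toy objective
```

In CPython the GIL means these threads do not actually run the fitness evaluations in parallel, which mirrors the thread-safety limitation being discussed: the structure is parallel, the execution is not.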
Judging from the answers, I think the Background Worker Processes feature
still doesn't meet my needs, so I'll keep monitoring the progress of this
functionality to implement the planner in the future.
Thanks,
Fred
On Sat, Jun 8, 2013 at 5:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 7 June 2013 20:23, Tom Lane <tgl@sss.pgh.pa.us> wrote:
As for other databases, I suspect that ones that have parallel execution
are probably doing it with a thread model not a process model.

Separate processes are more common because they cover the general case
where query execution is spread across multiple nodes. Threads don't
work across nodes, and parallel queries predate (working) threading
models.
Indeed. Parallelism based on processes would be more convenient for
master-master
type of applications. Even if no master-master feature is implemented
directly in core,
at least a parallelism infrastructure based on processes could be used for
this purpose.
--
Michael
On Sat, Jun 8, 2013 at 5:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 7 June 2013 20:23, Tom Lane <tgl@sss.pgh.pa.us> wrote:
As for other databases, I suspect that ones that have parallel execution
are probably doing it with a thread model not a process model.

Separate processes are more common because they cover the general case
where query execution is spread across multiple nodes. Threads don't
work across nodes, and parallel queries predate (working) threading
models.

Indeed. Parallelism based on processes would be more convenient for
master-master type of applications. Even if no master-master feature is
implemented directly in core, at least a parallelism infrastructure based
on processes could be used for this purpose.
As long as "true" synchronous replication is not implemented in core,
I am not sure there's value for parallel execution spreading across
multiple nodes because of the delay of data update propagation.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
On Tue, Jun 11, 2013 at 9:45 AM, Tatsuo Ishii <ishii@postgresql.org> wrote:
On Sat, Jun 8, 2013 at 5:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 7 June 2013 20:23, Tom Lane <tgl@sss.pgh.pa.us> wrote:
As for other databases, I suspect that ones that have parallel execution
are probably doing it with a thread model not a process model.

Separate processes are more common because they cover the general case
where query execution is spread across multiple nodes. Threads don't
work across nodes, and parallel queries predate (working) threading
models.

Indeed. Parallelism based on processes would be more convenient for
master-master type of applications. Even if no master-master feature is
implemented directly in core, at least a parallelism infrastructure based
on processes could be used for this purpose.

As long as "true" synchronous replication is not implemented in core,
I am not sure there's value for parallel execution spreading across
multiple nodes because of the delay of data update propagation.
True, but we cannot rule out having such features in the future either,
so a process-based model is safer regarding the range of features and
applications we could gain from it.
--
Michael
On 11 June 2013 01:45, Tatsuo Ishii <ishii@postgresql.org> wrote:
On Sat, Jun 8, 2013 at 5:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 7 June 2013 20:23, Tom Lane <tgl@sss.pgh.pa.us> wrote:
As for other databases, I suspect that ones that have parallel execution
are probably doing it with a thread model not a process model.

Separate processes are more common because they cover the general case
where query execution is spread across multiple nodes. Threads don't
work across nodes, and parallel queries predate (working) threading
models.

Indeed. Parallelism based on processes would be more convenient for
master-master type of applications. Even if no master-master feature is
implemented directly in core, at least a parallelism infrastructure based
on processes could be used for this purpose.

As long as "true" synchronous replication is not implemented in core,
I am not sure there's value for parallel execution spreading across
multiple nodes because of the delay of data update propagation.
Please explain what you mean by the word "true" used here.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 06/10/2013 10:37 PM, Fred&Dani&Pandora&Aquiles wrote:
Hi,
I asked a while ago in this group about the possibility of implementing a
parallel planner in a multithreaded way, and the replies were that the
proposed approach couldn't be implemented because Postgres is not
thread-safe. With the new Background Worker Processes feature, would such
an implementation be possible?

Well, there are versions of genetic algorithms that use the concept of
islands, in which populations evolve in parallel on different islands,
with interaction between the islands and so on. I'm working on an
algorithm based on multiagent systems. At the moment, I mean in H2, the
agents are threads; there are a few locks related to the agents'
solutions, and a few locks for the best current solution in the
environment where the agents are 'running'. The agents can exchange
messages purposefully. The environment is shared by all the agents, and
they use it to get information from other agents (their current solution,
for example), try to update the best current solution, and so on.
If you do this as an academic exercise, then I'd recommend thinking in
"messages" only.
Separate out the message delivery entirely from your core design.
This makes the whole concept much simpler and more generic.
Message delivery can be made almost instantaneous in the case of threads,
or it can take anywhere from a few tens of microseconds to several seconds
between different physical nodes.
Which speed is "fast enough" depends entirely on your query: for a query
running 5 hours on a single CPU and 5 minutes on a cluster, a message
delay of 50 ms is entirely acceptable.
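A quick back-of-the-envelope check of that claim, using only the figures stated above (5 hours, 5 minutes, 50 ms; nothing here is a measurement):

```python
# How much message latency can a big parallelized query afford?
serial_s = 5 * 3600   # 5 hours on a single CPU
cluster_s = 5 * 60    # 5 minutes on the cluster
msg_delay_s = 0.050   # 50 ms per message

budget_s = serial_s - cluster_s               # time saved by parallelism
messages_affordable = budget_s / msg_delay_s  # sequential delays before the win is gone

print(f"{budget_s} s saved; room for ~{messages_affordable:,.0f} "
      f"sequential 50 ms delays")
```

So even hundreds of thousands of fully sequential 50 ms message delays would not erase the win for a query of that size, which is why the acceptable latency depends entirely on the query.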
--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
On 11 June 2013 01:45, Tatsuo Ishii <ishii@postgresql.org> wrote:
On Sat, Jun 8, 2013 at 5:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 7 June 2013 20:23, Tom Lane <tgl@sss.pgh.pa.us> wrote:
As for other databases, I suspect that ones that have parallel execution
are probably doing it with a thread model not a process model.

Separate processes are more common because they cover the general case
where query execution is spread across multiple nodes. Threads don't
work across nodes, and parallel queries predate (working) threading
models.

Indeed. Parallelism based on processes would be more convenient for
master-master type of applications. Even if no master-master feature is
implemented directly in core, at least a parallelism infrastructure based
on processes could be used for this purpose.

As long as "true" synchronous replication is not implemented in core,
I am not sure there's value for parallel execution spreading across
multiple nodes because of the delay of data update propagation.

Please explain what you mean by the word "true" used here.

In other words, "eager replication".
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
On Tue, Jun 11, 2013 at 9:45 AM, Tatsuo Ishii <ishii@postgresql.org> wrote:
On Sat, Jun 8, 2013 at 5:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 7 June 2013 20:23, Tom Lane <tgl@sss.pgh.pa.us> wrote:
As for other databases, I suspect that ones that have parallel execution
are probably doing it with a thread model not a process model.

Separate processes are more common because they cover the general case
where query execution is spread across multiple nodes. Threads don't
work across nodes, and parallel queries predate (working) threading
models.

Indeed. Parallelism based on processes would be more convenient for
master-master type of applications. Even if no master-master feature is
implemented directly in core, at least a parallelism infrastructure based
on processes could be used for this purpose.

As long as "true" synchronous replication is not implemented in core,
I am not sure there's value for parallel execution spreading across
multiple nodes because of the delay of data update propagation.

True, but we cannot rule out having such features in the future either,
so a process-based model is safer regarding the range of features and
applications we could gain from it.
I wonder why neither "true" synchronous replication nor "eager
replication" is on the developer TODO list. If we want them in the
future, they should be on it.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
On 6/7/13 2:23 PM, Tom Lane wrote:
As for other databases, I suspect that ones that have parallel execution
are probably doing it with a thread model not a process model.
Oracle 9i was multi-process, not multi-threaded. IIRC it actually had dedicated IO processes too; backends didn't do their own IO.
We certainly need to protect the use case of queries that run in milliseconds, and clearly parallelism won't help there at all. But we can't ignore the other end of the spectrum; you'd need a LOT of communication overhead to swamp the benefits of parallel execution on a multi-minute, CPU-bound query (or in many cases even IO bound).
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
On 11/06/13 19:24, Hannu Krosing wrote:
On 06/10/2013 10:37 PM, Fred&Dani&Pandora&Aquiles wrote:
Hi,

I asked a while ago in this group about the possibility of implementing a
parallel planner in a multithreaded way, and the replies were that the
proposed approach couldn't be implemented because Postgres is not
thread-safe. With the new Background Worker Processes feature, would such
an implementation be possible?

Well, there are versions of genetic algorithms that use the concept of
islands, in which populations evolve in parallel on different islands,
with interaction between the islands and so on. I'm working on an
algorithm based on multiagent systems. At the moment, I mean in H2, the
agents are threads; there are a few locks related to the agents'
solutions, and a few locks for the best current solution in the
environment where the agents are 'running'. The agents can exchange
messages purposefully. The environment is shared by all the agents, and
they use it to get information from other agents (their current solution,
for example), try to update the best current solution, and so on.

If you do this as an academic exercise, then I'd recommend thinking in
"messages" only.

Separate out the message delivery entirely from your core design.
This makes the whole concept much simpler and more generic.
Message delivery can be made almost instantaneous in the case of threads,
or it can take anywhere from a few tens of microseconds to several seconds
between different physical nodes.

Which speed is "fast enough" depends entirely on your query: for a query
running 5 hours on a single CPU and 5 minutes on a cluster, a message
delay of 50 ms is entirely acceptable.
--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
I suspect (from my position of almost total ignorance of this area!)
that once a generic method works independently of how closely coupled
the different parallel parts are, a later optimisation could be
added depending on how the parts were related. So running on a multi-core
chip could have a different communication system to that running
across multiple computers geographically dispersed. Though in practice, I
suspect that the most common use case would involve many processor
chips in the same 'box' (even if said box was distributed across a large
room!).
Anyhow, I think that separating out how to effectively parallelise
Postgres from how the parts communicate is a Good Thing (TM). Though
knowing Grim Reality, it is bound to be more complicated in Reality!
:-( As the useful size of the parallel units of work obviously does
relate to the communication overhead.
Possibly the biggest challenge will be in devising a planning
methodology that can efficiently decide on an appropriate parallel
strategy. Maybe a keyword to tell the planner that you know this is a
very big query and you don't mind it taking a long time to come up with a
decent plan? The planner would need to know details of the processing
unit topology, communication overheads, and possibly other details to
make a really effective plan in the distributed case.
My mind boggles, just thinking of the number of different variables that
might be required to create an 'optimal' plan for parallel processing in
a distributed system!
Cheers,
Gavin
On 06/11/2013 04:53 PM, Tatsuo Ishii wrote:
On 11 June 2013 01:45, Tatsuo Ishii <ishii@postgresql.org> wrote:
On Sat, Jun 8, 2013 at 5:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 7 June 2013 20:23, Tom Lane <tgl@sss.pgh.pa.us> wrote:
As for other databases, I suspect that ones that have parallel execution
are probably doing it with a thread model not a process model.

Separate processes are more common because they cover the general case
where query execution is spread across multiple nodes. Threads don't
work across nodes, and parallel queries predate (working) threading
models.

Indeed. Parallelism based on processes would be more convenient for
master-master type of applications. Even if no master-master feature is
implemented directly in core, at least a parallelism infrastructure based
on processes could be used for this purpose.

As long as "true" synchronous replication is not implemented in core,
I am not sure there's value for parallel execution spreading across
multiple nodes because of the delay of data update propagation.

Please explain what you mean by the word "true" used here.

In other words, "eager replication".
Do you mean something along these lines:
"Most synchronous or eager replication solutions do conflict prevention,
while asynchronous solutions have to do conflict resolution. For instance,
if a record is changed on two nodes simultaneously, an eager replication
system would detect the conflict before confirming the commit and abort
one of the transactions. A lazy replication system would allow both
transactions to commit and run a conflict resolution during
resynchronization."
?
IMO it is possible to do this "easily" once BDR has reached the state
where you can do streaming apply. That is, you replay actions on other
hosts as they are logged, not after the transaction commits. Doing it
this way you can wait for any action to successfully complete a full
circle before committing it at the source.

Currently the main missing part in doing this is autonomous transactions.
It can in theory be done by opening an extra backend for each incoming
transaction, but you would need a really big number of backends and you
also have extra overhead from interprocess communication.
--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
Please explain what you mean by the word "true" used here.
In other words, "eager replication".
Do you mean something along these lines :
"Most synchronous or eager replication solutions do conflict prevention,
while asynchronous solutions have to do conflict resolution. For instance,
if a record is changed on two nodes simultaneously, an eager replication
system would detect the conflict before confirming the commit and abort
one of the transactions. A lazy replication system would allow both
transactions to commit and run a conflict resolution during
resynchronization."

?
No, I'm not talking about conflict resolution.
From http://www.cs.cmu.edu/~natassa/courses/15-823/F02/papers/replication.pdf:
----------------------------------------------
Eager or Lazy Replication?
Eager replication:
keep all replicas synchronized by updating all
replicas in a single transaction
Lazy replication:
asynchronously propagate replica updates to
other nodes after replicating transaction commits
----------------------------------------------
Parallel query execution needs to assume that all nodes are synchronized
at commit; otherwise, combining the results of the queries executed on
each node is meaningless.
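The eager/lazy distinction quoted above can be caricatured in a few lines of Python. This is purely illustrative; real systems involve WAL shipping, two-phase commit, and failure handling, none of which is modeled here:

```python
# Eager: update every replica inside the originating "transaction".
# Lazy: acknowledge locally, propagate afterwards.
class Replica:
    def __init__(self):
        self.data = {}

def eager_write(replicas, key, value):
    # All replicas updated before the commit is acknowledged, so a
    # reader of any replica immediately sees the write.
    for r in replicas:
        r.data[key] = value

def lazy_write(replicas, key, value, pending):
    # Commit on the local replica only; queue propagation for later.
    replicas[0].data[key] = value
    pending.append((key, value))

def lazy_propagate(replicas, pending):
    while pending:
        key, value = pending.pop(0)
        for r in replicas[1:]:
            r.data[key] = value

nodes = [Replica() for _ in range(3)]

eager_write(nodes, "x", 1)
print(all(r.data.get("x") == 1 for r in nodes))   # True on every node

pending = []
lazy_write(nodes, "y", 2, pending)
print([r.data.get("y") for r in nodes])           # [2, None, None]: stale replicas
lazy_propagate(nodes, pending)
print(all(r.data.get("y") == 2 for r in nodes))   # consistent only after propagation
```

A parallel query that fanned subplans out to the replicas between `lazy_write` and `lazy_propagate` would combine fresh and stale rows, which is exactly why combining per-node results is meaningless without eager semantics.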
IMO it is possible to do this "easily" once BDR has reached the state
where you can do streaming apply. That is, you replay actions on other
hosts as they are logged, not after the transaction commits. Doing it
this way you can wait for any action to successfully complete a full
circle before committing it at the source.

Currently the main missing part in doing this is autonomous transactions.
It can in theory be done by opening an extra backend for each incoming
transaction, but you would need a really big number of backends and you
also have extra overhead from interprocess communication.
Thanks for the thought about conflict resolution in BDR.
BTW, if we seriously think about implementing parallel query execution,
we need to find a way to distribute data among the nodes, which requires
partial copies of tables. I think that would be a big challenge for
WAL-based replication.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
On 06/12/2013 01:01 AM, Tatsuo Ishii wrote:
Please explain what you mean by the word "true" used here.
In other words, "eager replication".
Do you mean something along these lines :
"Most synchronous or eager replication solutions do conflict prevention,
while asynchronous solutions have to do conflict resolution. For instance,
if a record is changed on two nodes simultaneously, an eager replication
system would detect the conflict before confirming the commit and abort
one of the transactions. A lazy replication system would allow both
transactions to commit and run a conflict resolution during
resynchronization."

?
No, I'm not talking about conflict resolution.
From http://www.cs.cmu.edu/~natassa/courses/15-823/F02/papers/replication.pdf:
----------------------------------------------
Eager or Lazy Replication?
Eager replication:
keep all replicas synchronized by updating all
replicas in a single transaction
Ok, so you are talking about distributed transactions?
In our current master-slave replication, how would it be different from
current synchronous replication?
Or does it make sense only in the case of multimaster replication?
The main problems with "keep all replicas synchronized by updating all
replicas in a single transaction"
are performance and reliability.
That is, the write performance has to be lower than for a single server,
and the failure of a single replica brings down the whole cluster.
Lazy replication:
asynchronously propagate replica updates to
other nodes after replicating transaction commits
----------------------------------------------

Parallel query execution needs to assume that all nodes are synchronized
at commit; otherwise, combining the results of the queries executed on
each node is meaningless.

IMO it is possible to do this "easily" once BDR has reached the state
where you can do streaming apply. That is, you replay actions on other
hosts as they are logged, not after the transaction commits. Doing it
this way you can wait for any action to successfully complete a full
circle before committing it at the source.

Currently the main missing part in doing this is autonomous transactions.
It can in theory be done by opening an extra backend for each incoming
transaction, but you would need a really big number of backends and you
also have extra overhead from interprocess communication.

Thanks for the thought about conflict resolution in BDR.
BTW, if we seriously think about implementing parallel query execution,
we need to find a way to distribute data among the nodes, which requires
partial copies of tables. I think that would be a big challenge for
WAL-based replication.
Moving partial query results around is a completely different problem
from replication. We should not mix these.
If, on the other hand, we think about sharding (that is, having a table
partitioned between nodes), then this can be done in BDR.
--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
No, I'm not talking about conflict resolution.
From http://www.cs.cmu.edu/~natassa/courses/15-823/F02/papers/replication.pdf:
----------------------------------------------
Eager or Lazy Replication?
Eager replication:
keep all replicas synchronized by updating all
replicas in a single transaction

Ok, so you are talking about distributed transactions?
In our current master-slave replication, how would it be different from
current synchronous replication?
Or does it make sense only in the case of multimaster replication?

The main problems with "keep all replicas synchronized by updating all
replicas in a single transaction"
are performance and reliability.

That is, the write performance has to be lower than for a single server
That's a limitation specific to log-based replication. It needs to
wait for log replay, which is virtually the same as a cluster-wide giant
lock. On the other hand, non-log-based replication systems (if my
understanding is correct, Postgres-XC is such a case) could perform
better than a single server.
and the failure of a single replica brings down the whole cluster.
That's the price of "eager replication". However, it could be mitigated
by using existing HA technologies.
Lazy replication:
asynchronously propagate replica updates to
other nodes after replicating transaction commits
----------------------------------------------

Parallel query execution needs to assume that all nodes are synchronized
at commit; otherwise, combining the results of the queries executed on
each node is meaningless.

IMO it is possible to do this "easily" once BDR has reached the state
where you can do streaming apply. That is, you replay actions on other
hosts as they are logged, not after the transaction commits. Doing it
this way you can wait for any action to successfully complete a full
circle before committing it at the source.

Currently the main missing part in doing this is autonomous transactions.
It can in theory be done by opening an extra backend for each incoming
transaction, but you would need a really big number of backends and you
also have extra overhead from interprocess communication.

Thanks for the thought about conflict resolution in BDR.
BTW, if we seriously think about implementing parallel query execution,
we need to find a way to distribute data among the nodes, which requires
partial copies of tables. I think that would be a big challenge for
WAL-based replication.

Moving partial query results around is a completely different problem
from replication. We should not mix these.
I just explained why log-based replication could not be an
infrastructure for parallel query execution. One reason is "lazy
replication"; the other is the need for partial copies of tables.
If, on the other hand, we think about sharding (that is, having a table
partitioned between nodes), then this can be done in BDR.
Ok, I didn't know that BDR can do it.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp