On Fri, Oct 21, 2016 at 2:38 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
On Wed, Oct 19, 2016 at 9:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Oct 13, 2016 at 7:27 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

However, when I briefly read the description in "Transaction Management in
the R* Distributed Database Management System (C. Mohan et al)" [2], it
seems that what Ashutosh is saying might be a correct way to proceed after
all:

I think Ashutosh is mostly right, but I think there's a lot of room to
doubt whether the design of this patch is good enough that we should
adopt it.

Consider two possible designs. In design #1, the leader performs the
commit locally and then tries to send COMMIT PREPARED to every remote
server afterward, and only then acknowledges the commit to the client.
In design #2, the leader performs the commit locally and then
acknowledges the commit to the client at once, leaving the task of
running COMMIT PREPARED to some background process. Design #2
involves a race condition, because it's possible that the background
process might not complete COMMIT PREPARED on every node before the
user submits the next query, and that query might then fail to see
supposedly-committed changes. This can't happen in design #1. On the
other hand, there's always the possibility that the leader's session
is forcibly killed, even perhaps by pulling the plug. If the
background process contemplated by design #2 is well-designed, it can
recover and finish sending COMMIT PREPARED to each relevant server
after the next restart. In design #1, that background process doesn't
necessarily exist, so inevitably there is a possibility of orphaning
prepared transactions on the remote servers, which is not good. Even
if the DBA notices them, it won't be easy to figure out whether to
commit them or roll them back.
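To make the comparison concrete, here is a sketch of the command
sequence a coordinator following design #1 might issue, using the
existing two-phase commit commands (the GID format is made up):

    -- On each foreign server participating in the transaction:
    PREPARE TRANSACTION 'fdw_1001_node_a';
    -- Commit locally on the coordinator:
    COMMIT;
    -- Then, on each foreign server, before acknowledging the client:
    COMMIT PREPARED 'fdw_1001_node_a';

Design #2 would issue the same commands, but hand the final COMMIT
PREPARED step to a background process and acknowledge the client
immediately.

I think this thought experiment shows that, on the one hand, there is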
a point to waiting for commits on the foreign servers, because it can
avoid the anomaly of not seeing the effects of your own commits. On
the other hand, it's ridiculous to suppose that every case can be
handled by waiting, because that just isn't true. You can't be sure
that you'll be able to wait long enough for COMMIT PREPARED to
complete, and even if that works out, you may not want to wait
indefinitely for a dead server. Waiting for a ROLLBACK PREPARED has
no value whatsoever unless the system design is such that failing to
wait for it results in the ROLLBACK PREPARED never getting performed
-- which is a pretty poor excuse.

Moreover, there are good reasons to think that doing this kind of
cleanup work in the post-commit hooks is never going to be acceptable.
Generally, the post-commit hooks need to be no-fail, because it's too
late to throw an ERROR. But there's very little hope that a
connection to a remote server can be no-fail; anything that involves a
network connection is, by definition, prone to failure. We can try to
guarantee that every single bit of code that runs in the path that
sends COMMIT PREPARED only raises a WARNING or NOTICE rather than an
ERROR, but that's going to be quite difficult to do: even palloc() can
throw an error. And what about interrupts? We don't want to be stuck
inside this code for a long time without any hope of the user
recovering control of the session by pressing ^C, but of course the
way that works is it throws an ERROR, which we can't handle here. We
fixed a similar issue for synchronous replication in
9a56dc3389b9470031e9ef8e45c95a680982e01a by making an interrupt emit a
WARNING in that case and then return control to the user. But if we
do that here, all of the code in every FDW has to be aware of
that rule and follow it, and it just adds to the list of ways that the
user backend can escape this code without having cleaned up all of the
prepared transactions on the remote side.

Hmm, IIRC, my patch, and possibly the patch by Masahiko-san and Vinayak,
tries to resolve prepared transactions in post-commit code. I agree
with you here that it should be avoided and the backend should take
over the job of resolving transactions.

It seems to me that the only way to really make this feature robust is
to have a background worker as part of the equation. The background
worker launches at startup and looks around for local state that tells
it whether there are any COMMIT PREPARED or ROLLBACK PREPARED
operations pending that weren't completed during the last server
lifetime, whether because of a local crash or remote unavailability.
It attempts to complete those and retries periodically. When a new
transaction needs this type of coordination, it adds the necessary
crash-proof state and then signals the background worker. If
appropriate, it can wait for the background worker to complete, just
like a CHECKPOINT waits for the checkpointer to finish -- but if the
CHECKPOINT command is interrupted, the actual checkpoint is
unaffected.
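As a sketch of the crash-proof state such a background worker might
scan at startup, imagine something like the following (a hypothetical
catalog shape, purely for illustration, not what any patch actually
stores):

    -- One row per prepared transaction on a remote server that still
    -- needs COMMIT PREPARED or ROLLBACK PREPARED:
    CREATE TABLE fdw_xacts (
        local_xid  xid,   -- transaction id on the coordinator
        serverid   oid,   -- foreign server holding the prepared transaction
        gid        text,  -- GID passed to PREPARE TRANSACTION there
        status     text   -- 'prepared', 'committing' or 'aborting'
    );

The worker would retry COMMIT PREPARED or ROLLBACK PREPARED for each
row until it succeeds, and then discard the row.

My patch, and hence the patch by Masahiko-san and Vinayak, has the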
background worker in the equation. The background worker tries to
resolve prepared transactions on the foreign server periodically.
IIRC, sending it a signal when another backend creates foreign
prepared transactions is not implemented. That may be a good addition.

More broadly, the question has been raised as to whether it's right to
try to handle atomic commit and atomic visibility as two separate
problems. The XTM API proposed by Postgres Pro aims to address both
with a single stroke. I don't think that API was well-designed, but
maybe the idea is good even if the code is not. Generally, there are
two ways in which you could imagine that a distributed version of
PostgreSQL might work. One possibility is that one node makes
everything work by going around and giving instructions to the other
nodes, which are more or less unaware that they are part of a cluster.
That is basically the design of Postgres-XC and certainly the design
being proposed here. The other possibility is that the nodes are
actually clustered in some way and agree on things like whether a
transaction committed or what snapshot is current using some kind of
consensus protocol. It is obviously possible to get a fairly long way
using the first approach but it seems likely that the second one is
fundamentally more powerful: among other things, because the first
approach is so centralized, the leader is apt to become a bottleneck.
And, quite apart from that, can a centralized architecture with the
leader manipulating the other workers ever allow for atomic
visibility? If atomic visibility can build on top of atomic commit,
then it makes sense to do atomic commit first, but if we build this
infrastructure and then find that we need an altogether different
solution for atomic visibility, that will be unfortunate.

There are two problems to solve as far as visibility is concerned:
1. Consistency: deciding which transactions' changes are visible to a
given transaction.
2. Making the changes by all the segments of a given distributed
transaction on different foreign servers visible at the same time;
IOW, no other transaction sees the changes of only a few segments
while missing the changes of the rest.

The first problem is hard to solve, and there are many consistency
semantics; it is a large topic of discussion in its own right.

The second problem can be solved on top of this infrastructure by
extending the PREPARE TRANSACTION API. I am writing down my ideas so
that they don't get lost; this is not a completed design.

Assume that we have syntax which tells the foreign server which
originating server prepared the transaction: PREPARE TRANSACTION <GID>
FOR SERVER <local server name> WITH ID <xid>, where xid is the
transaction identifier on the local server. Alternatively, we may
incorporate that information in the GID itself and have the foreign
server decode it.
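For illustration, the proposed syntax might look like this on the
foreign server (hypothetical and not implemented; the GID, server
name, and xid are made up):

    -- Issued by local server 'node_a' when preparing its part of
    -- distributed transaction 1001 on this foreign server:
    PREPARE TRANSACTION 'fdw_1001_node_a' FOR SERVER 'node_a' WITH ID 1001;

Once we have that information, the foreign server can actively poll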
the local server to get the status of transaction xid and resolve the
prepared transaction itself. It can go a step further and inform the
local server that it has resolved the transaction, so that the local
server can purge it from its own state. The foreign server can also
remember the fate of xid, which another foreign server can consult if
the local server is down. If another transaction on the foreign server
stumbles on a transaction prepared (but not resolved) by the local
server, the foreign server has two options: 1. consult the local
server and resolve the transaction; 2. if the first option fails to
get the status of xid, or if that option is not workable, throw an
error, e.g. "in-doubt transaction". There is probably more network
traffic happening here, but usually the local server should be able to
resolve the transaction before any other transaction stumbles upon it,
so the overhead is incurred only when necessary.
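A rough sketch of that resolution flow, using a hypothetical status
function (nothing like pg_fdw_xact_status() exists today; all names
here are made up):

    -- On the foreign server, upon stumbling on a prepared-but-unresolved
    -- transaction created by 'node_a' for xid 1001:
    SELECT pg_fdw_xact_status(1001);      -- ask node_a for the fate of 1001
    COMMIT PREPARED 'fdw_1001_node_a';    -- if it reports 'committed'
    ROLLBACK PREPARED 'fdw_1001_node_a';  -- or this, if it reports 'aborted'
    -- If node_a cannot be reached, raise an in-doubt-transaction error.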
I think we can consider atomic commit and atomic visibility
separately, and atomic visibility can build on top of atomic commit;
we can't provide atomic visibility across multiple nodes without
consistent updates. So I'd like to focus on atomic commit in this
thread. For providing atomic commit, the two-phase commit protocol is
the natural solution, and whatever type of solution we end up with for
atomic visibility, atomic commit via 2PC is a necessary feature. We
can consider an atomic commit feature with the following
functionality:
* The local node is responsible for transaction management among the
relevant remote servers using 2PC.
* The local node keeps information about the state of each distributed
transaction.
* There is a process that resolves in-doubt transactions.
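For example, with such a feature, a transaction touching two foreign
tables might behave like this from the user's point of view (table and
server names are made up):

    BEGIN;
    UPDATE ft_accounts_a SET balance = balance - 100 WHERE id = 1;  -- node_a
    UPDATE ft_accounts_b SET balance = balance + 100 WHERE id = 2;  -- node_b
    COMMIT;  -- internally: PREPARE TRANSACTION on both remote nodes,
             -- local commit, then COMMIT PREPARED on both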
As Ashutosh mentioned, the current patch supports almost all of this
functionality. But I'm trying to update it so that the information for
multiple foreign servers is kept in one FDWXact file, with one entry
in shared memory. Otherwise, even though a new remote server can be
added on the fly, we could need to restart the local server in order
to allocate a larger shared buffer for FDW transactions whenever a
remote server is added. I'm also incorporating other review comments.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center