Transactions involving multiple postgres foreign servers
Hi All,
While looking at the patch for supporting inheritance on foreign tables, I
noticed that if a transaction makes changes on two or more foreign
servers, the current implementation in postgres_fdw doesn't make sure that
either all of them roll back or all of them commit their changes; IOW there
is a possibility that some of them commit their changes while others
roll back theirs.
PFA a patch which uses 2PC to solve this problem. In pgfdw_xact_callback(), at
the XACT_EVENT_PRE_COMMIT event, it sends PREPARE TRANSACTION to all the
foreign PostgreSQL servers, and at the XACT_EVENT_COMMIT or XACT_EVENT_ABORT
event it commits or aborts those prepared transactions respectively.
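A minimal sketch of the event handling described above (not the patch itself): at XACT_EVENT_PRE_COMMIT every remote server is asked to PREPARE, and at XACT_EVENT_COMMIT / XACT_EVENT_ABORT the prepared transactions are closed out. The enum, state array, and function names here are illustrative stand-ins; the real callback works through postgres_fdw's connection cache and libpq.

```c
#include <assert.h>

typedef enum { EV_PRE_COMMIT, EV_COMMIT, EV_ABORT } XactEvent;
typedef enum { REMOTE_ACTIVE, REMOTE_PREPARED,
               REMOTE_COMMITTED, REMOTE_ABORTED } RemoteState;

enum { NSERVERS = 3 };
static RemoteState remotes[NSERVERS];   /* one entry per foreign server */

/* Illustrative stand-in for pgfdw_xact_callback(). */
static void
fdw_xact_callback(XactEvent event)
{
    for (int i = 0; i < NSERVERS; i++)
    {
        switch (event)
        {
            case EV_PRE_COMMIT:
                /* would send: PREPARE TRANSACTION '<gid>' */
                remotes[i] = REMOTE_PREPARED;
                break;
            case EV_COMMIT:
                /* would send: COMMIT PREPARED '<gid>' */
                remotes[i] = REMOTE_COMMITTED;
                break;
            case EV_ABORT:
                /* would send: ROLLBACK PREPARED '<gid>' */
                remotes[i] = REMOTE_ABORTED;
                break;
        }
    }
}
```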
The logic to craft the prepared transaction ids is rudimentary and I am
open to suggestions for it. I had the following goals in mind while
crafting the transaction ids:
1. Minimize the chances of crafting a transaction id which would conflict
with a concurrent transaction id on that foreign server.
2. Because of a limitation described later, DBA/user should be able to
identify the server which originated a remote transaction.
More can be found in comments above function pgfdw_get_prep_xact_id() in
the patch.
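As a hypothetical sketch of how such an id (GID) might be crafted, in the spirit of pgfdw_get_prep_xact_id() (the actual format in the patch may differ): embedding an identifier of the originating server serves goal 2, since a DBA can trace a dangling prepared transaction back to its source, while embedding the local transaction id serves goal 1, since concurrent local transactions get distinct GIDs. The prefix and field layout below are illustrative assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Craft a GID like "pgfdw_<origin server id>_<local xid>".
 * Format is an illustrative assumption, not the patch's exact scheme. */
static void
craft_gid(char *buf, size_t buflen,
          unsigned int origin_server_id, unsigned int local_xid)
{
    snprintf(buf, buflen, "pgfdw_%u_%u", origin_server_id, local_xid);
}
```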
Limitations
---------------
1. After a transaction has been prepared on a foreign server, if the
connection to that server is lost before the transaction is rolled back or
committed there, the transaction remains in the prepared state
forever. Manual intervention would be needed to clean up such a transaction
(hence goal 2 above). Automating this process would require significant
changes to the transaction manager, so I have left it out of this patch,
which I thought would be better for now. If required, I can work on that
part in this patch itself.
2. 2PC is needed only when there is more than one foreign server involved
in a transaction. Transactions on a single foreign server are handled well
right now. So, ideally, the code should detect whether more than one
foreign server is involved in the transaction and only then use 2PC. But I
couldn't find a way to detect that without changing the transaction manager.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
pg_fdw_transact.patch (text/x-patch; charset=US-ASCII) +298 -10
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
While looking at the patch for supporting inheritance on foreign tables, I
noticed that if a transaction makes changes on two or more foreign
servers, the current implementation in postgres_fdw doesn't make sure that
either all of them roll back or all of them commit their changes; IOW there
is a possibility that some of them commit their changes while others
roll back theirs.
PFA a patch which uses 2PC to solve this problem. In pgfdw_xact_callback(), at
the XACT_EVENT_PRE_COMMIT event, it sends PREPARE TRANSACTION to all the
foreign PostgreSQL servers, and at the XACT_EVENT_COMMIT or XACT_EVENT_ABORT
event it commits or aborts those prepared transactions respectively.
TBH, I think this is a pretty awful idea.
In the first place, this does little to improve the actual reliability
of a commit occurring across multiple foreign servers; and in the second
place it creates a bunch of brand new failure modes, many of which would
require manual DBA cleanup.
The core of the problem is that this doesn't have anything to do with
2PC as it's commonly understood: for that, you need a genuine external
transaction manager that is aware of all the servers involved in a
transaction, and has its own persistent state (or at least a way to
reconstruct its own state by examining the per-server states).
This patch is not that; in particular it treats the local transaction
asymmetrically from the remote ones, which doesn't seem like a great
idea --- ie, the local transaction could still abort after committing
all the remote ones, leaving you no better off in terms of cross-server
consistency.
As far as failure modes go, one basic reason why this cannot work as
presented is that the remote servers may not even have prepared
transaction support enabled (in fact max_prepared_transactions = 0
is the default in all supported PG versions). So this would absolutely
have to be a not-on-by-default option. But the bigger issue is that
leaving it to the DBA to clean up after failures is not a production
grade solution, *especially* not for prepared transactions, which are
performance killers if not closed out promptly. So I can't imagine
anyone wanting to turn this on without a more robust answer than that.
Basically I think what you'd need for this to be a credible patch would be
for it to work by changing the behavior only in the PREPARE TRANSACTION
path: rather than punting as we do now, prepare the remote transactions,
and report their server identities and gids to an external transaction
manager, which would then be responsible for issuing the actual commits
(along with the actual commit of the local transaction). I have no idea
whether it's feasible to do that without having to assume a particular
2PC transaction manager API/implementation.
It'd be interesting to hear from people who are using 2PC in production
to find out if this would solve any real-world problems for them, and
what the details of the TM interface would need to look like to make it
work in practice.
In short, you can't force 2PC technology on people who aren't using it
already; while for those who are using it already, this isn't nearly
good enough as-is.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Jan 2, 2015 at 3:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
In short, you can't force 2PC technology on people who aren't using it
already; while for those who are using it already, this isn't nearly
good enough as-is.
I was involved in some internal discussions related to this patch, so
I have some opinions on it. The long-term, high-level goal here is to
facilitate sharding. If we've got a bunch of PostgreSQL servers
interlinked via postgres_fdw, it should be possible to perform
transactions on the cluster in such a way that transactions are just
as atomic, consistent, isolated, and durable as they would be with
just one server. As far as I know, there is no way to achieve this
goal through the use of an external transaction manager, because even
if that external transaction manager guarantees, for every
transaction, that the transaction either commits on all nodes or rolls
back on all nodes, there's no way for it to guarantee that other
transactions won't see some intermediate state where the commit has
been completed on some nodes but not others. To get that, you need
some sort of integration that reaches down to the way snapshots are taken.
I think, though, that it might be worthwhile to first solve the
simpler problem of figuring out how to ensure that a transaction
commits everywhere or rolls back everywhere, even if intermediate
states might still be transiently visible. I don't think this patch,
as currently designed, is equal to that challenge, because
XACT_EVENT_PRE_COMMIT fires before the transaction is certain to
commit - PreCommit_CheckForSerializationFailure or PreCommit_Notify
could still error out. We could have a hook that fires after that,
but that doesn't solve the problem if a user of that hook can itself
throw an error. Even if part of the API contract is that it's not
allowed to do so, the actual attempt to commit the change on the
remote side can fail due to - e.g. - a network interruption, and
that's got to be dealt with somehow.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
I was involved in some internal discussions related to this patch, so
I have some opinions on it. The long-term, high-level goal here is to
facilitate sharding. If we've got a bunch of PostgreSQL servers
interlinked via postgres_fdw, it should be possible to perform
transactions on the cluster in such a way that transactions are just
as atomic, consistent, isolated, and durable as they would be with
just one server. As far as I know, there is no way to achieve this
goal through the use of an external transaction manager, because even
if that external transaction manager guarantees, for every
transaction, that the transaction either commits on all nodes or rolls
back on all nodes, there's no way for it to guarantee that other
transactions won't see some intermediate state where the commit has
been completed on some nodes but not others. To get that, you need
some sort of integration that reaches down to the way snapshots are taken.
That's a laudable goal, but I would bet that nothing built on the FDW
infrastructure will ever get there. Certainly the proposed patch doesn't
look like it moves us very far towards that set of goalposts.
I think, though, that it might be worthwhile to first solve the
simpler problem of figuring out how to ensure that a transaction
commits everywhere or rolls back everywhere, even if intermediate
states might still be transiently visible.
Perhaps. I suspect that it might still be a dead end if the ultimate
goal is cross-system atomic commit ... but likely it would teach us
some useful things anyway.
regards, tom lane
On Mon, Jan 5, 2015 at 2:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
That's a laudable goal, but I would bet that nothing built on the FDW
infrastructure will ever get there.
Why?
It would be surprising to me if, given that we have gone to some pains
to create a system that allows cross-system queries, and hopefully
eventually pushdown of quals, joins, and aggregates, we then made
sharding work in some completely different way that reuses none of
that infrastructure. But maybe I am looking at this the wrong way.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Jan 5, 2015 at 2:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
That's a laudable goal, but I would bet that nothing built on the FDW
infrastructure will ever get there.
Why?
It would be surprising to me if, given that we have gone to some pains
to create a system that allows cross-system queries, and hopefully
eventually pushdown of quals, joins, and aggregates, we then made
sharding work in some completely different way that reuses none of
that infrastructure. But maybe I am looking at this the wrong way.
Well, we intentionally didn't couple the FDW stuff closely into
transaction commit, because of the thought that the "far end" would not
necessarily have Postgres-like transactional behavior, and even if it did
there would be about zero chance of having atomic commit with a
non-Postgres remote server. postgres_fdw is a seriously bad starting
point as far as that goes, because it encourages one to make assumptions
that can't possibly work for any other wrapper.
I think the idea I sketched upthread of supporting an external transaction
manager might be worth pursuing, in that it would potentially lead to
having at least an approximation of atomic commit across heterogeneous
servers.
Independently of that, I think what you are talking about would be better
addressed outside the constraints of the FDW mechanism. That's not to say
that we couldn't possibly make postgres_fdw use some additional non-FDW
infrastructure to manage commits; just that solving this in terms of the
FDW infrastructure seems wrongheaded to me.
regards, tom lane
On Mon, Jan 5, 2015 at 3:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Well, we intentionally didn't couple the FDW stuff closely into
transaction commit, because of the thought that the "far end" would not
necessarily have Postgres-like transactional behavior, and even if it did
there would be about zero chance of having atomic commit with a
non-Postgres remote server. postgres_fdw is a seriously bad starting
point as far as that goes, because it encourages one to make assumptions
that can't possibly work for any other wrapper.
Atomic commit is something that can potentially be supported by many
different FDWs, as long as the thing on the other end supports 2PC.
If you're talking to Oracle or DB2 or SQL Server, and it supports 2PC,
then you can PREPARE the transaction and then go back and COMMIT the
transaction once it's committed locally. Getting a cluster-wide
*snapshot* is probably a PostgreSQL-only thing requiring much deeper
integration, but I think it would be sensible to leave that as a
future project and solve the simpler problem first.
I think the idea I sketched upthread of supporting an external transaction
manager might be worth pursuing, in that it would potentially lead to
having at least an approximation of atomic commit across heterogeneous
servers.
An important threshold question here is whether we want to rely on an
external transaction manager, or build one into PostgreSQL. As far as
this particular project goes, there's nothing that can't be done
inside PostgreSQL. You need a durable registry of which transactions
you prepared on which servers, and which XIDs they correlate to. If
you have that, then you can use background workers or similar to go
retry commits or rollbacks of prepared transactions until it works,
even if there's been a local crash meanwhile.
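A toy, in-memory stand-in for the "durable registry" described here: which prepared transaction (gid) is pending on which server, correlated to the local XID. In a real implementation the entries would be WAL-logged or otherwise persisted, and a background worker would rescan unresolved entries after a crash, retrying COMMIT PREPARED until it succeeds. All names and the retry scheme are illustrative assumptions.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct PendingTxn {
    unsigned int local_xid;     /* local transaction this belongs to */
    unsigned int server_oid;    /* foreign server the txn was prepared on */
    const char  *gid;           /* GID used in PREPARE TRANSACTION */
    bool         resolved;      /* has COMMIT PREPARED succeeded yet? */
} PendingTxn;

/* Stub for "issue COMMIT PREPARED over the network": fails on the first
 * round to mimic a transient network error, succeeds afterwards. */
static bool
try_commit_prepared(PendingTxn *t, int round)
{
    (void) t;
    return round > 0;
}

/* Retry loop a background worker might run until every entry resolves.
 * Returns the number of rounds it took. */
static int
resolve_pending(PendingTxn *reg, int n)
{
    int rounds = 0;
    for (bool all_done = false; !all_done; )
    {
        all_done = true;
        for (int i = 0; i < n; i++)
        {
            if (reg[i].resolved)
                continue;
            if (try_commit_prepared(&reg[i], rounds))
                reg[i].resolved = true;
            else
                all_done = false;
        }
        rounds++;
    }
    return rounds;
}
```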
Alternatively, you could rely on an external transaction manager to do
all that stuff. I don't have a clear sense of what that would entail,
or how it might be better or worse than rolling our own. I suspect,
though, that it might amount to little more than adding a middle man.
I mean, a third-party transaction manager isn't going to automatically
know how to commit a transaction prepared on some foreign server using
some foreign data wrapper. It's going to have to be taught that if
postgres_fdw leaves a transaction in medias res on server OID 1234,
you've got to connect to the target machine using that foreign
server's connection parameters, speak libpq, and issue the appropriate
COMMIT TRANSACTION command. And similarly, you're going to need to
arrange to notify it before preparing that transaction so that it
knows that it needs to request the COMMIT or ABORT later on. Once
you've got all of that infrastructure for that in place, what are you
really gaining over just doing it in PostgreSQL (or, say, a contrib
module thereto)?
(I'm also concerned that an external transaction manager might need
the PostgreSQL client to be aware of it, whereas what we'd really like
here is for the client to just speak PostgreSQL and be happy that its
commits no longer end up half-done.)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jan 5, 2015 at 11:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jan 2, 2015 at 3:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
In short, you can't force 2PC technology on people who aren't using it
already; while for those who are using it already, this isn't nearly
good enough as-is.
I was involved in some internal discussions related to this patch, so
I have some opinions on it. The long-term, high-level goal here is to
facilitate sharding. If we've got a bunch of PostgreSQL servers
interlinked via postgres_fdw, it should be possible to perform
transactions on the cluster in such a way that transactions are just
as atomic, consistent, isolated, and durable as they would be with
just one server. As far as I know, there is no way to achieve this
goal through the use of an external transaction manager, because even
if that external transaction manager guarantees, for every
transaction, that the transaction either commits on all nodes or rolls
back on all nodes, there's no way for it to guarantee that other
transactions won't see some intermediate state where the commit has
been completed on some nodes but not others. To get that, you need
some sort of integration that reaches down to the way snapshots are taken.
I think, though, that it might be worthwhile to first solve the
simpler problem of figuring out how to ensure that a transaction
commits everywhere or rolls back everywhere, even if intermediate
states might still be transiently visible.
Agreed.
I don't think this patch,
as currently designed, is equal to that challenge, because
XACT_EVENT_PRE_COMMIT fires before the transaction is certain to
commit - PreCommit_CheckForSerializationFailure or PreCommit_Notify
could still error out. We could have a hook that fires after that,
but that doesn't solve the problem if a user of that hook can itself
throw an error. Even if part of the API contract is that it's not
allowed to do so, the actual attempt to commit the change on the
remote side can fail due to - e.g. - a network interruption, and
that's got to be dealt with somehow.
Tom mentioned
--
in particular it treats the local transaction
asymmetrically from the remote ones, which doesn't seem like a great
idea --- ie, the local transaction could still abort after committing
all the remote ones, leaving you no better off in terms of cross-server
consistency.
--
You have given a specific example of this case. So, let me do a dry run
through CommitTransaction() after applying my patch.
1899 CallXactCallbacks(XACT_EVENT_PRE_COMMIT);
While processing this event, postgres_fdw's callback
pgfdw_xact_callback() sends a PREPARE TRANSACTION to all the foreign
servers involved. These servers return with their successes or failures. Even
if one of them fails, the local transaction is aborted along with all the
prepared transactions. Only if all the foreign servers succeed do we proceed
further.
1925 PreCommit_CheckForSerializationFailure();
1926
1932 PreCommit_Notify();
1933
If any of these functions (as you mentioned above) throws an error, the local
transaction will be aborted, as well as the remote prepared transactions.
Note that we haven't yet committed the local transaction (which will be
done below), nor the remote transactions, which are still in PREPAREd
state there. Since all the transactions, local as well as remote, are aborted
in case of error, the data is still consistent. If these steps succeed, we
proceed ahead.
1934 /* Prevent cancel/die interrupt while cleaning up */
1935 HOLD_INTERRUPTS();
1936
1937 /* Commit updates to the relation map --- do this as late as possible */
1938 AtEOXact_RelationMap(true);
1939
1940 /*
1941  * set the current transaction state information appropriately during
1942  * commit processing
1943  */
1944 s->state = TRANS_COMMIT;
1945
1946 /*
1947  * Here is where we really truly commit.
1948  */
1949 latestXid = RecordTransactionCommit();
1950
1951 TRACE_POSTGRESQL_TRANSACTION_COMMIT(MyProc->lxid);
1952
1953 /*
1954  * Let others know about no transaction in progress by me. Note that this
1955  * must be done _before_ releasing locks we hold and _after_
1956  * RecordTransactionCommit.
1957  */
1958 ProcArrayEndTransaction(MyProc, latestXid);
1959
Local transaction committed. Remote transactions still in PREPAREd state.
If any server (including the local one) crashes, or a link failure happens
here, we leave the remote transactions dangling in PREPAREd state and manual
cleanup will be required.
1975
1976 CallXactCallbacks(XACT_EVENT_COMMIT);
The postgres_fdw callback pgfdw_xact_callback() commits the PREPAREd
transactions by sending COMMIT TRANSACTION to the remote servers (my patch).
So, I don't see why my patch would cause inconsistencies. It can cause
dangling PREPAREd transactions, and I have already acknowledged that fact.
Am I missing something?
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Sat, Jan 3, 2015 at 2:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
While looking at the patch for supporting inheritance on foreign tables,
I
noticed that if a transaction makes changes on two or more foreign
servers, the current implementation in postgres_fdw doesn't make sure that
either all of them roll back or all of them commit their changes; IOW there
is a possibility that some of them commit their changes while others
roll back theirs.
PFA a patch which uses 2PC to solve this problem. In pgfdw_xact_callback(), at
the XACT_EVENT_PRE_COMMIT event, it sends PREPARE TRANSACTION to all the
foreign PostgreSQL servers, and at the XACT_EVENT_COMMIT or XACT_EVENT_ABORT
event it commits or aborts those prepared transactions respectively.
TBH, I think this is a pretty awful idea.
In the first place, this does little to improve the actual reliability
of a commit occurring across multiple foreign servers; and in the second
place it creates a bunch of brand new failure modes, many of which would
require manual DBA cleanup.
The core of the problem is that this doesn't have anything to do with
2PC as it's commonly understood: for that, you need a genuine external
transaction manager that is aware of all the servers involved in a
transaction, and has its own persistent state (or at least a way to
reconstruct its own state by examining the per-server states).
This patch is not that; in particular it treats the local transaction
asymmetrically from the remote ones, which doesn't seem like a great
idea --- ie, the local transaction could still abort after committing
all the remote ones, leaving you no better off in terms of cross-server
consistency.
As far as failure modes go, one basic reason why this cannot work as
presented is that the remote servers may not even have prepared
transaction support enabled (in fact max_prepared_transactions = 0
is the default in all supported PG versions). So this would absolutely
have to be a not-on-by-default option.
Agreed. We can have a per-foreign-server option which says whether the
corresponding server can participate in 2PC. A transaction spanning
multiple foreign servers with at least one of them not capable of
participating in 2PC will need to be aborted.
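The rule proposed here can be sketched as follows (all names are illustrative): a write transaction touching more than one foreign server may proceed only if every participating server advertises 2PC support; a single-server transaction needs no 2PC and is always allowed.

```c
#include <assert.h>
#include <stdbool.h>

/* server_supports_2pc[i]: hypothetical per-server capability flag,
 * e.g. from a foreign server option.  Returns false when the
 * transaction must be aborted. */
static bool
two_phase_ok(const bool *server_supports_2pc, int nservers)
{
    if (nservers <= 1)
        return true;                /* handled well already today */
    for (int i = 0; i < nservers; i++)
        if (!server_supports_2pc[i])
            return false;           /* must abort the transaction */
    return true;
}
```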
But the bigger issue is that
leaving it to the DBA to clean up after failures is not a production
grade solution, *especially* not for prepared transactions, which are
performance killers if not closed out promptly. So I can't imagine
anyone wanting to turn this on without a more robust answer than that.
I purposefully left that outside this patch, since it involves significant
changes in core. If that's necessary for the first cut, I will work on it.
Basically I think what you'd need for this to be a credible patch would be
for it to work by changing the behavior only in the PREPARE TRANSACTION
path: rather than punting as we do now, prepare the remote transactions,
and report their server identities and gids to an external transaction
manager, which would then be responsible for issuing the actual commits
(along with the actual commit of the local transaction). I have no idea
whether it's feasible to do that without having to assume a particular
2PC transaction manager API/implementation.
I doubt a TM would expect a bunch of GIDs in response to a PREPARE
TRANSACTION command. Per X/Open, xa_prepare() returns an integer
value specifying whether the PREPARE succeeded or not, plus some piggybacked
statuses.
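For reference, the rough shape of the X/Open XA prepare entry point alluded to above: the TM hands the resource manager a transaction branch identifier (an XID structure, not a GID string) and gets back an integer status such as XA_OK or XA_RDONLY. The struct layout and constants below follow my reading of the XA specification; the stub body is illustrative only.

```c
#include <assert.h>

#define XIDDATASIZE 128

typedef struct xid_t {
    long formatID;          /* format identifier; -1 means null XID */
    long gtrid_length;      /* length of the global transaction id part */
    long bqual_length;      /* length of the branch qualifier part */
    char data[XIDDATASIZE]; /* gtrid followed by bqual */
} XID;

#define XA_OK       0       /* normal execution */
#define XA_RDONLY   3       /* branch was read-only; already committed */

/* Illustrative stub with the xa_prepare() signature from the XA spec. */
static int
xa_prepare_stub(XID *xid, int rmid, long flags)
{
    (void) xid; (void) rmid; (void) flags;
    return XA_OK;           /* a status code, not a list of GIDs */
}
```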
In the context of foreign tables under an inheritance tree, a single DML can
span multiple foreign servers. All such DMLs would then need to be handled
by an external TM. An external TM or application may not know exactly
which foreign servers are going to be affected by a DML. Users may
not want to set up an external TM in such cases. Instead, they would expect
PostgreSQL to manage such DMLs and transactions all by itself.
As Robert has suggested in his responses, it would be better to enable
PostgreSQL to manage distributed transactions itself.
It'd be interesting to hear from people who are using 2PC in production
to find out if this would solve any real-world problems for them, and
what the details of the TM interface would need to look like to make it
work in practice.
In short, you can't force 2PC technology on people who aren't using it
already; while for those who are using it already, this isn't nearly
good enough as-is.
regards, tom lane
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Tue, Jan 6, 2015 at 11:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jan 5, 2015 at 3:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Well, we intentionally didn't couple the FDW stuff closely into
transaction commit, because of the thought that the "far end" would not
necessarily have Postgres-like transactional behavior, and even if it did
there would be about zero chance of having atomic commit with a
non-Postgres remote server. postgres_fdw is a seriously bad starting
point as far as that goes, because it encourages one to make assumptions
that can't possibly work for any other wrapper.
Atomic commit is something that can potentially be supported by many
different FDWs, as long as the thing on the other end supports 2PC.
If you're talking to Oracle or DB2 or SQL Server, and it supports 2PC,
then you can PREPARE the transaction and then go back and COMMIT the
transaction once it's committed locally.
Getting a cluster-wide
*snapshot* is probably a PostgreSQL-only thing requiring much deeper
integration, but I think it would be sensible to leave that as a
future project and solve the simpler problem first.
I think the idea I sketched upthread of supporting an external
transaction
manager might be worth pursuing, in that it would potentially lead to
having at least an approximation of atomic commit across heterogeneous
servers.
An important threshold question here is whether we want to rely on an
external transaction manager, or build one into PostgreSQL. As far as
this particular project goes, there's nothing that can't be done
inside PostgreSQL. You need a durable registry of which transactions
you prepared on which servers, and which XIDs they correlate to. If
you have that, then you can use background workers or similar to go
retry commits or rollbacks of prepared transactions until it works,
even if there's been a local crash meanwhile.
Alternatively, you could rely on an external transaction manager to do
all that stuff. I don't have a clear sense of what that would entail,
or how it might be better or worse than rolling our own. I suspect,
though, that it might amount to little more than adding a middle man.
I mean, a third-party transaction manager isn't going to automatically
know how to commit a transaction prepared on some foreign server using
some foreign data wrapper. It's going to have to be taught that if
postgres_fdw leaves a transaction in medias res on server OID 1234,
you've got to connect to the target machine using that foreign
server's connection parameters, speak libpq, and issue the appropriate
COMMIT TRANSACTION command. And similarly, you're going to need to
arrange to notify it before preparing that transaction so that it
knows that it needs to request the COMMIT or ABORT later on. Once
you've got all of that infrastructure for that in place, what are you
really gaining over just doing it in PostgreSQL (or, say, a contrib
module thereto)?
Thanks, Robert, for giving a high-level view of the system needed for
PostgreSQL to be a transaction manager by itself. Agreed completely.
(I'm also concerned that an external transaction manager might need
the PostgreSQL client to be aware of it, whereas what we'd really like
here is for the client to just speak PostgreSQL and be happy that its
commits no longer end up half-done.)
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
I don't see why my patch would cause inconsistencies. It can
cause dangling PREPAREd transactions and I have already
acknowledged that fact.
Am I missing something?
To me that is the big problem. Where I have run into ad hoc
distributed transaction managers it has usually been because a
crash left prepared transactions dangling, without cleaning them up
when the transaction manager was restarted. This tends to wreak
havoc one way or another.
If we are going to include a distributed transaction manager with
PostgreSQL, it *must* persist enough information about the
transaction ID and where it is used in a way that will survive a
subsequent crash before beginning the PREPARE on any of the
systems. After all nodes are PREPAREd it must flag that persisted
data to indicate that it is now at a point where ROLLBACK is no
longer an option. Only then can it start committing the prepared
transactions. After the last node is committed it can clear this
information. On start-up the distributed transaction manager must
check for any distributed transactions left "in progress" and
commit or rollback based on the preceding; doing retries
indefinitely until it succeeds or is told to stop.
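The protocol Kevin lays out can be sketched as a small persisted state machine (names and states here are illustrative, not a proposed design): the TM records the transaction before the first PREPARE, flags the point of no return only after every node is prepared, and on restart rolls back anything not past that flag while retrying commit for anything past it.

```c
#include <assert.h>

typedef enum {
    DTX_RECORDED,       /* persisted before any PREPARE was sent */
    DTX_PREPARING,      /* some nodes may already be prepared */
    DTX_COMMITTABLE,    /* all nodes prepared: ROLLBACK no longer an option */
    DTX_DONE            /* all nodes committed; record can be cleared */
} DtxState;

typedef enum { RESOLVE_ROLLBACK, RESOLVE_COMMIT, RESOLVE_NOTHING } ResolveAction;

/* On start-up, decide what to do with a distributed transaction left
 * "in progress", based solely on its persisted state. */
static ResolveAction
recover(DtxState persisted)
{
    switch (persisted)
    {
        case DTX_RECORDED:
        case DTX_PREPARING:
            return RESOLVE_ROLLBACK;    /* not yet committable: roll back */
        case DTX_COMMITTABLE:
            return RESOLVE_COMMIT;      /* retry commit until it succeeds */
        case DTX_DONE:
            return RESOLVE_NOTHING;
    }
    return RESOLVE_NOTHING;
}
```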
Doing this incompletely (i.e., not identifying and correctly
handling the various failure modes) is IMO far worse than not
attempting it. If we could build in something that did this
completely and well, that would be a cool selling point; but let's
not gloss over the difficulties. We must recognize how big a
problem it would be to include a low-quality implementation.
Also, as previously mentioned, it must behave in some reasonable
way if a database is not configured to support 2PC, especially
since 2PC is off by default in PostgreSQL.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jan 7, 2015 at 9:50 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
I don't see why my patch would cause inconsistencies. It can
cause dangling PREPAREd transactions and I have already
acknowledged that fact.
Am I missing something?
To me that is the big problem. Where I have run into ad hoc
distributed transaction managers it has usually been because a
crash left prepared transactions dangling, without cleaning them up
when the transaction manager was restarted. This tends to wreak
havoc one way or another.
If we are going to include a distributed transaction manager with
PostgreSQL, it *must* persist enough information about the
transaction ID and where it is used in a way that will survive a
subsequent crash before beginning the PREPARE on any of the
systems.
Thanks a lot. I hadn't thought of this.
After all nodes are PREPAREd it must flag that persisted
data to indicate that it is now at a point where ROLLBACK is no
longer an option. Only then can it start committing the prepared
transactions. After the last node is committed it can clear this
information. On start-up the distributed transaction manager must
check for any distributed transactions left "in progress" and
commit or rollback based on the preceding; doing retries
indefinitely until it succeeds or is told to stop.
Agreed.
Doing this incompletely (i.e., not identifying and correctly
handling the various failure modes) is IMO far worse than not
attempting it. If we could build in something that did this
completely and well, that would be a cool selling point; but let's
not gloss over the difficulties. We must recognize how big a
problem it would be to include a low-quality implementation.

Also, as previously mentioned, it must behave in some reasonable
way if a database is not configured to support 2PC, especially
since 2PC is off by default in PostgreSQL.
I described one possibility in my reply to Tom's mail. Let me repeat it
here.
We can have a per-foreign-server option which says whether the
corresponding server is able to participate in 2PC. A transaction spanning
multiple foreign servers, with at least one of them not capable of
participating in 2PC, will be aborted.
Will that work?
In case a user incorrectly flags a foreign server as capable of 2PC, I
expect the corresponding FDW would raise an error (either because PREPARE
fails or the FDW doesn't handle that case) and the transaction will be
aborted anyway.
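A rough sketch of the proposed capability check, in Python. The option name `two_phase_commit` and the `TwoPhaseNotSupported` error are invented here for illustration; the patch itself may spell these differently.

```python
class TwoPhaseNotSupported(Exception):
    """Raised when a transaction needs 2PC but a participant can't provide it."""

def check_two_phase_capability(servers):
    """Abort (raise) if a multi-server transaction touches any server
    whose hypothetical 'two_phase_commit' option is not enabled.

    `servers` maps server name -> that foreign server's options dict.
    """
    if len(servers) <= 1:
        return  # a single-server transaction never needs 2PC
    incapable = [name for name, opts in servers.items()
                 if opts.get("two_phase_commit", "off") != "on"]
    if incapable:
        raise TwoPhaseNotSupported(
            "transaction spans %d servers but %s cannot participate in 2PC"
            % (len(servers), ", ".join(sorted(incapable))))
```

Note that the single-server early return is what keeps existing postgres_fdw behavior unchanged when 2PC isn't actually needed.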
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
On Wed, Jan 7, 2015 at 9:50 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
Also, as previously mentioned, it must behave in some reasonable
way if a database is not configured to support 2PC, especially
since 2PC is off by default in PostgreSQL.
We can have a per-foreign-server option which says whether the
corresponding server is able to participate in 2PC. A transaction
spanning multiple foreign servers, with at least one of them not
capable of participating in 2PC, will be aborted.

Will that work?
In case a user incorrectly flags a foreign server as capable of 2PC,
I expect the corresponding FDW would raise an error (either because
PREPARE fails or the FDW doesn't handle that case) and the
transaction will be aborted anyway.
That sounds like one way to handle it. I'm not clear on how you
plan to determine whether 2PC is required for a transaction.
(Apologies if it was previously mentioned and I've forgotten it.)
I don't mean to suggest that these problems are insurmountable; I
just think that people often underestimate the difficulty of
writing a distributed transaction manager and don't always
recognize the problems that it will cause if all of the failure
modes are not considered and handled.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
On Wed, Jan 7, 2015 at 11:20 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
If we are going to include a distributed transaction manager with
PostgreSQL, it *must* persist enough information about the
transaction ID and where it is used in a way that will survive a
subsequent crash before beginning the PREPARE on any of the
systems. After all nodes are PREPAREd it must flag that persisted
data to indicate that it is now at a point where ROLLBACK is no
longer an option. Only then can it start committing the prepared
transactions. After the last node is committed it can clear this
information. On start-up the distributed transaction manager must
check for any distributed transactions left "in progress" and
commit or rollback based on the preceding; doing retries
indefinitely until it succeeds or is told to stop.
I think one key question here is whether all of this should be handled
in PostgreSQL core or whether some of it should be handled in other
ways. Is the goal to make postgres_fdw (and FDWs for other databases
that support 2PC) persist enough information that someone *could*
write a transaction manager for PostgreSQL, or is the goal to actually
write that transaction manager?
Just figuring out how to persist the necessary information is a
non-trivial problem by itself. You might think that you could just
insert a row into a local table saying, hey, I'm about to prepare a
transaction remotely, but of course that doesn't work: if you then go
on to PREPARE before writing and flushing the local commit record,
then a crash before that's done leaves a dangling prepared transaction
on the remote node. You might think to write the record and then,
after writing and flushing the local commit record, do the PREPARE.
But you can't do that either, because now if the PREPARE fails you've
already committed locally.
I guess what you need to do is something like:
1. Write and flush a WAL record indicating an intent to prepare, with
a list of foreign server OIDs and GUIDs.
2. Prepare the remote transaction on each node. If any of those
operations fail, roll back any prepared nodes and error out.
3. Commit locally (i.e. RecordTransactionCommit, writing and flushing WAL).
4. Try to commit the remote transactions.
5. Write a WAL record indicating that you committed the remote transactions OK.
If you fail after step 1, you can straighten things out by looking at
the status of the transaction: if the transaction committed, any
transactions we intended-to-prepare need to be checked. If they are
still prepared, we need to commit them or roll them back according to
what happened to our XID.
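The five steps above can be sketched as follows. Everything here is hypothetical scaffolding for the sake of illustration: the `log` object stands in for durable WAL-like storage, and the node objects stand in for connections that expose PREPARE / COMMIT PREPARED / ROLLBACK PREPARED.

```python
def commit_distributed(local, remotes, log):
    """Commit path following the 5-step outline (illustrative sketch only)."""
    # 1. Durably record the intent to prepare, with the participant list.
    log.write_and_flush("intent", [r.name for r in remotes])

    # 2. Prepare every remote; on failure, roll back whatever was prepared.
    prepared = []
    for r in remotes:
        try:
            r.prepare()
            prepared.append(r)
        except Exception:
            for p in prepared:
                p.rollback_prepared()
            raise

    # 3. Commit locally -- the point of no return for the whole transaction.
    local.commit()

    # 4. Commit the remote prepared transactions.
    for r in remotes:
        r.commit_prepared()

    # 5. Durably record that all remote commits completed.
    log.write_and_flush("done", [r.name for r in remotes])
```

The recovery rule then falls out of the sketch: an "intent" record with no matching "done" record means the remote branches must be checked against the fate of the local XID.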
(Andres is talking in my other ear suggesting that we ought to reuse
the 2PC infrastructure to do all this. I'm not convinced that's a
good idea, but I'll let him present his own ideas here if he wants to
rather than trying to explain them myself.)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Robert Haas <robertmhaas@gmail.com> wrote:
Andres is talking in my other ear suggesting that we ought to
reuse the 2PC infrastructure to do all this.
If you mean that the primary transaction and all FDWs in the
transaction must use 2PC, that is what I was saying, although
apparently not clearly enough. All nodes *including the local one*
must be prepared and committed with data about the nodes saved
safely off somewhere that it can be read in the event of a failure
of any of the nodes *including the local one*. Without that, I see
this whole approach as a train wreck just waiting to happen.
I'm not really clear on the mechanism that is being proposed for
doing this, but one way would be to have the PREPARE of the local
transaction be requested explicitly and to have that cause all FDWs
participating in the transaction to also be prepared. (That might
be what Andres meant; I don't know.) That doesn't strike me as the
only possible mechanism to drive this, but it might well be the
simplest and cleanest. The trickiest bit might be to find a good
way to persist the distributed transaction information in a way
that survives the failure of the main transaction -- or even the
abrupt loss of the machine it's running on.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
On Thu, Jan 8, 2015 at 10:19 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
Robert Haas <robertmhaas@gmail.com> wrote:
Andres is talking in my other ear suggesting that we ought to
reuse the 2PC infrastructure to do all this.

If you mean that the primary transaction and all FDWs in the
transaction must use 2PC, that is what I was saying, although
apparently not clearly enough. All nodes *including the local one*
must be prepared and committed with data about the nodes saved
safely off somewhere that it can be read in the event of a failure
of any of the nodes *including the local one*. Without that, I see
this whole approach as a train wreck just waiting to happen.
Clearly, all the nodes other than the local one need to use 2PC. I am
unconvinced that the local node must write a 2PC state file only to
turn around and remove it again almost immediately thereafter.
I'm not really clear on the mechanism that is being proposed for
doing this, but one way would be to have the PREPARE of the local
transaction be requested explicitly and to have that cause all FDWs
participating in the transaction to also be prepared. (That might
be what Andres meant; I don't know.)
We want this to be client-transparent, so that the client just says
COMMIT and everything Just Works.
That doesn't strike me as the
only possible mechanism to drive this, but it might well be the
simplest and cleanest. The trickiest bit might be to find a good
way to persist the distributed transaction information in a way
that survives the failure of the main transaction -- or even the
abrupt loss of the machine it's running on.
I'd be willing to punt on surviving a loss of the entire machine. But
I'd like to be able to survive an abrupt reboot.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jan 8, 2015 at 10:19 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
Robert Haas <robertmhaas@gmail.com> wrote:
Andres is talking in my other ear suggesting that we ought to
reuse the 2PC infrastructure to do all this.

If you mean that the primary transaction and all FDWs in the
transaction must use 2PC, that is what I was saying, although
apparently not clearly enough. All nodes *including the local one*
must be prepared and committed with data about the nodes saved
safely off somewhere that it can be read in the event of a failure
of any of the nodes *including the local one*. Without that, I see
this whole approach as a train wreck just waiting to happen.

Clearly, all the nodes other than the local one need to use 2PC. I am
unconvinced that the local node must write a 2PC state file only to
turn around and remove it again almost immediately thereafter.
The key point is that the distributed transaction data must be
flagged as needing to commit rather than roll back between the
prepare phase and the final commit. If you try to avoid the
PREPARE, flagging, COMMIT PREPARED sequence by building the
flagging of the distributed transaction metadata into the COMMIT
process, you still have the problem of what to do on crash
recovery. You really need to use 2PC to keep that clean, I think.
I'm not really clear on the mechanism that is being proposed for
doing this, but one way would be to have the PREPARE of the local
transaction be requested explicitly and to have that cause all FDWs
participating in the transaction to also be prepared. (That might
be what Andres meant; I don't know.)

We want this to be client-transparent, so that the client just says
COMMIT and everything Just Works.
What about the case where one or more nodes don't support 2PC?
Do we silently make the choice, without the client really knowing?
That doesn't strike me as the
only possible mechanism to drive this, but it might well be the
simplest and cleanest. The trickiest bit might be to find a good
way to persist the distributed transaction information in a way
that survives the failure of the main transaction -- or even the
abrupt loss of the machine it's running on.

I'd be willing to punt on surviving a loss of the entire machine. But
I'd like to be able to survive an abrupt reboot.
As long as people are aware that there is an urgent need to find
and fix all data stores to which clusters on the failed machine
were connected via FDW when there is a hard machine failure, I
guess it is OK. In essence we just document it and declare it to
be somebody else's problem. In general I would expect a
distributed transaction manager to behave well in the face of any
single-machine failure, but if there is one aspect of a
full-featured distributed transaction manager we could give up, I
guess that would be it.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
On Thu, Jan 8, 2015 at 7:02 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
On Wed, Jan 7, 2015 at 9:50 PM, Kevin Grittner <kgrittn@ymail.com>
wrote:
Also, as previously mentioned, it must behave in some reasonable
way if a database is not configured to support 2PC, especially
since 2PC is off by default in PostgreSQL.

We can have a per-foreign-server option which says whether the
corresponding server is able to participate in 2PC. A transaction
spanning multiple foreign servers, with at least one of them not
capable of participating in 2PC, will be aborted.

Will that work?
In case a user incorrectly flags a foreign server as capable of 2PC,
I expect the corresponding FDW would raise an error (either because
PREPARE fails or the FDW doesn't handle that case) and the
transaction will be aborted anyway.

That sounds like one way to handle it. I'm not clear on how you
plan to determine whether 2PC is required for a transaction.
(Apologies if it was previously mentioned and I've forgotten it.)
Any transaction involving more than one server (including the local one, I
guess) will require 2PC. A transaction may modify and access a remote
database but not the local one; in such a case, the state of the local
transaction doesn't matter once the remote transaction is committed or
rolled back.
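That rule is simple enough to express as a one-line predicate. This helper is hypothetical, not code from the patch; it just restates the "more than one writing node" criterion:

```python
def needs_two_phase(remote_servers_written, local_written):
    """Return True when the transaction writes to more than one node.

    `remote_servers_written` is the set of foreign servers the transaction
    has modified; `local_written` says whether the local database was
    modified too. With only one writing node, an ordinary commit suffices.
    """
    nodes_written = len(remote_servers_written) + (1 if local_written else 0)
    return nodes_written > 1
```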
I don't mean to suggest that these problems are insurmountable; I
just think that people often underestimate the difficulty of
writing a distributed transaction manager and don't always
recognize the problems that it will cause if all of the failure
modes are not considered and handled.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Thu, Jan 8, 2015 at 8:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jan 7, 2015 at 11:20 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
If we are going to include a distributed transaction manager with
PostgreSQL, it *must* persist enough information about the
transaction ID and where it is used in a way that will survive a
subsequent crash before beginning the PREPARE on any of the
systems. After all nodes are PREPAREd it must flag that persisted
data to indicate that it is now at a point where ROLLBACK is no
longer an option. Only then can it start committing the prepared
transactions. After the last node is committed it can clear this
information. On start-up the distributed transaction manager must
check for any distributed transactions left "in progress" and
commit or rollback based on the preceding; doing retries
indefinitely until it succeeds or is told to stop.

I think one key question here is whether all of this should be handled
in PostgreSQL core or whether some of it should be handled in other
ways. Is the goal to make postgres_fdw (and FDWs for other databases
that support 2PC) persist enough information that someone *could*
write a transaction manager for PostgreSQL, or is the goal to actually
write that transaction manager?

Just figuring out how to persist the necessary information is a
non-trivial problem by itself. You might think that you could just
insert a row into a local table saying, hey, I'm about to prepare a
transaction remotely, but of course that doesn't work: if you then go
on to PREPARE before writing and flushing the local commit record,
then a crash before that's done leaves a dangling prepared transaction
on the remote node. You might think to write the record and then,
after writing and flushing the local commit record, do the PREPARE.
But you can't do that either, because now if the PREPARE fails you've
already committed locally.

I guess what you need to do is something like:
1. Write and flush a WAL record indicating an intent to prepare, with
a list of foreign server OIDs and GUIDs.
2. Prepare the remote transaction on each node. If any of those
operations fail, roll back any prepared nodes and error out.
3. Commit locally (i.e. RecordTransactionCommit, writing and flushing WAL).
4. Try to commit the remote transactions.
5. Write a WAL record indicating that you committed the remote
transactions OK.

If you fail after step 1, you can straighten things out by looking at
the status of the transaction: if the transaction committed, any
transactions we intended-to-prepare need to be checked. If they are
still prepared, we need to commit them or roll them back according to
what happened to our XID.
When you want to straighten things out and commit, the foreign server may
not be available. As Kevin pointed out above, we need to keep retrying to
resolve (commit or rollback, based on the status of the local transaction)
the PREPAREd transactions on the foreign server until they are resolved.
So, we will have to persist the information somewhere other than the WAL,
OR keep retaining the WAL even after the corresponding local transaction
has been committed or aborted, which I don't think is a good idea, since
that will have an impact on replication and VACUUM, especially because
it's going to affect the oldest transaction in the WAL.
That's where Andres's suggestion might help.
(Andres is talking in my other ear suggesting that we ought to reuse
the 2PC infrastructure to do all this. I'm not convinced that's a
good idea, but I'll let him present his own ideas here if he wants to
rather than trying to explain them myself.)
We can persist the information about a distributed transaction (which
especially requires 2PC) similarly to the way the 2PC infrastructure uses
the pg_twophase directory. I am still investigating whether we can re-use
the existing 2PC infrastructure or not. My initial reaction is no, since
2PC persists information about the local transaction, including locked
objects and WAL (?), in the pg_twophase directory, which is not required
for a distributed transaction. But the rest of the mechanism, like the
manner of processing the records during normal operation and recovery,
looks very useful.
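A pg_twophase-style on-disk layout holding only the distributed-transaction metadata might look like the sketch below. The directory layout, file format, and field names are all invented for illustration; the durability trick (write, fsync, atomic rename) is the same one the real pg_twophase state files rely on, so a crash leaves either the old file or the new one, never a torn write.

```python
import json
import os
import tempfile

def persist_distributed_txn(state_dir, gid, participants, commit_flagged):
    """Durably write one distributed transaction's state file."""
    payload = {"gid": gid,
               "participants": sorted(participants),
               "commit_flagged": commit_flagged}
    fd, tmp = tempfile.mkstemp(dir=state_dir)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())  # data on disk before the rename publishes it
        os.rename(tmp, os.path.join(state_dir, gid))  # atomic on POSIX
    except BaseException:
        os.unlink(tmp)
        raise

def load_distributed_txns(state_dir):
    """Read back all persisted state files, e.g. during crash recovery."""
    txns = {}
    for name in os.listdir(state_dir):
        with open(os.path.join(state_dir, name)) as f:
            txns[name] = json.load(f)
    return txns
```

Rewriting the same gid updates the commit flag in place, which is how the "point of no return" marker could be flipped between PREPARE and COMMIT PREPARED.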
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 1/8/15, 12:00 PM, Kevin Grittner wrote:
Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jan 8, 2015 at 10:19 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
Robert Haas <robertmhaas@gmail.com> wrote:
Andres is talking in my other ear suggesting that we ought to
reuse the 2PC infrastructure to do all this.

If you mean that the primary transaction and all FDWs in the
transaction must use 2PC, that is what I was saying, although
apparently not clearly enough. All nodes *including the local one*
must be prepared and committed with data about the nodes saved
safely off somewhere that it can be read in the event of a failure
of any of the nodes *including the local one*. Without that, I see
this whole approach as a train wreck just waiting to happen.

Clearly, all the nodes other than the local one need to use 2PC. I am
unconvinced that the local node must write a 2PC state file only to
turn around and remove it again almost immediately thereafter.

The key point is that the distributed transaction data must be
flagged as needing to commit rather than roll back between the
prepare phase and the final commit. If you try to avoid the
PREPARE, flagging, COMMIT PREPARED sequence by building the
flagging of the distributed transaction metadata into the COMMIT
process, you still have the problem of what to do on crash
recovery. You really need to use 2PC to keep that clean, I think.
If we had an independent transaction coordinator then I agree with you Kevin. I think Robert is proposing that if we are controlling one of the nodes that's participating as well as coordinating the overall transaction that we can take some shortcuts. AIUI a PREPARE means you are completely ready to commit. In essence you're just waiting to write and fsync the commit message. That is in fact the state that a coordinating PG node would be in by the time everyone else has done their prepare. So from that standpoint we're OK.
Now, as soon as ANY of the nodes commit, our coordinating node MUST be able to commit as well! That would require it to have a real prepared transaction of its own created. However, as long as there is zero chance of any other prepared transactions committing before our local transaction, that step isn't actually needed. Our local transaction will either commit or abort, and that will determine what needs to happen on all other nodes.
I'm ignoring the question of how the local node needs to store info about the other nodes in case of a crash, but AFAICT you could reliably recover manually from what I just described.
I think the question is: are we OK with "going under the skirt" in this fashion? Presumably it would provide better performance, whereas forcing ourselves to eat our own 2PC dogfood would presumably make it easier for someone to plugin an external coordinator instead of using our own. I think there's also a lot to be said for getting a partial implementation of this available today (requiring manual recovery), so long as it's not in core.
BTW, I found https://www.cs.rutgers.edu/~pxk/417/notes/content/transactions.html a useful read, specifically the 2PC portion.
I'm not really clear on the mechanism that is being proposed for
doing this, but one way would be to have the PREPARE of the local
transaction be requested explicitly and to have that cause all FDWs
participating in the transaction to also be prepared. (That might
be what Andres meant; I don't know.)

We want this to be client-transparent, so that the client just says
COMMIT and everything Just Works.

What about the case where one or more nodes don't support 2PC?
Do we silently make the choice, without the client really knowing?
Do we silently make the choice, without the client really knowing?
We abort. (Unless we want to have a running_with_scissors GUC...)
That doesn't strike me as the
only possible mechanism to drive this, but it might well be the
simplest and cleanest. The trickiest bit might be to find a good
way to persist the distributed transaction information in a way
that survives the failure of the main transaction -- or even the
abrupt loss of the machine it's running on.

I'd be willing to punt on surviving a loss of the entire machine. But
I'd like to be able to survive an abrupt reboot.

As long as people are aware that there is an urgent need to find
and fix all data stores to which clusters on the failed machine
were connected via FDW when there is a hard machine failure, I
guess it is OK. In essence we just document it and declare it to
be somebody else's problem. In general I would expect a
distributed transaction manager to behave well in the face of any
single-machine failure, but if there is one aspect of a
full-featured distributed transaction manager we could give up, I
guess that would be it.
ISTM that one option here would be to "simply" write and sync WAL record(s) of all externally prepared transactions. That would be enough for a hot standby to find all the other servers and tell them to either commit or abort, based on whether our local transaction committed or aborted. If you wanted, you could even have the standby be responsible for telling all the other participants to commit...
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
--