pg_rewind hangs if --source-server is used and syncrep is enabled
Hi,
if pg_rewind is told to fetch data via a libpq connection
(--source-server), synchronous replication is enabled and there is only
one sync standby (pilot error, but sill); pg_rewinding the old master
hangs at the CREATE TEMPORARY TABLE step (CREATE TABLE waiting for
X/XXXXXXXX). At least this happened to one of our clients while
evaluating pg_rewind.
To the user, the last thing printed is "need to copy XXXX MB [...]". If
the user cancels the pg_rewind command with ^C, the backend keeps
hanging around even in --dry-run mode. That won't hurt too much as it
does not seem to block future pg_rewind runs after synchronous_commit
has been set to a different value, but looks surprising to me.
Not sure whether pg_rewind could error out gracefully without hanging in
this case, but maybe it could/should clean up the query on SIGTERM? And
at the least, maybe this caveat should be documented?
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael.banck@credativ.de
credativ GmbH, HRB M�nchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 M�nchengladbach
Gesch�ftsf�hrung: Dr. Michael Meskes, J�rg Folz, Sascha Heuer
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Oct 05, 2016 at 04:39:39PM +0200, Michael Banck wrote:
if pg_rewind is told to fetch data via a libpq connection
(--source-server), synchronous replication is enabled and there is only
one sync standby (pilot error, but sill); pg_rewinding the old master
hangs at the CREATE TEMPORARY TABLE step (CREATE TABLE waiting for
X/XXXXXXXX). At least this happened to one of our clients while
evaluating pg_rewind.To the user, the last thing printed is "need to copy XXXX MB [...]". If
the user cancels the pg_rewind command with ^C, the backend keeps
hanging around even in --dry-run mode. That won't hurt too much as it
does not seem to block future pg_rewind runs after synchronous_commit
has been set to a different value, but looks surprising to me.Not sure whether pg_rewind could error out gracefully without hanging in
this case,
My colleague Christoph Berg pointed out that pg_rewind could just set
synchronous_commit = local before creating the temporary table, which
indeed works, proof-of-concept patch attached
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael.banck@credativ.de
credativ GmbH, HRB M�nchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 M�nchengladbach
Gesch�ftsf�hrung: Dr. Michael Meskes, J�rg Folz, Sascha Heuer
Attachments:
pg_rewind.difftext/x-diff; charset=us-asciiDownload
diff --git a/src/bin/pg_rewind/libpq_fetch.c b/src/bin/pg_rewind/libpq_fetch.c
index 9239009..a5cb426 100644
--- a/src/bin/pg_rewind/libpq_fetch.c
+++ b/src/bin/pg_rewind/libpq_fetch.c
@@ -406,7 +406,7 @@ libpq_executeFileMap(filemap_t *map)
* First create a temporary table, and load it with the blocks that we
* need to fetch.
*/
- sql = "CREATE TEMPORARY TABLE fetchchunks(path text, begin int4, len int4);";
+ sql = "SET synchronous_commit = local; CREATE TEMPORARY TABLE fetchchunks(path text, begin int4, len int4);";
res = PQexec(conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
On Wed, Oct 5, 2016 at 11:53 PM, Michael Banck
<michael.banck@credativ.de> wrote:
On Wed, Oct 05, 2016 at 04:39:39PM +0200, Michael Banck wrote:
To the user, the last thing printed is "need to copy XXXX MB [...]". If
the user cancels the pg_rewind command with ^C, the backend keeps
hanging around even in --dry-run mode. That won't hurt too much as it
does not seem to block future pg_rewind runs after synchronous_commit
has been set to a different value, but looks surprising to me.
Oops.
Not sure whether pg_rewind could error out gracefully without hanging in
this case,My colleague Christoph Berg pointed out that pg_rewind could just set
synchronous_commit = local before creating the temporary table, which
indeed works, proof-of-concept patch attached
Even synchronous_commit = off would not matter much, and we could just
use that for performance reasons. The temporary table used in this
context is just used to track the delta chunks of blocks, so this
solution sounds better to me. I'll patch 9.4's pg_rewind similarly to
what is decided here, and we could as well use an extra PQexec call to
bring more clarity for the code, now an extra round-trip could be a
big deal where network lag matters, but compared to the COPY chunks
sent out that would not matter much I guess. I am just posting another
version, and added a CF entry to not lose track of it:
https://commitfest.postgresql.org/11/811/
--
Michael
Attachments:
rewind-sync.patchtext/x-diff; charset=US-ASCII; name=rewind-sync.patchDownload
diff --git a/src/bin/pg_rewind/libpq_fetch.c b/src/bin/pg_rewind/libpq_fetch.c
index 9239009..4bc2bc3 100644
--- a/src/bin/pg_rewind/libpq_fetch.c
+++ b/src/bin/pg_rewind/libpq_fetch.c
@@ -403,6 +403,17 @@ libpq_executeFileMap(filemap_t *map)
int i;
/*
+ * Set up connection context, with synchronous replication enabled the
+ * following command could stuck the connection.
+ */
+ sql = "SET synchronous_commit = off;";
+ res = PQexec(conn, sql);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not set up connection context: %s",
+ PQresultErrorMessage(res));
+ PQclear(res);
+
+ /*
* First create a temporary table, and load it with the blocks that we
* need to fetch.
*/
On 10/06/2016 02:24 AM, Michael Paquier wrote:
On Wed, Oct 5, 2016 at 11:53 PM, Michael Banck
<michael.banck@credativ.de> wrote:My colleague Christoph Berg pointed out that pg_rewind could just set
synchronous_commit = local before creating the temporary table, which
indeed works, proof-of-concept patch attachedEven synchronous_commit = off would not matter much, and we could just
use that for performance reasons. The temporary table used in this
context is just used to track the delta chunks of blocks, so this
solution sounds better to me. I'll patch 9.4's pg_rewind similarly to
what is decided here, and we could as well use an extra PQexec call to
bring more clarity for the code, now an extra round-trip could be a
big deal where network lag matters, but compared to the COPY chunks
sent out that would not matter much I guess.
Committed, thanks! I moved the call to where we establish the
connection, that felt slightly more natural.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Oct 6, 2016 at 7:37 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Committed, thanks! I moved the call to where we establish the connection,
that felt slightly more natural.
Thanks for the commit. Indeed that's better with the other sanity checks.
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers