Determine state of cluster (HA)
-Hackers,
I had a long call with a firm developing a front-end proxy/cache/HA layer for
Postgres today. Essentially the software is a replacement for PGPool in its
entirety, but it also supports analytics etc. When I was asking them about
pain points they talked about the below, and I was wondering if this is a
problem we would like to solve:
Per your request, here is our failover issue.
1. In a modern devops environment, the database should be able to scale
and morph over time based on need.
2. Tools that are leveraging the database should be able to easily
discover and potentially control (with permissions) the database.
Currently, you can discover the master and what nodes are syncing off of
it, but on a failure, a tool can't easily discover what orchestration
has done on the back-end to make the cluster whole again, i.e. from the
slave, you can't discover the master reliably and easily.
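For illustration, on the master a query along these lines lists the nodes
streaming from it, while nothing equivalent on a slave reliably points back
at the current master:

    -- run on the master: one row per connected standby
    SELECT application_name, client_addr, state, sync_state
    FROM pg_stat_replication;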
The logic that our code now uses is to:
1. Find the master
2. Add replication nodes per the master's configuration.
To find a master, we start with a list of candidate nodes that MAY be a
master at any point, and:
1. issue "SELECT pg_is_in_recovery()" to find if it is a slave
a. If so, use "SELECT pg_read_file('recovery.conf')" to extract the host
b. Attempt to connect to the host directly, if not...
c. otherwise, go through the slave and use that hostname via dblink to
reach the master, i.e. select * from dblink('" + connInfo + "
dbname=postgres', 'select inet_server_addr()') AS t(inet_server_addr
inet). This is necessary in the event the hostname used in the
recovery.conf file is not resolvable from the outside.
d. Use the dblink connection to ID the master node via select
inet_server_addr();
e. connect to the IP provided by the master.
f. Repeat through nodes until we get a master.
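A minimal sketch of those probes in SQL (this assumes the dblink extension is
installed, the superuser rights pg_read_file needs by default, and a
placeholder hostname):

    -- step 1: is this node a slave?
    SELECT pg_is_in_recovery();

    -- step a: if so, pull the upstream host out of recovery.conf
    SELECT pg_read_file('recovery.conf');

    -- steps c/d: go through the slave via dblink and ask the master for its address
    SELECT *
    FROM dblink('host=upstream-host dbname=postgres',
                'SELECT inet_server_addr()') AS t(inet_server_addr inet);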
Issues:
1. The dblink call doesn't have a way to specify a timeout, so we have
to use Java futures to cap how long it can take at a reasonable amount
of time;
2. NAT mapping may result in us detecting IP ranges that are not
accessible to the application nodes.
3. There is no easy way to monitor for state changes as they happen,
which would allow faster failovers; everything has to be polled rather
than driven by events;
4. It doesn't support cascading replication very well, although we
could augment the logic to allow us to map the relationship between nodes.
5. There is no way to connect to a db node with something akin to
SQL-Server's "application intent" flags, to allow a connection to be
rejected if we wish it to be a read/write connection. This helps detect
the state of the node directly without having to ask any further
questions of the node, and makes it easier to "stall" during connection
until a proper connection can be made.
6. The master, on shutdown, will not actually close and disable
connections as it shuts down; instead, it issues an error that it is
shutting down as it does so.
Fundamentally, the biggest issue is that it is very hard to determine
the state of the cluster by asking all the nodes, in particular in the
case of a failure. Some state information is lost that is necessary to
talk to the cluster moving forward in a reliable manner.
--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
PostgreSQL Centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://pgconf.us
***** Unless otherwise stated, opinions are my own. *****
On 10/12/2017 05:50 PM, Joshua D. Drake wrote:
-Hackers,
Bumping this.
On 13 October 2017 at 08:50, Joshua D. Drake <jd@commandprompt.com> wrote:
-Hackers,
I had a long call with a firm developing front end proxy/cache/HA for
Postgres today. Essentially the software is a replacement for PGPool in
entirety but also supports analytics etc... When I was asking them about
pain points they talked about the below and I was wondering if this is a
problem we would like to solve.
IMO: no one node knows the full state of the system, or can know it.
I'd love PostgreSQL to help users more with scaling, HA, etc. But I
think it's a big job. We'd need:
- a node topology of some kind, communicated between nodes
- heartbeat and monitoring
- failover coordination
- pooling/proxying
- STONITH/fencing
- etc.
That said, I do think it'd be very desirable for us to introduce a
greater link from a standby to master:
- Get info about master. We should finish merging recovery.conf into
postgresql.conf.
b. Attempt to connect to the host directly, if not...
c. use the slave and use the hostname via dblink to connect to the master,
as the hostname , i.e. select * from dblink('" + connInfo + "
dbname=postgres', 'select inet_server_addr()') AS t(inet_server_addr inet).
This is necessary in the event the hostname used in the recovery.conf file
is not resolvable from the outside.
OK, so "connect directly" here means from some 3rd party, the one
doing the querying of the replica.
1. The dblink call doesn't have a way to specify a timeout, so we have to
use Java futures to control how long this may take to a reasonable amount of
time;
statement_timeout doesn't work?
If not, that sounds like a sensible, separate feature to add. Patches welcome!
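Untested, but something along these lines should bound both the connection
attempt (connect_timeout in the dblink connstring) and the probe itself;
the address is a placeholder:

    SET statement_timeout = '5s';
    SELECT *
    FROM dblink('host=10.0.0.5 dbname=postgres connect_timeout=5',
                'SELECT inet_server_addr()') AS t(inet_server_addr inet);
    RESET statement_timeout;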
2. NAT mapping may result in us detecting IP ranges that are not accessible
to the application nodes.
PostgreSQL can't do anything about this one.
3. there is no easy way to monitor for state changes as they happen,
allowing faster failovers, everything has to be polled based on events;
It'd be pretty simple to write a function that sleeps in the backend
until it's promoted. I don't know off the top of my head if we set all
proc latches when we promote, but if we don't it's probably harmless
and somewhat useful to do so.
Either way, you'd do long-polling. Call the function and let the
connection block until something interesting happens. Use TCP
keepalives to make sure you notice if it dies. Have the function
return when the state changes.
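A rough sketch of what I mean, with a plain sleep loop standing in for a
proper latch wait and a made-up function name:

    CREATE FUNCTION wait_for_promotion(check_interval float8 DEFAULT 1.0)
    RETURNS void
    LANGUAGE plpgsql AS $$
    BEGIN
        -- block this connection until the node leaves recovery,
        -- i.e. until it has been promoted
        WHILE pg_is_in_recovery() LOOP
            PERFORM pg_sleep(check_interval);
        END LOOP;
    END;
    $$;

    -- client side: call it and let the connection block until promotion
    SELECT wait_for_promotion();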
4. It doesn't support cascading replication very well, although we could
augment the logic to allow us to map the relationship between nodes.
5. There is no way to connect to a db node with something akin to
SQL-Server's "application intent" flags, to allow a connection to be
rejected if we wish it to be a read/write connection. This helps detect the
state of the node directly without having to ask any further questions of
the node, and makes it easier to "stall" during connection until a proper
connection can be made.
That sounds desirable, and a good step toward eventually being able to
transparently re-route read/write queries from replica to master.
Which is where I'd like to land up eventually.
Again, that'd be a sensible patch to submit, quite separately to the
other topics.
6. The master, on shutdown, will not actually close and disable connections
as it shuts down, instead, it will issue an error that it is shutting down
as it does so.
Er, yes? I don't understand what you are getting at here.
Can you describe expected vs actual behaviour in more detail?
On Mon, Oct 16, 2017 at 4:39 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
On 13 October 2017 at 08:50, Joshua D. Drake <jd@commandprompt.com> wrote:
5. There is no way to connect to a db node with something akin to
SQL-Server's "application intent" flags, to allow a connection to be
rejected if we wish it to be a read/write connection. This helps detect the
state of the node directly without having to ask any further questions of
the node, and makes it easier to "stall" during connection until a proper
connection can be made.
That sounds desirable, and a good step toward eventually being able to
transparently re-route read/write queries from replica to master.
Which is where I'd like to land up eventually.
It also sounds a lot like the connection parameter target_session_attrs,
does it not? We don't reroute active connections based on it, and we're not
smart enough to do anything beyond "try them one by one until you reach the
one with the correct attributes", but the basic functionality is there.
Basically what we already have fulfills what JD is suggesting, but not what
Craig is, if I understand it correctly.
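For reference, the libpq side of that (PostgreSQL 10, host names are
placeholders) looks something like:

    psql "host=node1,node2,node3 dbname=app target_session_attrs=read-write"

libpq tries the hosts in order and only accepts a session where SHOW
transaction_read_only reports off.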
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
On Mon, 16 Oct 2017 10:39:16 +0800
Craig Ringer <craig@2ndquadrant.com> wrote:
On 13 October 2017 at 08:50, Joshua D. Drake <jd@commandprompt.com> wrote:
I had a long call with a firm developing front end proxy/cache/HA for
Postgres today. Essentially the software is a replacement for PGPool in
entirety but also supports analytics etc... When I was asking them about
pain points they talked about the below and I was wondering if this is a
problem we would like to solve.
IMO: no one node knows the full state of the system, or can know it.
+1
I'd love PostgreSQL to help users more with scaling, HA, etc. But I
think it's a big job. We'd need:
- a node topology of some kind, communicated between nodes
- heartbeat and monitoring
- failover coordination
- pooling/proxying
- STONITH/fencing
- etc.
And some of the items on this list cannot be in core. However, there are
some things PostgreSQL can do to make HA easier to deal with.
That said, I do think it'd be very desirable for us to introduce a
greater link from a standby to master:
- Get info about master. We should finish merging recovery.conf into
postgresql.conf.
agree. +1.
To make things easier from the "cluster manager" piece outside of PostgreSQL, I
would add:
* being able to "demote" a master as a standby without restart.
* being able to check the status of each node without eating a backend
connection (to avoid hitting "max_connection" limit)
* being able to monitor each step of a switchover (or "controlled
failover": standby/master role swapping between two nodes)
b. Attempt to connect to the host directly, if not...
c. use the slave and use the hostname via dblink to connect to the master,
as the hostname , i.e. select * from dblink('" + connInfo + "
dbname=postgres', 'select inet_server_addr()') AS t(inet_server_addr inet).
This is necessary in the event the hostname used in the recovery.conf file
is not resolvable from the outside.
OK, so "connect directly" here means from some 3rd party, the one
doing the querying of the replica.
It seems to me the failover process is issuing all required commands to move
the master role to another available standby. The knowledge of the
orchestration and of the final status (if everything went well) is in this
piece of software. If you want to know where your master is in an exotic or
complex setup, ask whoever was responsible for promoting it.
HA should stay as simple as possible. The more complex the architecture, the
more failure scenarios you will have.
1. The dblink call doesn't have a way to specify a timeout, so we have to
use Java futures to control how long this may take to a reasonable amount of
time;
statement_timeout doesn't work?
If not, that sounds like a sensible, separate feature to add. Patches welcome!
2. NAT mapping may result in us detecting IP ranges that are not accessible
to the application nodes.
PostgreSQL can't do anything about this one.
You could get the master IP address from the "pg_stat_wal_receiver" view, but
even this is not enough. You might have separate, dedicated networks for
applications and for pgsql replication. If you want a standby
to tell the application where to connect to the master, then you'll have to
put this information somewhere yourself, accessible from the application nodes.
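To illustrate the first point, on a standby something like:

    SELECT conninfo FROM pg_stat_wal_receiver;

returns the connection string the walreceiver is actually using, which is
exactly why it may name a host on the replication network rather than one
the application nodes can reach.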
3. there is no easy way to monitor for state changes as they happen,
allowing faster failovers, everything has to be polled based on events;
In the corosync world (the clustering piece of the Pacemaker ecosystem), node
failures are detected really, really fast, in about 1s.
Considering application failure (pgsql here), this will be polling, yes. But I
fail to imagine how a dying application could warn the cluster before dying.
Not only when crashing (systemd could help there), but e.g. before entering an
infinite dummy loop or an exhausting one.
It'd be pretty simple to write a function that sleeps in the backend
until it's promoted. I don't know off the top of my head if we set all
proc latches when we promote, but if we don't it's probably harmless
and somewhat useful to do so.
As soon as the cluster manager has promoted a new master, it can trigger an
event to notify whatever you need.
Either way, you'd do long-polling. Call the function and let the
connection block until something interesting happens. Use TCP
keepalives to make sure you notice if it dies. Have the function
return when the state changes.
This would still rely on TCP keepalive frequency, back to polling :(
Regards,
On 10/15/2017 07:39 PM, Craig Ringer wrote:
On 13 October 2017 at 08:50, Joshua D. Drake <jd@commandprompt.com> wrote:
-Hackers,
I had a long call with a firm developing front end proxy/cache/HA for
Postgres today. Essentially the software is a replacement for PGPool in
entirety but also supports analytics etc... When I was asking them about
pain points they talked about the below and I was wondering if this is a
problem we would like to solve.
IMO: no one node knows the full state of the system, or can know it.
That isn't exactly true. We do know if our replication state is current,
but only from the master, which is part of the problem.
I'd love PostgreSQL to help users more with scaling, HA, etc. But I
think it's a big job. We'd need:
- a node topology of some kind, communicated between nodes
- heartbeat and monitoring
- failover coordination
- pooling/proxying
- STONITH/fencing
- etc.
I don't think we need all of that. This is more of a request to make it
easier for those deploying HA to determine the state of Postgres.
That said, I do think it'd be very desirable for us to introduce a
greater link from a standby to master:
- Get info about master. We should finish merging recovery.conf into
postgresql.conf.
Definitely.
b. Attempt to connect to the host directly, if not...
c. use the slave and use the hostname via dblink to connect to the master,
as the hostname , i.e. select * from dblink('" + connInfo + "
dbname=postgres', 'select inet_server_addr()') AS t(inet_server_addr inet).
This is necessary in the event the hostname used in the recovery.conf file
is not resolvable from the outside.
OK, so "connect directly" here means from some 3rd party, the one
doing the querying of the replica.
1. The dblink call doesn't have a way to specify a timeout, so we have to
use Java futures to control how long this may take to a reasonable amount of
time;
statement_timeout doesn't work?
That would be a workaround, definitely, but I think it would be better to
say: ALTER SYSTEM SET PROMOTE TIMEOUT '120' (an example, let's not get off
into the weeds :P) and have the standby promote itself if it can't receive a
ping/ack within 120 seconds.
PostgreSQL can't do anything about this one.
Yes that's true.
4. It doesn't support cascading replication very well, although we could
augment the logic to allow us to map the relationship between nodes.
5. There is no way to connect to a db node with something akin to
SQL-Server's "application intent" flags, to allow a connection to be
rejected if we wish it to be a read/write connection. This helps detect the
state of the node directly without having to ask any further questions of
the node, and makes it easier to "stall" during connection until a proper
connection can be made.
That sounds desirable, and a good step toward eventually being able to
transparently re-route read/write queries from replica to master.
Which is where I'd like to land up eventually.
Again, that'd be a sensible patch to submit, quite separately to the
other topics.
Great.
6. The master, on shutdown, will not actually close and disable connections
as it shuts down, instead, it will issue an error that it is shutting down
as it does so.
Er, yes? I don't understand what you are getting at here.
Yes, I will need to go back to them on this one. I think what they mean
is that if we have a connection that is getting closed it doesn't return
why it is closing. It just throws an error.
Can you describe expected vs actual behaviour in more detail?
I will need to get back to them on this but I think the behavior would
be to have a return value of why the connection was closed vs just
throwing an error. Say, "RETURN 66" means someone executed
pg_going_to_failover() vs pg_terminate_backend() which could be for
different reasons.
Thanks for responding, I am mostly the intermediary here,
JD
--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
PostgreSQL Centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://pgconf.us
***** Unless otherwise stated, opinions are my own. *****
On 10/16/2017 03:55 AM, Magnus Hagander wrote:
On Mon, Oct 16, 2017 at 4:39 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
On 13 October 2017 at 08:50, Joshua D. Drake <jd@commandprompt.com> wrote:
5. There is no way to connect to a db node with something akin to
SQL-Server's "application intent" flags, to allow a connection to be
rejected if we wish it to be a read/write connection. This helps detect the
state of the node directly without having to ask any further questions of
the node, and makes it easier to "stall" during connection until a proper
connection can be made.
That sounds desirable, and a good step toward eventually being able to
transparently re-route read/write queries from replica to master.
Which is where I'd like to land up eventually.
It also sounds a lot like the connection parameter target_session_attrs
Ahh, this is part of the new libpq failover right?
Thanks,
JD
--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
PostgreSQL Centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://pgconf.us
***** Unless otherwise stated, opinions are my own. *****
On 17 October 2017 at 01:02, Joshua D. Drake <jd@commandprompt.com> wrote:
On 10/15/2017 07:39 PM, Craig Ringer wrote:
On 13 October 2017 at 08:50, Joshua D. Drake <jd@commandprompt.com> wrote:
-Hackers,
I had a long call with a firm developing front end proxy/cache/HA for
Postgres today. Essentially the software is a replacement for PGPool in
entirety but also supports analytics etc... When I was asking them about
pain points they talked about the below and I was wondering if this is a
problem we would like to solve.
IMO: no one node knows the full state of the system, or can know it.
That isn't exactly true. We do know if our replication state is current but
only from the master which is part of the problem.
Sure. But unless you have a perfectly-reliable, latency-ignoring
wormhole link between master and standby, the standby always has
imperfect knowledge of the master. More importantly, it can never know
for sure how old its knowledge is.
https://www.postgresql.org/docs/current/static/monitoring-stats.html#PG-STAT-WAL-RECEIVER-VIEW
already does about the best we can probably do. In particular
last_msg_send_time and last_msg_receipt_time, used in combination with
latest_end_lsn and latest_end_time.
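i.e. something along the lines of, run on the standby:

    SELECT status,
           last_msg_send_time, last_msg_receipt_time,
           latest_end_lsn, latest_end_time
    FROM pg_stat_wal_receiver;

which tells you what the standby last heard from the master and how stale
that knowledge may be.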
That said, I do think it'd be very desirable for us to introduce a
greater link from a standby to master:
- Get info about master. We should finish merging recovery.conf into
postgresql.conf.
Definitely.
There's a patch from Abhijit Menon-Sen you could adopt for PostgreSQL
11 for that.
1. The dblink call doesn't have a way to specify a timeout, so we have to
use Java futures to control how long this may take to a reasonable amount of
time;
statement_timeout doesn't work?
That would be a work around definitely but I think it would be better to
say: ALTER SYSTEM SET PROMOTE TIMEOUT '120' (Example, let's not get off into
the weeds :P) and if the standby can't receive a ping/ack within 120 it will
promote itself.
I'm confused by this. I thought you were talking about timeouts
querying status of an upstream over dblink. Not automatic
self-promotion.
I'm really not a fan of Pg standbys self-promoting without working
with an external co-ordinator that handles STONITH/fencing. It's a
recipe for disaster. That's what I was saying upthread, that
implementing bits and pieces here can be quite dangerous.
This also takes it well outside what you were talking about, improving
the ability to detect Pg's state, and into having it become its own
co-ordinator for HA actions.
So lets go back to the original question. What's missing that
statement_timeout doesn't provide for querying remote servers for
their status over dblink?
If you want a nicer way to say "look up whatever your conninfo in
recovery.conf is, connect to it, get me some info on it and return it,
possibly daisy-chaining up a chain of replicas if you reach the
master" ... that's fine. But it's a different thing.
Er, yes? I don't understand what you are getting at here.
Yes, I will need to go back to them on this one. I think what they mean is
that if we have a connection that is getting closed it doesn't return why it
is closing. It just throws an error.
Yes, we do. From
https://www.postgresql.org/docs/current/static/errcodes-appendix.html:
Class 57 — Operator Intervention
57000 operator_intervention
57014 query_canceled
57P01 admin_shutdown
57P02 crash_shutdown
57P03 cannot_connect_now
57P04 database_dropped
Maybe they want more granularity in terms of what reasons are given
and what errors are reported. That's fine. But please provide
specifics.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 10/16/2017 07:31 PM, Craig Ringer wrote:
On 17 October 2017 at 01:02, Joshua D. Drake <jd@commandprompt.com> wrote:
On 10/15/2017 07:39 PM, Craig Ringer wrote:
- Get info about master. We should finish merging recovery.conf into
postgresql.conf.
Definitely.
There's a patch from Abhijit Menon-Sen you could adopt for PostgreSQL
11 for that.
Do you have a link to this?
Thanks!
JD
--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
PostgreSQL Centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://pgconf.us
***** Unless otherwise stated, opinions are my own. *****
On 10/17/2017 08:40 PM, Joshua D. Drake wrote:
On 10/16/2017 07:31 PM, Craig Ringer wrote:
On 17 October 2017 at 01:02, Joshua D. Drake <jd@commandprompt.com> wrote:
On 10/15/2017 07:39 PM, Craig Ringer wrote:
- Get info about master. We should finish merging recovery.conf into
postgresql.conf.
Definitely.
There's a patch from Abhijit Menon-Sen you could adopt for PostgreSQL
11 for that.
Do you have a link to this?
https://commitfest.postgresql.org/search/?searchterm=recovery.conf
--
Vik Fearing +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support