PostgreSQL replication lag - Suggestions and questions

Started by Lucas · almost 5 years ago · 2 messages · general
#1 Lucas
root@sud0.nz

Hi,

I have a PostgreSQL 9.2.21 cluster, with one master and one slave using streaming replication.
I have a few points and questions about the replication, and was hoping you guys could share your opinions, suggestions and experiences.

Before I start: yes, I know... PG 9.2? Really? Well, we're working on a migration project. We're considering either EnterpriseDB or RDS (we're already on EC2 instances in AWS).

My current architecture is: master (AZ1) --> read slave (AZ2) --> hot-standby (AZ2) - They are all using streaming replication.

All read-only queries are sent to the read slave. Sometimes the replication lag between the master and the slaves reaches up to 10 minutes. I understand that some lag is expected in any replication scenario, and below are some suggestions that I think would help minimize the lag time.

- Having the read slave in the same AZ as its master, for better network throughput
- Having the latest PostgreSQL version, to get its best performance
  - For replication we use streaming replication, a native PostgreSQL solution first introduced in PostgreSQL 9.0
  - So we are using a very early version of it; many improvements have been introduced since 9.x that we're not taking advantage of
- Having the latest operating system behind PostgreSQL, to get its best I/O performance
  - We're still on Ubuntu 16.04.2 for both master and slaves. Again, a lot of performance improvements were introduced in newer Ubuntu releases (20.x) that we're not taking advantage of
- Consider changing the read slave to synchronous rather than asynchronous replication
- Consider having multiple slaves instead of just one big instance
- Consider spreading the load between the master and the slaves with pooling software (Pgpool)
  - Currently this is done at the application level (PHP)
  - The master could also serve read-only queries. Why not?
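For reference, the lag described above can be measured directly on the standby. A minimal sketch (pg_last_xact_replay_timestamp() has been available since 9.1, so this works on 9.2 as well):

```sql
-- On the standby: wall-clock delay between the last replayed transaction
-- and now. Note this value keeps growing during idle periods when there
-- are no new transactions to replay, so it can overstate real lag.
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;
```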

Do you agree?
Do you have any other suggestions?
Is there anything I could do now to minimize the replication lag, or since we're working on a migration there is no point wasting our time?

---
Regards,

Lucas



#2 Vijaykumar Jain
vijaykumarjain.github@gmail.com
In reply to: Lucas (#1)
Re: PostgreSQL replication lag - Suggestions and questions

> My current architecture is: master (AZ1) --> read slave (AZ2) -->
> hot-standby (AZ2) - They are all using streaming replication.

I am not sure of the difference between a read replica and a hot standby. At
least with later versions, a hot standby allows read queries.
Unless you specifically do not want queries to go to the AZ2 node, both of
them should be able to serve queries.
With libpq 10 onwards, you can provide multiple hosts in the connection
string, and it will "automatically" figure out the next available node to
query if one is not responding.
See
PostgreSQL: Documentation: 13: 33.1. Database Connection Control Functions
<https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-MULTIPLE-HOSTS>
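A sketch of such a multi-host connection string (libpq 10+; the host names
here are placeholders, and target_session_attrs=read-write makes libpq skip
hosts that only accept read-only sessions):

```
# keyword/value form
host=primary.example,replica.example port=5432,5432 dbname=app target_session_attrs=read-write

# equivalent URI form
postgresql://primary.example:5432,replica.example:5432/app?target_session_attrs=read-write
```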

> All read-only queries are sent to the read slave. Sometimes the
> replication lag between the master and the slaves reaches up to 10
> minutes. I understand that the lag is expected in any replication
> scenario, and below you can find some suggestions that I think would
> help to minimize the lag time.

Lag can be the result of multiple causes, like:
- intermittent connection issues;
- the network being unable to keep up with WAL changes. A typical example
would be any action that performs a rewrite of a huge table.
Replay conflicts can also result in queries getting cancelled on the
replica. You can tune the replica with the parameters below if you want to
avoid query cancellations and are OK with slightly stale data.
Alternatively, the application can initiate a retry for the cancelled-query
exception, if it can handle that at the app layer itself.
RUNTIME-CONFIG-REPLICATION-STANDBY
<https://www.postgresql.org/docs/current/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-STANDBY>
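For example, the trade-off could look like this in postgresql.conf on the
standby (the values are illustrative only; see the page linked above for
the full semantics):

```
# Let standby queries hold off WAL replay for up to 5 minutes before being
# cancelled. Replay, and therefore visible lag, can grow by that much.
max_standby_streaming_delay = 300s

# Ask the primary not to vacuum away rows the standby still needs,
# reducing cancellations at the cost of some bloat on the primary.
hot_standby_feedback = on
```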

If you cannot deal with stale data at all, then you have two options.
1) Synchronous replication.
In very simple terms, synchronous replication means that with a cluster
like Primary -> (Replica R1, Replica R2), the primary waits for an
acknowledgement from at least one of the replicas. This may impact your
TPS, but you get consistency.
Geo-distributed setups with synchronous replication would be problematic,
though.
2) Read from the primary alone.
This is always the safest option, as there is no intermediate network, but
it also means all the read load lands on the primary, which will impact db
performance and housekeeping work like vacuuming. And in case the server
goes down, until you fail over you cannot perform any reads or writes,
depending on the catch-up time.
SYNCHRONOUS-REPLICATION
<https://www.postgresql.org/docs/current/warm-standby.html#SYNCHRONOUS-REPLICATION>
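On the primary, this is enabled by naming the standbys; a sketch for the
R1/R2 cluster above (the standby names are hypothetical application_name
values, and the ANY syntax requires PostgreSQL 10+):

```
# postgresql.conf on the primary
synchronous_commit = on
# commit waits for an acknowledgement from at least one of the two standbys
synchronous_standby_names = 'ANY 1 (r1, r2)'
```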

> Having the read slave in the same AZ as its master - for better network
> throughput;
AZ spread is meant for dealing with availability. It may be OK to have some
read replicas in one AZ, but also have some in another AZ in case of a
major outage.
If you can use a load balancer, you can give more weight to local replicas
than to remote replicas, and use a health check based on replication lag to
ensure read requests are served only by nodes with tolerable lag.
scaling-postgresql-using-connection-poolers-and-load-balancers-for-an-enterprise-grade-environment/
<https://www.percona.com/blog/2018/10/02/scaling-postgresql-using-connection-poolers-and-load-balancers-for-an-enterprise-grade-environment/>
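Such a health check could be driven by a query like this on the primary
(the pg_wal_* names are PostgreSQL 10+; on 9.2 the equivalents are
pg_current_xlog_location(), pg_xlog_location_diff() and the
replay_location column):

```sql
-- Per-standby replication lag in bytes, as seen from the primary (PG 10+).
SELECT application_name,
       client_addr,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```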

> Having the latest PostgreSQL version to get its best performance

Almost always, yes. And it is not just performance improvements: a lot of
new features like partitioning, logical replication, stored procedures etc.
are only available in newer versions.
PostgreSQL: Release Notes <https://www.postgresql.org/docs/release/>

> Having the latest operating system behind PostgreSQL to get its best IO
> performance
Yes, at the software layer the latest versions would "mostly" be more
performant, but the hardware layer still plays an important role.
If you want to really validate the improvements, you can run
pgbench/sysbench stress tests on the versions of interest and see the
changes yourself.

Upgrades are always needed, not just for performance reasons but for
security reasons as well.
Also, keeping an upgrade on hold for a while may mean jumping across
multiple versions later; although that might work, you would need to read
the change logs across all those versions to ensure nothing gets broken.
Hence, regular upgrades avoid the pain.

--
Thanks,
Vijay
Mumbai, India