Return of the pg_wal issue.
Good afternoon,
Following the message quoted below, we are facing a similar issue, and I'm
getting a real buzz out of trying to fix this myself; speaking to the DBA in
my company left me a bit cold, as he is not strong with Postgres.
So I want to try to find a solution and fix this issue of the pg_wal files
filling up the drive at a ridiculous rate. I have been manually moving WAL
files to a different directory, but have had no luck finding an actual
solution.
The cluster is a three-node HA cluster running with Patroni.
Please help me out; I will mention that I have a test cluster spun up in case
something needs testing.
Also, I want to give a shout-out to Laurenz Albe for posting about WAL files
on his company blog.
Again, any help will be greatly appreciated.
" On one of our postgres instances we have the pg_wal/data folder up to
196GB, out of 200GB disk filled up.
This has stopped the posgresql.service this morning causing two
applications to crash.
Unfortunately our database admin is on leave today, and we are trying to
figure out how to get the disk down?
Any ideas or suggestions are more than welcome.
Thank you in advance."
--
Kind Regards,
Paul Brindusa
paulbrindusa88@gmail.com
On 1/22/25 09:33, Paul Brindusa wrote:
A good deal more information is needed to troubleshoot this:
1) Postgres version(s).
2) The Patroni version.
3) The Patroni configuration.
4) Definition of 'ridiculous rate'.
5) Relevant information from the logs.
" On one of our postgres instances we have the pg_wal/data folder up to
196GB, out of 200GB disk filled up.
This has stopped the posgresql.service this morning causing two
applications to crash.
Unfortunately our database admin is on leave today, and we are trying to
figure out how to get the disk down?
Any ideas or suggestions are more than welcome.Thank you in advance."
--
Kind Regards,
Paul Brindusa
paulbrindusa88@gmail.com <mailto:paulbrindusa88@gmail.com>
--
Adrian Klaver
adrian.klaver@aklaver.com
On Wed, 2025-01-22 at 17:33 +0000, Paul Brindusa wrote:
I guess you are referring to
https://www.cybertec-postgresql.com/en/why-does-my-pg_wal-keep-growing/
I listed all the reasons I know for your predicament.
Did you do some research along these lines?
If yes, what did you find?
Yours,
Laurenz Albe
Hopefully the information below gives a little more insight into the issue.
I will also mention that the cluster replicates data to another MySQL
database, if that's relevant at all.
Also worth noting: this is our production cluster, and we have a
pre-production cluster with basically the same settings where the issue does
not occur.
A good deal more information is needed to troubleshoot this:
1) Postgres version(s).
postgres (PostgreSQL) 15.10
2) The Patroni version.
patroni 4.0.4
3) The Patroni configuration.
scope: postgres-cluster
name: db01
namespace: /service/

log:
  level: INFO
  traceback_level: ERROR
  format: "%(asctime)s %(levelname)s: %(message)s"
  dateformat: ""
  max_queue_size: 1000
  dir: /var/log/patroni
  file_num: 4
  file_size: 25000000
  loggers:
    patroni.postmaster: WARNING
    urllib3: WARNING

restapi:
  listen: x.x.x.98:8008
  connect_address: x.x.x.98:8008

etcd3:
  hosts: db01.local:2379,db02.local:2379,db03.local:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        max_connections: 500
        superuser_reserved_connections: 5
        password_encryption: scram-sha-256
        max_locks_per_transaction: 512
        max_prepared_transactions: 0
        huge_pages: try
        shared_buffers: 128MB
        effective_cache_size: 4GB
        work_mem: 128MB
        maintenance_work_mem: 256MB
        checkpoint_timeout: 15min
        checkpoint_completion_target: 0.9
        min_wal_size: 80MB
        max_wal_size: 1GB
        wal_buffers: 32MB
        default_statistics_target: 1000
        seq_page_cost: 1
        random_page_cost: 4
        effective_io_concurrency: 2
        synchronous_commit: on
        autovacuum: on
        autovacuum_max_workers: 5
        autovacuum_vacuum_scale_factor: 0.01
        autovacuum_analyze_scale_factor: 0.01
        autovacuum_vacuum_cost_limit: 500
        autovacuum_vacuum_cost_delay: 2
        autovacuum_naptime: 1s
        max_files_per_process: 4096
        archive_mode: on
        archive_timeout: 1800s
        archive_command: cd .
        wal_level: replica
        wal_keep_size: 2GB
        max_wal_senders: 10
        max_replication_slots: 10
        hot_standby: on
        wal_log_hints: on
        wal_compression: on
        shared_preload_libraries: pgaudit
        track_io_timing: on
        log_lock_waits: on
        log_temp_files: 0
        track_activities: on
        track_counts: on
        track_functions: all
        log_checkpoints: on
        logging_collector: on
        log_truncate_on_rotation: on
        log_rotation_age: 1d
        log_rotation_size: 1GB
        log_line_prefix: '%m [%p]: [%l-1] db=%d,user=%u,app=%a,client=%h '
        log_filename: postgresql-%Y-%m-%d.log
        log_directory: /var/log/pgsql
        log_connections: on
        log_disconnections: on
        log_statement: ddl
        log_error_verbosity: verbose
        hot_standby_feedback: on
        max_standby_streaming_delay: 30s
        wal_receiver_status_interval: 10s
        idle_in_transaction_session_timeout: 10min
        jit: off
        max_worker_processes: 24
        max_parallel_workers: 8
        max_parallel_workers_per_gather: 2
        max_parallel_maintenance_workers: 2
  initdb:
    - encoding: UTF8
    - data-checksums
  pg_hba:
    - host replication replicator 127.0.0.1/32 md5
    - host replication replicator x.x.x.98/27 scram-sha-256
    - host replication replicator x.x.x.99/27 scram-sha-256
    - host replication replicator x.x.x.100/27 scram-sha-256
    - host all all 0.0.0.0/0 md5

postgresql:
  listen: x.x.x.98:5432
  connect_address: x.x.x.98:5432
  data_dir: /var/lib/pgsql/data
  bin_dir: /usr/bin
  pgpass: /var/lib/pgsql/.pgpass_patroni
  authentication:
    replication:
      username: replicator
      password: password
    superuser:
      username: postgres
      password: password
  parameters:
    unix_socket_directories: /var/run/postgresql
  remove_data_directory_on_rewind_failure: false
  remove_data_directory_on_diverged_timelines: false
  create_replica_methods:
    - basebackup
  basebackup:
    max-rate: '100M'
    checkpoint: 'fast'

watchdog:
  mode: required
  device: /dev/watchdog
  safety_margin: 5

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
4) Definition of 'ridiculous rate'.
1GB / day
5) Relevant information from the logs.
Below is an entry taken from today's log, up to this point in time, which I
think might be relevant. I cannot see any specifics; if there is anything
else you need, please let me know.
2<REDACTED>:<REDACTED> GMT [186889]: [863-1] db=,user=,app=,client= LOG:
00000: checkpoint starting: time
2<REDACTED>:<REDACTED> GMT [186889]: [864-1] db=,user=,app=,client=
LOCATION: LogCheckpointStart, xlog.c:6121
2<REDACTED>:<REDACTED> GMT [186889]: [865-1] db=,user=,app=,client= LOG:
00000: checkpoint complete: wrote 66 buffers (0.4%); 0 WAL file(s) added,
0 removed, 0 recycled; write=6.563 s, sync=0.003 s, total=6.619 s; sync
files=22, longest=0.002 s, average=0.001 s; distance=776 kB, estimate=56426
kB
2<REDACTED>:<REDACTED> GMT [186889]: [866-1] db=,user=,app=,client=
LOCATION: LogCheckpointEnd, xlog.c:6202
2<REDACTED>:<REDACTED> GMT [2439188]: [7-1]
db=documentation-database,user=documentation-database-user,app=PostgreSQL
JDBC Driver,client=<REDACTED> LOG: 00000: disconnection: session time:
0:<REDACTED> user=documentation-database-user
database=documentation-database host=<REDACTED> port=56170
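For reference, the total on-disk WAL volume can be quantified with a
read-only catalog query along these lines (pg_ls_waldir() requires superuser
or pg_monitor membership):

-- Number and total size of WAL segments currently in pg_wal
SELECT count(*) AS segments,
       pg_size_pretty(sum(size)) AS total_size
FROM pg_ls_waldir();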
@Laurenz
I guess you are referring to
https://www.cybertec-postgresql.com/en/why-does-my-pg_wal-keep-growing/
*Yes, that is the one.*
I listed all the reasons I know for your predicament.
Did you do some research along these lines?
*I've had a look at the things that you have mentioned in the guide.*
If yes, what did you find?
*I've not managed to test the queries out yet, but I am planning to test
them in my lab environment.*
*Sorry, I am really cautious about this, as those are the main production
databases.*
*Hope the above gives a bit of insight into the root cause of the problem.*
Yours,
Laurenz Albe
--
Kind Regards,
Paul Brindusa
paulbrindusa88@gmail.com
Hey Paul,
Regarding
*"I've not managed to test the queries out yet. But I am planning to test
out in my lab environment.*
*Sorry am really cautious about this as those are the main production
databases."*
As a dispassionate third-party observer, I can confirm that all SELECT and
SHOW queries from Laurenz's blog post are read-only. They're completely
safe to run in the affected environment.
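For instance, checks along the lines of that post look like this (a sketch;
the exact queries in the blog may differ slightly):

-- Replication slots that may be holding back WAL removal
SELECT slot_name, slot_type, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

-- Archiver health; repeated archive failures also retain WAL
SELECT archived_count, failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;

-- Retention-related settings
SHOW max_wal_size;
SHOW wal_keep_size;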
Hi Saul,
Fantastic, highly appreciated.
Initially I'd taken in the top bit of Laurenz's post to understand what
these WAL files are.
Any help is more than welcome; apologies for not checking that bit earlier.
In the meantime I've checked those queries as well.
Thank you
--
Kind Regards,
Paul Brindusa
paulbrindusa88@gmail.com
In a nutshell: Remember the ticker-tape of old? (You know, the festive
paper strips tossed out of high-rises by stockbrokers, that float down on
city parades after a world war is won in black-and-white movies?) Those were
serial records of market transactions. Much like those, the write-ahead
log files are records of every single DB action that isn't read-only.
Now, one could think "well if I've already committed the transactions to
the database, why do I need to keep the receipts?" And, well, strictly
speaking you don't -- but you really want to, because they could serve to
replay, *in perfect order,* the database activity for any given period.
This data is *gold* when it comes to, say, recovering from disaster.
This is why everybody will tell you "don't just delete these files, archive
them properly!" Again, for operational purposes, you could just delete
them. But you really want to make a *copy* of them before you do... you
know, *just in case* something bad happens to your DB that makes you want
to roll it back in time.
Enter the "archive_command" function. Instead of simply deleting them, you
tell PG to pass the files on to another piece of software that knows what
to do with this highly valuable data. To avoid complicating matters
further, at this point I'll simply recommend that you use PGBackRest for
this, it's my favorite piece of software beside the PG server.
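A typical wiring looks like the sketch below; in the Patroni config posted
earlier in this thread, it would replace the no-op archive_command: cd .
(the stanza name "main" is a placeholder):

# Under bootstrap -> dcs -> postgresql -> parameters in the Patroni config
archive_mode: on
archive_command: pgbackrest --stanza=main archive-push %p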
Or, you know, you could just delete them. You really don't want to, though.
Get pgBackRest going and the concepts will click for you as you progress
through setting it up: https://pgbackrest.org/user-guide.html
Cheers
Saul
Wow, hats off to you, kind sir!
Will defo look into this. Thank you for your support with this.
--
Kind Regards,
Paul Brindusa
paulbrindusa88@gmail.com
On 1/23/25 06:51, Saul Perdomo wrote:
This is why everybody will tell you "don't just delete these files,
archive them properly!" Again, for operational purposes, you could just
delete them. But you really want to make a /copy/ of them before you
do... you know, /just in case/ something bad happens to your DB that
makes you want to roll it back in time.
No, you can't just delete them for operational purposes without knowing
whether they are still needed or not.
Per:
https://www.postgresql.org/docs/current/wal-intro.html
and
https://www.postgresql.org/docs/current/wal-configuration.html
Short version: a WAL file must remain until a checkpoint is done that
makes its content no longer needed.
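One read-only way to see where that boundary currently sits is
pg_control_checkpoint():

-- WAL at or after redo_lsn is still needed for crash recovery
SELECT checkpoint_lsn, redo_lsn, redo_wal_file
FROM pg_control_checkpoint();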
--
Adrian Klaver
adrian.klaver@aklaver.com
On 1/23/25 03:40, Paul Brindusa wrote:
Hopefully the information below gives a little more insight into the issue.
I will also mention that the cluster replicates data to another MySQL
database, if that's relevant at all.
Yeah, how is that done?
Also worth noting: this is our production cluster, and we have a
pre-production cluster with basically the same settings where the issue
does not occur.
And the difference in settings is?
--
Adrian Klaver
adrian.klaver@aklaver.com
Thanks for the correction, Adrian - my oversimplification went too far,
into "plain wrong" territory.
(The detail that I felt was too much for this explanation was: "and the way
to simply get rid of them would be to set your archive_command to
'/bin/true', say" - but I didn't want to make it seem like I was suggesting
Paul do that.)
Good morning everyone,
Following some troubleshooting last night, we have managed to resolve the
issue.
Plowing through the entire thing, we came to the conclusion that the
cluster was running but not replicating.
So the number one lesson learned is to always check *replication* in the
cluster, for the sake of data safety and not having to go through a
million things.
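For anyone landing on this thread later, a quick read-only check on the
primary shows whether the standbys are actually streaming (an empty result
would have flagged our problem immediately):

-- One row per connected standby; no rows means nothing is replicating
SELECT client_addr, state, sent_lsn, replay_lsn
FROM pg_stat_replication;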
This cluster is set up without a VIP, so db01 will always be the master.
With it set up this way, we found that pg_hba.conf had this line:
host all postgres db01/cidr trust
From my understanding, this means that the master was trying to replicate
to itself and not trusting the other nodes?
To fix it, we put in the entire network:
host all postgres network/cidr trust
Following the config amendment, we reinitialized the replicas with:
patronictl -c /path/patroni/config reinit <cluster> <host>
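(For completeness: after the reinit, patronictl -c /path/patroni/config list
is a quick way to confirm each member's role and state.)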
Happy to say that cleanup of the WAL files kicked in, and the /var volume
is now down 4% from 96% usage.
Now then, there is still the bit about the actual Postgres logs not
rotating properly, lol, but I'll leave that for another email.
Massive thank you to all of you for the support.
--
Kind Regards,
Paul Brindusa
paulbrindusa88@gmail.com