Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

Started by hubert depesz lubaczewski8 months ago40 messagesgeneral
Jump to latest

Hi,
we have following situation: Pg 14.17 (yes, I know, but it can't be
upgraded now/soon), on Ubuntu focal, in AWS cloud on EC2 server using
arm64 architecture.

All works, is fine.

Every now and then (usually every 3-5 minutes, but not through the whole
day), we see situations where every query suddently takes ~ 1 second.

I'm talkign about "prepare" for preparing statements. binds. also
"discard all".

We have logging enabled to csv log, which offers milisecond precicision.
For ~ 1 second there are no logs going to log (we usually have at 5-20
messages logged per second), no connection, nothing. And then we get
bunch (30+) messages with the same milisecond time.

And they all have insane durations - 800-1300ms for virtually anything.

After such second, everything works fine, without any problems.

Up to next case.

Unfortunately due to short duration of such things, and the fact that
literally *everything* is paused for this 1 second, it's hard to
debug/diagnose.

Servers have memory that is almost 2x total db size (200gb vs. 384gb of
ram), so disk shouldn't be an issue.

Aside from this, we don't see any other problems.

Any idea how to look at it, what to look for, to be able to diagnose the
issue?

We do use some savepoints, but it's hard to tell when exactly they
happen (we usually log only queries that take more than 250ms, and
queries that use savepoints are usually much faster).

The DB server in question has ~ 150 connections, and handles, at this
time, we had ~ 40-50 ktps.

Logging is set using:

name │ setting
═══════════════════════════════════╪════════════════════════════════
log_autovacuum_min_duration │ 0
log_checkpoints │ on
log_connections │ on
log_destination │ csvlog
log_directory │ /cache/postgres_logs
log_disconnections │ off
log_duration │ off
log_error_verbosity │ default
log_executor_stats │ off
log_file_mode │ 0600
log_filename │ postgresql-%Y-%m-%d_%H%M%S.log
log_hostname │ off
log_line_prefix │ db=%d,user=%u
log_lock_waits │ on
log_min_duration_sample │ 0
log_min_duration_statement │ 0
log_min_error_statement │ error
log_min_messages │ warning
log_parameter_max_length │ -1
log_parameter_max_length_on_error │ 0
log_parser_stats │ off
log_planner_stats │ off
log_recovery_conflict_waits │ off
log_replication_commands │ off
log_rotation_age │ 60
log_rotation_size │ 51200
log_statement │ none
log_statement_sample_rate │ 0.0001
log_statement_stats │ off
log_temp_files │ 0
log_timezone │ UTC
log_transaction_sample_rate │ 0
log_truncate_on_rotation │ off
logging_collector │ on

and it doesn't seem to show anything interesting. Checkpoints happen,
but they don't seem correlated in any way.

Any ideas?

Best regards,

depesz

#2Adrian Klaver
adrian.klaver@aklaver.com
In reply to: hubert depesz lubaczewski (#1)
Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On 8/19/25 10:21, hubert depesz lubaczewski wrote:

Hi,
we have following situation: Pg 14.17 (yes, I know, but it can't be
upgraded now/soon), on Ubuntu focal, in AWS cloud on EC2 server using
arm64 architecture.

All works, is fine.

Every now and then (usually every 3-5 minutes, but not through the whole
day), we see situations where every query suddently takes ~ 1 second.

Given the subject line, what you are reporting is happening on the
replica, correct?

If so where is the replica relative to the primary, in terms of network
distance?

Also what are the 'hardware' specifications on the replica instance?

--
Adrian Klaver
adrian.klaver@aklaver.com

#3Scot Kreienkamp
Scot.Kreienkamp@la-z-boy.com
In reply to: Adrian Klaver (#2)
RE: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

匀礀渀挀栀爀漀渀漀甀猀 挀漀洀洀椀琀㼀  伀爀 愀猀礀渀挀栀爀漀渀漀甀猀㼀ഀ਀ഀ਀ഀ਀匀挀漀琀 䬀爀攀椀攀渀欀愀洀瀀 簀 䄀瀀瀀氀椀挀愀琀椀漀渀猀 䤀渀昀爀愀猀琀爀甀挀琀甀爀攀 䄀爀挀栀椀琀攀挀琀 簀 䰀愀ⴀ娀ⴀ䈀漀礀 䌀漀爀瀀漀爀愀琀攀ഀ਀⠀㜀㌀㐀⤀ ㌀㠀㐀ⴀ㘀㐀 ㌀ 簀 ㄀ⴀ㜀㌀㐀ⴀ㤀㄀㔀ⴀ㄀㐀㐀㐀 簀 匀挀漀琀⸀䬀爀攀椀攀渀欀愀洀瀀䀀氀愀ⴀ稀ⴀ戀漀礀⸀挀漀洀ഀ਀伀渀攀 䰀愀ⴀ娀ⴀ䈀漀礀 䐀爀椀瘀攀 簀 䴀漀渀爀漀攀Ⰰ 䴀椀挀栀椀最愀渀 㐀㠀㄀㘀㈀ 簀 氀愀ⴀ稀ⴀ戀漀礀⸀挀漀洀㰀栀琀琀瀀㨀⼀⼀眀眀眀⸀氀愀ⴀ稀ⴀ戀漀礀⸀挀漀洀⼀㸀ഀ਀ 昀愀挀攀戀漀漀欀⸀挀漀洀⼀氀愀稀戀漀礀㰀栀琀琀瀀㨀⼀⼀昀愀挀攀戀漀漀欀⸀挀漀洀⼀氀愀稀戀漀礀㸀  簀 椀渀猀琀愀最爀愀洀⸀挀漀洀⼀氀愀稀戀漀礀㰀栀琀琀瀀猀㨀⼀⼀椀渀猀琀愀最爀愀洀⸀挀漀洀⼀氀愀稀戀漀礀㸀 簀 礀漀甀琀甀戀攀⸀挀漀洀⼀氀愀稀戀漀礀㰀栀琀琀瀀㨀⼀⼀礀漀甀琀甀戀攀⸀挀漀洀⼀氀愀稀戀漀礀㸀ഀ਀ഀ਀ഀ਀嬀挀椀搀㨀氀愀稀戀漀礀开㈀ ㈀㐀开椀渀挀开渀愀瘀礀开瀀洀猀㈀㄀㠀㤀开爀最戀开爀攀瘀㈀ ㈀㔀开攀㐀 愀㤀㐀昀㈀ⴀ攀㌀㐀㐀ⴀ㐀愀㐀愀ⴀ愀 ㈀挀ⴀ昀挀㌀㄀㤀㤀㘀攀㄀㈀㜀挀⸀瀀渀最崀ഀ਀ഀ਀吀栀椀猀 洀攀猀猀愀最攀 椀猀 椀渀琀攀渀搀攀搀 漀渀氀礀 昀漀爀 琀栀攀 椀渀搀椀瘀椀搀甀愀氀 漀爀 攀渀琀椀琀礀 琀漀 眀栀椀挀栀 椀琀 椀猀 愀搀搀爀攀猀猀攀搀⸀ 䤀琀 洀愀礀 挀漀渀琀愀椀渀 瀀爀椀瘀椀氀攀最攀搀Ⰰ 挀漀渀昀椀搀攀渀琀椀愀氀 椀渀昀漀爀洀愀琀椀漀渀 眀栀椀挀栀 椀猀 攀砀攀洀瀀琀 昀爀漀洀 搀椀猀挀氀漀猀甀爀攀 甀渀搀攀爀 愀瀀瀀氀椀挀愀戀氀攀 氀愀眀猀⸀ 䤀昀 礀漀甀 愀爀攀 渀漀琀 琀栀攀 椀渀琀攀渀搀攀搀 爀攀挀椀瀀椀攀渀琀Ⰰ 礀漀甀 愀爀攀 猀琀爀椀挀琀氀礀 瀀爀漀栀椀戀椀琀攀搀 昀爀漀洀 搀椀猀猀攀洀椀渀愀琀椀渀最 漀爀 搀椀猀琀爀椀戀甀琀椀渀最 琀栀椀猀 椀渀昀漀爀洀愀琀椀漀渀 ⠀漀琀栀攀爀 琀栀愀渀 琀漀 琀栀攀 椀渀琀攀渀搀攀搀 爀攀挀椀瀀椀攀渀琀⤀ 漀爀 挀漀瀀礀椀渀最 琀栀椀猀 椀渀昀漀爀洀愀琀椀漀渀⸀ 䤀昀 礀漀甀 栀愀瘀攀 爀攀挀攀椀瘀攀搀 琀栀椀猀 挀漀洀洀甀渀椀挀愀琀椀漀渀 椀渀 攀爀爀漀爀Ⰰ 瀀氀攀愀猀攀 渀漀琀椀昀礀 甀猀 椀洀洀攀搀椀愀琀攀氀礀 戀礀 攀ⴀ洀愀椀氀 漀爀 戀礀 琀攀氀攀瀀栀漀渀攀 愀琀 琀栀攀 愀戀漀瘀攀 渀甀洀戀攀爀⸀ 吀栀愀渀欀 礀漀甀⸀ഀ਀

Attachments:

winmail.datapplication/ms-tnef; name=winmail.datDownload
#4Alban Hertroys
haramrae@gmail.com
In reply to: Scot Kreienkamp (#3)
Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On 19 Aug 2025, at 22:25, Scot Kreienkamp <Scot.Kreienkamp@la-z-boy.com> wrote:

匀礀渀挀栀爀漀渀漀甀猀 挀漀洀洀椀琀㼀  伀爀 愀猀礀渀挀栀爀漀渀漀甀猀㼀ഀ਀ഀ਀ഀ਀匀挀漀琀 䬀爀攀椀攀渀欀愀洀瀀 簀 䄀瀀瀀氀椀挀愀琀椀漀渀猀 䤀渀昀爀愀猀琀爀甀挀琀甀爀攀 䄀爀挀栀椀琀攀挀琀 簀 䰀愀ⴀ娀ⴀ䈀漀礀 䌀漀爀瀀漀爀愀琀攀ഀ਀⠀㜀㌀㐀⤀ ㌀㠀㐀ⴀ㘀㐀 ㌀ 簀 ㄀ⴀ㜀㌀㐀ⴀ㤀㄀㔀ⴀ㄀㐀㐀㐀 簀 匀挀漀琀⸀䬀

This looks like a bug in Outlook (at work) that I ran into over 3 years ago after enabling beta-support for UTF-8 encoding… The fun part was that the messages looked perfectly fine on my end, but were gibberish on the receiving ends only. Is that bug still present perhaps?

Or is it just my end that turns Scott's (and only Scott's) messages into gibberish Chinese?

Alban Hertroys
--
There is always an exception to always.

#5Ron
ronljohnsonjr@gmail.com
In reply to: Alban Hertroys (#4)
Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

Scot's email wasn't gibberish to me (reading from chrome). Spaces between
every character, and 99.9689% signature block (I did the math!), but all
English (and a giant png file).

On Wed, Aug 20, 2025 at 5:41 AM Alban Hertroys <haramrae@gmail.com> wrote:

On 19 Aug 2025, at 22:25, Scot Kreienkamp <Scot.Kreienkamp@la-z-boy.com>

wrote:

匀礀渀挀栀爀漀渀漀甀猀 挀漀洀洀椀琀㼀 伀爀 愀猀礀渀挀栀爀漀渀漀甀猀㼀ഀ਀ഀ਀ഀ਀匀挀漀琀 䬀爀攀椀攀渀欀愀洀瀀 簀 䄀瀀瀀氀椀挀愀琀椀漀渀猀

䤀渀昀爀愀猀琀爀甀挀琀甀爀攀 䄀爀挀栀椀琀攀挀琀 簀 䰀愀ⴀ娀ⴀ䈀漀礀 䌀漀爀瀀漀爀愀琀攀ഀ਀⠀㜀㌀㐀⤀ ㌀㠀㐀ⴀ㘀㐀 ㌀ 簀
㄀ⴀ㜀㌀㐀ⴀ㤀㄀㔀ⴀ㄀㐀㐀㐀 簀 匀挀漀琀⸀䬀

This looks like a bug in Outlook (at work) that I ran into over 3 years
ago after enabling beta-support for UTF-8 encoding… The fun part was that
the messages looked perfectly fine on my end, but were gibberish on the
receiving ends only. Is that bug still present perhaps?

Or is it just my end that turns Scott's (and only Scott's) messages into
gibberish Chinese?

Alban Hertroys
--
There is always an exception to always.

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

In reply to: Adrian Klaver (#2)
Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On Tue, Aug 19, 2025 at 11:39:03AM -0700, Adrian Klaver wrote:

Every now and then (usually every 3-5 minutes, but not through the whole
day), we see situations where every query suddently takes ~ 1 second.

Given the subject line, what you are reporting is happening on the replica,
correct?

Yes.

If so where is the replica relative to the primary, in terms of network
distance?

=$ ping -c 10 primary
reports:
10 packets transmitted, 10 received, 0% packet loss, time 9181ms
rtt min/avg/max/mdev = 0.942/0.956/0.991/0.012 ms

Also what are the 'hardware' specifications on the replica instance?

c8g.48xlarge ec2 instance. It is arm64, 192 cores, with 384 gb of ram.

As for storage, this is relatitvely slow, because this db is rather
small:
gp3 500gb volume, with 6000 iops. At no point is IO in any way close to
limits, the whole db fits easily in ram.

Best regards,

depesz

In reply to: Scot Kreienkamp (#3)
Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On Tue, Aug 19, 2025 at 08:25:26PM +0000, Scot Kreienkamp wrote:

Synchronous commit? Or asynchronous?

Isn't this potential problem on primary, and not replica? Anyway,
synchronous commit is set to "on", and "synchronous_standby_names" is
empty.

Best regards,

depesz

#8Pawel Veselov
pawel.veselov@gmail.com
In reply to: Alban Hertroys (#4)
Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

匀礀渀挀栀爀漀渀漀甀猀 挀漀洀洀椀琀㼀  伀爀 愀猀礀渀挀栀爀漀渀漀甀猀㼀ഀ਀ഀ਀ഀ਀匀挀漀琀 䬀爀攀椀攀渀欀愀洀瀀 簀 䄀瀀瀀氀椀挀愀琀椀漀渀猀 䤀渀昀爀愀猀琀爀甀挀琀甀爀攀 䄀爀挀栀椀琀攀挀琀 簀 䰀愀ⴀ娀ⴀ䈀漀礀 䌀漀爀瀀漀爀愀琀攀ഀ਀⠀㜀㌀㐀⤀ ㌀㠀㐀ⴀ㘀㐀 ㌀ 簀 ㄀ⴀ㜀㌀㐀ⴀ㤀㄀㔀ⴀ㄀㐀㐀㐀 簀 匀挀漀琀⸀䬀

This looks like a bug in Outlook (at work) that I ran into over 3 years ago after enabling beta-support for UTF-8 encoding… The fun part was that the messages looked perfectly fine on my end, but were gibberish on the receiving ends only. Is that bug still present perhaps?

Interesting.
Your email client probably incorrectly interprets UTF-16 encoding.
That's what the content charset is in Scott's email.
The first text character in Scott's email is "S" (U0053), but when
sending(?) it translates to 匀 (U5300).

Gmail web also is struggling with it, judging by the spaces I see
between the characters in the Scott's original. Those spaces are
replaced '0x00' chars from the first byte of each UTF-16 char.

Or is it just my end that turns Scott's (and only Scott's) messages into gibberish Chinese?

Alban Hertroys
--
There is always an exception to always.

--
With best of best regards
Pawel S. Veselov

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alban Hertroys (#4)
Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

Alban Hertroys <haramrae@gmail.com> writes:

Or is it just my end that turns Scott's (and only Scott's) messages into gibberish Chinese?

What it looks like from here is that the message body is in UTF16:
every other byte is zero. That gives my email client indigestion,
too.

regards, tom lane

#10Scot Kreienkamp
Scot.Kreienkamp@la-z-boy.com
In reply to: Alban Hertroys (#4)
RE: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

䄀瀀漀氀漀最椀攀猀 昀漀爀 琀栀攀 最椀戀戀攀爀椀猀栀 攀瘀攀爀礀漀渀攀⸀  伀甀琀氀漀漀欀 搀攀昀愀甀氀琀猀 琀漀 䠀吀䴀䰀Ⰰ 䤀 栀愀瘀攀渀✀琀 瀀漀猀琀攀搀 椀渀 瀀爀漀戀愀戀氀礀 洀漀爀攀 琀栀愀渀 愀 礀攀愀爀 愀渀搀 昀漀爀最漀琀 琀漀 猀攀琀 椀琀 琀漀 瀀氀愀椀渀 琀攀砀琀⸀  吀栀攀 猀椀最渀愀琀甀爀攀 䤀 挀愀渀✀琀 搀漀 愀渀礀琀栀椀渀最 愀戀漀甀琀Ⰰ 琀栀愀琀✀猀 愀甀琀漀洀愀琀椀挀愀氀氀礀 愀搀搀攀搀 愀昀琀攀爀 䤀 栀椀琀 猀攀渀搀⸀ഀ਀ഀ਀ഀ਀ഀ਀ഀ਀匀挀漀琀 䬀爀攀椀攀渀欀愀洀瀀 簀 䄀瀀瀀氀椀挀愀琀椀漀渀猀 䤀渀昀爀愀猀琀爀甀挀琀甀爀攀 䄀爀挀栀椀琀攀挀琀 簀 䰀愀ⴀ娀ⴀ䈀漀礀 䌀漀爀瀀漀爀愀琀攀ഀ਀伀渀攀 䰀愀ⴀ娀ⴀ䈀漀礀 䐀爀椀瘀攀 簀 䴀漀渀爀漀攀Ⰰ 䴀椀挀栀椀最愀渀 㐀㠀㄀㘀㈀ 簀 ∠ ⠀㜀㌀㐀⤀ ㌀㠀㐀ⴀ㘀㐀 ㌀ 簀 簀  ∠ ㄀ⴀ㜀㌀㐀ⴀ㤀㄀㔀ⴀ㄀㐀㐀㐀  簀  䔀洀愀椀氀㨀 匀挀漀琀⸀䬀爀攀椀攀渀欀愀洀瀀䀀氀愀ⴀ稀ⴀ戀漀礀⸀挀漀洀ഀ਀ഀ਀ഀ਀ഀ਀吀栀椀猀 洀攀猀猀愀最攀 椀猀 椀渀琀攀渀搀攀搀 漀渀氀礀 昀漀爀 琀栀攀 椀渀搀椀瘀椀搀甀愀氀 漀爀 攀渀琀椀琀礀 琀漀 眀栀椀挀栀 椀琀 椀猀 愀搀搀爀攀猀猀攀搀⸀ 䤀琀 洀愀礀 挀漀渀琀愀椀渀 瀀爀椀瘀椀氀攀最攀搀Ⰰ 挀漀渀昀椀搀攀渀琀椀愀氀 椀渀昀漀爀洀愀琀椀漀渀 眀栀椀挀栀 椀猀 攀砀攀洀瀀琀 昀爀漀洀 搀椀猀挀氀漀猀甀爀攀 甀渀搀攀爀 愀瀀瀀氀椀挀愀戀氀攀 氀愀眀猀⸀ 䤀昀 礀漀甀 愀爀攀 渀漀琀 琀栀攀 椀渀琀攀渀搀攀搀 爀攀挀椀瀀椀攀渀琀Ⰰ 礀漀甀 愀爀攀 猀琀爀椀挀琀氀礀 瀀爀漀栀椀戀椀琀攀搀 昀爀漀洀 搀椀猀猀攀洀椀渀愀琀椀渀最 漀爀 搀椀猀琀爀椀戀甀琀椀渀最 琀栀椀猀 椀渀昀漀爀洀愀琀椀漀渀 ⠀漀琀栀攀爀 琀栀愀渀 琀漀 琀栀攀 椀渀琀攀渀搀攀搀 爀攀挀椀瀀椀攀渀琀⤀ 漀爀 挀漀瀀礀椀渀最 琀栀椀猀 椀渀昀漀爀洀愀琀椀漀渀⸀ 䤀昀 礀漀甀 栀愀瘀攀 爀攀挀攀椀瘀攀搀 琀栀椀猀 挀漀洀洀甀渀椀挀愀琀椀漀渀 椀渀 攀爀爀漀爀Ⰰ 瀀氀攀愀猀攀 渀漀琀椀昀礀 甀猀 椀洀洀攀搀椀愀琀攀氀礀 戀礀 攀ⴀ洀愀椀氀 漀爀 戀礀 琀攀氀攀瀀栀漀渀攀 愀琀 琀栀攀 愀戀漀瘀攀 渀甀洀戀攀爀⸀ 吀栀愀渀欀 礀漀甀⸀ഀ਀

Attachments:

winmail.datapplication/ms-tnef; name=winmail.datDownload
#11Adrian Klaver
adrian.klaver@aklaver.com
In reply to: hubert depesz lubaczewski (#6)
Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On 8/20/25 04:32, hubert depesz lubaczewski wrote:

On Tue, Aug 19, 2025 at 11:39:03AM -0700, Adrian Klaver wrote:

Every now and then (usually every 3-5 minutes, but not through the whole
day), we see situations where every query suddently takes ~ 1 second.

Given the subject line, what you are reporting is happening on the replica,
correct?

Yes.

If so where is the replica relative to the primary, in terms of network
distance?

=$ ping -c 10 primary
reports:
10 packets transmitted, 10 received, 0% packet loss, time 9181ms
rtt min/avg/max/mdev = 0.942/0.956/0.991/0.012 ms

Also what are the 'hardware' specifications on the replica instance?

c8g.48xlarge ec2 instance. It is arm64, 192 cores, with 384 gb of ram.

As for storage, this is relatitvely slow, because this db is rather
small:
gp3 500gb volume, with 6000 iops. At no point is IO in any way close to
limits, the whole db fits easily in ram.

Hmm.

From initial post:

"For ~ 1 second there are no logs going to log (we usually have at 5-20
messages logged per second), no connection, nothing. And then we get
bunch (30+) messages with the same milisecond time."

Are the 30+ messages all coming in on one connection or multiple
connections?

Also to be clear these are statements that are being run on the replica
locally, correct?

Does the AWS monitoring indicate any issues?

Best regards,

depesz

--
Adrian Klaver
adrian.klaver@aklaver.com

In reply to: Adrian Klaver (#11)
Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On Wed, Aug 20, 2025 at 08:14:47AM -0700, Adrian Klaver wrote:

Hmm.

From initial post:

"For ~ 1 second there are no logs going to log (we usually have at 5-20
messages logged per second), no connection, nothing. And then we get
bunch (30+) messages with the same milisecond time."

Are the 30+ messages all coming in on one connection or multiple
connections?

Multiple connections.

Also to be clear these are statements that are being run on the replica
locally, correct?

What do you mean locally?

Application servers are *not* on the same server as the db. So no, they
are not local.

Does the AWS monitoring indicate any issues?

Nope. All looks "fine".

Same for our monitoring - CPU usage, iops, load.

Best regards,

depesz

#13Rob Sargent
robjsargent@gmail.com
In reply to: hubert depesz lubaczewski (#12)
Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On Aug 20, 2025, at 10:08 AM, hubert depesz lubaczewski <depesz@depesz.com> wrote:

On Wed, Aug 20, 2025 at 08:14:47AM -0700, Adrian Klaver wrote:

Hmm.

From initial post:

"For ~ 1 second there are no logs going to log (we usually have at 5-20
messages logged per second), no connection, nothing. And then we get
bunch (30+) messages with the same milisecond time."

Is that logger ms or db server ms? The latter seems unlikely to me.

#14Thom Brown
thom@linux.com
In reply to: hubert depesz lubaczewski (#12)
Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On Wed, 20 Aug 2025 at 17:08, hubert depesz lubaczewski
<depesz@depesz.com> wrote:

On Wed, Aug 20, 2025 at 08:14:47AM -0700, Adrian Klaver wrote:

Hmm.

From initial post:

"For ~ 1 second there are no logs going to log (we usually have at 5-20
messages logged per second), no connection, nothing. And then we get
bunch (30+) messages with the same milisecond time."

Are the 30+ messages all coming in on one connection or multiple
connections?

Multiple connections.

Also to be clear these are statements that are being run on the replica
locally, correct?

What do you mean locally?

Application servers are *not* on the same server as the db. So no, they
are not local.

Does the AWS monitoring indicate any issues?

Nope. All looks "fine".

Same for our monitoring - CPU usage, iops, load.

Best regards,

depesz

Do you have THP enabled? Can you use mpstat and see what %steal shows as?

Thom

#15Adrian Klaver
adrian.klaver@aklaver.com
In reply to: hubert depesz lubaczewski (#12)
Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On 8/20/25 09:08, hubert depesz lubaczewski wrote:

On Wed, Aug 20, 2025 at 08:14:47AM -0700, Adrian Klaver wrote:

Hmm.

From initial post:

"For ~ 1 second there are no logs going to log (we usually have at 5-20
messages logged per second), no connection, nothing. And then we get
bunch (30+) messages with the same milisecond time."

Are the 30+ messages all coming in on one connection or multiple
connections?

Multiple connections.

Also to be clear these are statements that are being run on the replica
locally, correct?

What do you mean locally?

I should have been clearer. Are the queries being run against the
replica or the primary?

Application servers are *not* on the same server as the db. So no, they
are not local.

Where are the application servers relative to the replica server?

How many applications servers are hitting the database?

Does the AWS monitoring indicate any issues?

Nope. All looks "fine".

Same for our monitoring - CPU usage, iops, load.

Best regards,

depesz

--
Adrian Klaver
adrian.klaver@aklaver.com

#16Alban Hertroys
haramrae@gmail.com
In reply to: Tom Lane (#9)
Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On 20 Aug 2025, at 14:03, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alban Hertroys <haramrae@gmail.com> writes:

Or is it just my end that turns Scott's (and only Scott's) messages into gibberish Chinese?

What it looks like from here is that the message body is in UTF16:
every other byte is zero. That gives my email client indigestion,
too.

Actually, I think it’s misreporting the message content as UTF-16, while it is in fact encoded as UTF-8. So the error appears to be on the sender side.

Apparently, Apple Mail.app is rather strict about following the reported encoding, while Gmail for example does some attempt to guess. That would explain the less terrible results with Gmail.

I can base64 decode the message just fine on the command line, which runs in a terminal emulator (Apple’s Terminal.app) that does handle UTF-8, but not UTF-16 by default.

Alban Hertroys
--
Illustratief voor het verschil tussen steek- en snijwapens:
Met mensen die een punthoofd hebben,
kun je goed de draak steken.
Met mensen die scherp zijn,
lukt dat juist slecht.

In reply to: Rob Sargent (#13)
Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On Wed, Aug 20, 2025 at 11:24:29AM -0600, Rob Sargent wrote:

On Wed, Aug 20, 2025 at 08:14:47AM -0700, Adrian Klaver wrote:

Hmm.

From initial post:

"For ~ 1 second there are no logs going to log (we usually have at 5-20
messages logged per second), no connection, nothing. And then we get
bunch (30+) messages with the same milisecond time."

Is that logger ms or db server ms? The latter seems unlikely to me.

Sorry, I don't understand your question.

I know it's unlikely, so it seems that something "froze" pg, for ~ 1s.
The question is how to diagnose/debug what it is.

Best regards,

depesz

In reply to: Thom Brown (#14)
Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On Wed, Aug 20, 2025 at 06:30:00PM +0100, Thom Brown wrote:

Do you have THP enabled? Can you use mpstat and see what %steal shows as?

Sorry, what is "THP"? I tried searching for "what is thp", and most
common search results are related to some chemical compound.

mpstat, yes, we have it. Let me quickly show what I get:

Two days ago, at ~ 10:10pm UTC I saw this in Pg logs:

# grep -oP '^2025-08-19 22:09:2\d\.\d+ UTC' postgresql-2025-08-19_220000.csv | uniq -c | grep -C3 -P '^\s*\d\d'
2 2025-08-19 22:09:29.084 UTC
1 2025-08-19 22:09:29.094 UTC
2 2025-08-19 22:09:29.097 UTC
70 2025-08-19 22:09:29.109 UTC
90 2025-08-19 22:09:29.110 UTC
6 2025-08-19 22:09:29.111 UTC
1 2025-08-19 22:09:29.153 UTC
1 2025-08-19 22:09:29.555 UTC

As you can see we have 70, and then 90 messages all logged with the same
timestamp. All of them (160) were "duration:x" lines.

At the time, mpstat (called using `mpstat 5`) looked like this:

CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
22:09:04 all 2.25 0.00 0.27 0.24 0.00 0.10 0.00 0.00 0.00 97.14
22:09:09 all 2.70 0.00 0.28 0.27 0.00 0.12 0.00 0.00 0.00 96.63
22:09:14 all 2.83 0.00 0.63 0.23 0.00 0.09 0.00 0.00 0.00 96.22
22:09:19 all 2.66 0.00 0.28 0.23 0.00 0.10 0.00 0.00 0.00 96.74
22:09:24 all 2.36 0.00 0.27 0.21 0.00 0.10 0.00 0.00 0.00 97.06
22:09:29 all 2.49 0.00 0.40 0.40 0.00 0.11 0.00 0.00 0.00 96.60
22:09:34 all 2.65 0.00 0.28 0.23 0.00 0.11 0.00 0.00 0.00 96.73
22:09:39 all 2.26 0.00 0.26 0.21 0.00 0.09 0.00 0.00 0.00 97.19
22:09:44 all 2.31 0.00 0.25 0.22 0.00 0.09 0.00 0.00 0.00 97.13
22:09:49 all 2.44 0.00 0.22 0.24 0.00 0.08 0.00 0.00 0.00 97.02
22:09:54 all 2.23 0.00 0.23 0.28 0.00 0.10 0.00 0.00 0.00 97.16
22:09:59 all 2.48 0.00 0.26 0.30 0.00 0.11 0.00 0.00 0.00 96.85
22:10:04 all 2.29 0.00 0.28 0.26 0.00 0.12 0.00 0.00 0.00 97.06
22:10:09 all 2.31 0.00 0.22 0.21 0.00 0.11 0.00 0.00 0.00 97.14
22:10:14 all 2.38 0.00 0.25 0.21 0.00 0.10 0.00 0.00 0.00 97.05
22:10:19 all 2.54 0.00 0.24 0.20 0.00 0.11 0.00 0.00 0.00 96.91
22:10:24 all 2.26 0.00 0.25 0.22 0.00 0.10 0.00 0.00 0.00 97.17
22:10:29 all 2.12 0.00 0.27 0.24 0.00 0.09 0.00 0.00 0.00 97.27
22:10:34 all 2.56 0.00 0.26 0.26 0.00 0.12 0.00 0.00 0.00 96.80
22:10:39 all 2.16 0.00 0.23 0.24 0.00 0.10 0.00 0.00 0.00 97.27
22:10:44 all 2.37 0.00 0.26 0.22 0.00 0.10 0.00 0.00 0.00 97.05
22:10:49 all 2.25 0.00 0.25 0.61 0.00 0.09 0.00 0.00 0.00 96.80
22:10:54 all 2.41 0.00 0.28 0.22 0.00 0.10 0.00 0.00 0.00 96.99
22:10:59 all 2.83 0.00 0.29 0.19 0.00 0.12 0.00 0.00 0.00 96.57

Best regards,

depesz

In reply to: Adrian Klaver (#15)
Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On Wed, Aug 20, 2025 at 10:45:13AM -0700, Adrian Klaver wrote:

On 8/20/25 09:08, hubert depesz lubaczewski wrote:

On Wed, Aug 20, 2025 at 08:14:47AM -0700, Adrian Klaver wrote:

Hmm.

From initial post:

"For ~ 1 second there are no logs going to log (we usually have at 5-20
messages logged per second), no connection, nothing. And then we get
bunch (30+) messages with the same milisecond time."
Are the 30+ messages all coming in on one connection or multiple
connections?

Multiple connections.

Also to be clear these are statements that are being run on the replica
locally, correct?

What do you mean locally?

I should have been clearer. Are the queries being run against the replica or
the primary?

All to replica. Primary has its own work, of course, but the problem
we're experiencing is on replicas.

Application servers are *not* on the same server as the db. So no, they
are not local.

Where are the application servers relative to the replica server?

Well, there is a lot of them. All are in AWS EC2, same region, various
availability zones. Generally we use 3 AZs, with more or less equal
split, so I'd say ~ 33% of app servers is within the same AZ.

How many applications servers are hitting the database?

To be honest, I'm not sure. I have visibility into dbs, and bouncers,
not really into Apps. I know that these are automatically dynamically
scaled, so number of app server is very varying.

I'd say anything from 40 to 200 app servers hit first layer of bouncers,
which we usually have 6-9 (2-3 per az).

These go to 2nd layer of bouncers, on the db server itself.

Best regards,

depesz

#20Thom Brown
thom@linux.com
In reply to: hubert depesz lubaczewski (#18)
Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On Thu, 21 Aug 2025 at 11:03, hubert depesz lubaczewski
<depesz@depesz.com> wrote:

On Wed, Aug 20, 2025 at 06:30:00PM +0100, Thom Brown wrote:

Do you have THP enabled? Can you use mpstat and see what %steal shows as?

Sorry, what is "THP"? I tried searching for "what is thp", and most
common search results are related to some chemical compound.

Ah, yeah I meant transparent hugepage:

cat /sys/kernel/mm/transparent_hugepage/enabled

This should show it being set as "never".

mpstat, yes, we have it. Let me quickly show what I get:

Two days ago, at ~ 10:10pm UTC I saw this in Pg logs:

# grep -oP '^2025-08-19 22:09:2\d\.\d+ UTC' postgresql-2025-08-19_220000.csv | uniq -c | grep -C3 -P '^\s*\d\d'
2 2025-08-19 22:09:29.084 UTC
1 2025-08-19 22:09:29.094 UTC
2 2025-08-19 22:09:29.097 UTC
70 2025-08-19 22:09:29.109 UTC
90 2025-08-19 22:09:29.110 UTC
6 2025-08-19 22:09:29.111 UTC
1 2025-08-19 22:09:29.153 UTC
1 2025-08-19 22:09:29.555 UTC

As you can see we have 70, and then 90 messages all logged with the same
timestamp. All of them (160) were "duration:x" lines.

At the time, mpstat (called using `mpstat 5`) looked like this:

CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
22:09:04 all 2.25 0.00 0.27 0.24 0.00 0.10 0.00 0.00 0.00 97.14
22:09:09 all 2.70 0.00 0.28 0.27 0.00 0.12 0.00 0.00 0.00 96.63
22:09:14 all 2.83 0.00 0.63 0.23 0.00 0.09 0.00 0.00 0.00 96.22
22:09:19 all 2.66 0.00 0.28 0.23 0.00 0.10 0.00 0.00 0.00 96.74
22:09:24 all 2.36 0.00 0.27 0.21 0.00 0.10 0.00 0.00 0.00 97.06
22:09:29 all 2.49 0.00 0.40 0.40 0.00 0.11 0.00 0.00 0.00 96.60
22:09:34 all 2.65 0.00 0.28 0.23 0.00 0.11 0.00 0.00 0.00 96.73
22:09:39 all 2.26 0.00 0.26 0.21 0.00 0.09 0.00 0.00 0.00 97.19
22:09:44 all 2.31 0.00 0.25 0.22 0.00 0.09 0.00 0.00 0.00 97.13
22:09:49 all 2.44 0.00 0.22 0.24 0.00 0.08 0.00 0.00 0.00 97.02
22:09:54 all 2.23 0.00 0.23 0.28 0.00 0.10 0.00 0.00 0.00 97.16
22:09:59 all 2.48 0.00 0.26 0.30 0.00 0.11 0.00 0.00 0.00 96.85
22:10:04 all 2.29 0.00 0.28 0.26 0.00 0.12 0.00 0.00 0.00 97.06
22:10:09 all 2.31 0.00 0.22 0.21 0.00 0.11 0.00 0.00 0.00 97.14
22:10:14 all 2.38 0.00 0.25 0.21 0.00 0.10 0.00 0.00 0.00 97.05
22:10:19 all 2.54 0.00 0.24 0.20 0.00 0.11 0.00 0.00 0.00 96.91
22:10:24 all 2.26 0.00 0.25 0.22 0.00 0.10 0.00 0.00 0.00 97.17
22:10:29 all 2.12 0.00 0.27 0.24 0.00 0.09 0.00 0.00 0.00 97.27
22:10:34 all 2.56 0.00 0.26 0.26 0.00 0.12 0.00 0.00 0.00 96.80
22:10:39 all 2.16 0.00 0.23 0.24 0.00 0.10 0.00 0.00 0.00 97.27
22:10:44 all 2.37 0.00 0.26 0.22 0.00 0.10 0.00 0.00 0.00 97.05
22:10:49 all 2.25 0.00 0.25 0.61 0.00 0.09 0.00 0.00 0.00 96.80
22:10:54 all 2.41 0.00 0.28 0.22 0.00 0.10 0.00 0.00 0.00 96.99
22:10:59 all 2.83 0.00 0.29 0.19 0.00 0.12 0.00 0.00 0.00 96.57

This output looks fine, so it doesn't show anything concerning, so
suggests the issue is somehow on the Postgres side.

Did you happen to poll pg_stat_activity at the time to see whether you
had lots of IPC waits? I'm wondering whether the storage layer is
freezing up for a moment.

Thom

In reply to: Thom Brown (#20)
#22Adrian Klaver
adrian.klaver@aklaver.com
In reply to: hubert depesz lubaczewski (#19)
In reply to: Adrian Klaver (#22)
#24Adrian Klaver
adrian.klaver@aklaver.com
In reply to: hubert depesz lubaczewski (#23)
In reply to: Adrian Klaver (#24)
#26Adrian Klaver
adrian.klaver@aklaver.com
In reply to: hubert depesz lubaczewski (#25)
#27Chris Wilson
chris+google@qwirx.com
In reply to: Adrian Klaver (#26)
In reply to: Adrian Klaver (#26)
In reply to: Chris Wilson (#27)
#30Adrian Klaver
adrian.klaver@aklaver.com
In reply to: hubert depesz lubaczewski (#28)
In reply to: Adrian Klaver (#30)
#32Pavel Stehule
pavel.stehule@gmail.com
In reply to: hubert depesz lubaczewski (#31)
In reply to: Pavel Stehule (#32)
#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: hubert depesz lubaczewski (#31)
In reply to: Tom Lane (#34)
#36Adrian Klaver
adrian.klaver@aklaver.com
In reply to: hubert depesz lubaczewski (#35)
In reply to: Adrian Klaver (#36)
#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: hubert depesz lubaczewski (#35)
In reply to: Tom Lane (#38)
In reply to: Tom Lane (#38)