Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

Started by hubert depesz lubaczewski10 months ago40 messagesgeneral

depesz@depesz.com

10 months ago

Hi,
we have following situation: Pg 14.17 (yes, I know, but it can't be
upgraded now/soon), on Ubuntu focal, in AWS cloud on EC2 server using
arm64 architecture.

All works, is fine.

Every now and then (usually every 3-5 minutes, but not through the whole
day), we see situations where every query suddently takes ~ 1 second.

I'm talkign about "prepare" for preparing statements. binds. also
"discard all".

We have logging enabled to csv log, which offers milisecond precicision.
For ~ 1 second there are no logs going to log (we usually have at 5-20
messages logged per second), no connection, nothing. And then we get
bunch (30+) messages with the same milisecond time.

And they all have insane durations - 800-1300ms for virtually anything.

After such second, everything works fine, without any problems.

Up to next case.

Unfortunately due to short duration of such things, and the fact that
literally *everything* is paused for this 1 second, it's hard to
debug/diagnose.

Servers have memory that is almost 2x total db size (200gb vs. 384gb of
ram), so disk shouldn't be an issue.

Aside from this, we don't see any other problems.

Any idea how to look at it, what to look for, to be able to diagnose the
issue?

We do use some savepoints, but it's hard to tell when exactly they
happen (we usually log only queries that take more than 250ms, and
queries that use savepoints are usually much faster).

The DB server in question has ~ 150 connections, and handles, at this
time, we had ~ 40-50 ktps.

Logging is set using:

name │ setting
═══════════════════════════════════╪════════════════════════════════
log_autovacuum_min_duration │ 0
log_checkpoints │ on
log_connections │ on
log_destination │ csvlog
log_directory │ /cache/postgres_logs
log_disconnections │ off
log_duration │ off
log_error_verbosity │ default
log_executor_stats │ off
log_file_mode │ 0600
log_filename │ postgresql-%Y-%m-%d_%H%M%S.log
log_hostname │ off
log_line_prefix │ db=%d,user=%u
log_lock_waits │ on
log_min_duration_sample │ 0
log_min_duration_statement │ 0
log_min_error_statement │ error
log_min_messages │ warning
log_parameter_max_length │ -1
log_parameter_max_length_on_error │ 0
log_parser_stats │ off
log_planner_stats │ off
log_recovery_conflict_waits │ off
log_replication_commands │ off
log_rotation_age │ 60
log_rotation_size │ 51200
log_statement │ none
log_statement_sample_rate │ 0.0001
log_statement_stats │ off
log_temp_files │ 0
log_timezone │ UTC
log_transaction_sample_rate │ 0
log_truncate_on_rotation │ off
logging_collector │ on

and it doesn't seem to show anything interesting. Checkpoints happen,
but they don't seem correlated in any way.

Any ideas?

Best regards,

depesz

Adrian Klaver

adrian.klaver@aklaver.com

10 months ago

In reply to: hubert depesz lubaczewski (#1)

Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On 8/19/25 10:21, hubert depesz lubaczewski wrote:

Hi,
we have following situation: Pg 14.17 (yes, I know, but it can't be
upgraded now/soon), on Ubuntu focal, in AWS cloud on EC2 server using
arm64 architecture.

All works, is fine.

Every now and then (usually every 3-5 minutes, but not through the whole
day), we see situations where every query suddently takes ~ 1 second.

Given the subject line, what you are reporting is happening on the
replica, correct?

If so where is the replica relative to the primary, in terms of network
distance?

Also what are the 'hardware' specifications on the replica instance?

--
Adrian Klaver
adrian.klaver@aklaver.com

Scot Kreienkamp

Scot.Kreienkamp@la-z-boy.com

10 months ago

In reply to: Adrian Klaver (#2)

RE: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

匀礀渀挀栀爀漀渀漀甀猀挀漀洀洀椀琀㼀伀爀愀猀礀渀挀栀爀漀渀漀甀猀㼀ഀ਀ഀ਀ഀ਀匀挀漀琀䬀爀攀椀攀渀欀愀洀瀀簀䄀瀀瀀氀椀挀愀琀椀漀渀猀䤀渀昀爀愀猀琀爀甀挀琀甀爀攀䄀爀挀栀椀琀攀挀琀簀䰀愀ⴀ娀ⴀ䈀漀礀䌀漀爀瀀漀爀愀琀攀ഀ਀⠀㜀㌀㐀⤀ ㌀㠀㐀ⴀ㘀㐀　㌀簀㄀ⴀ㜀㌀㐀ⴀ㤀㄀㔀ⴀ㄀㐀㐀㐀簀匀挀漀琀⸀䬀爀攀椀攀渀欀愀洀瀀䀀氀愀ⴀ稀ⴀ戀漀礀⸀挀漀洀ഀ਀伀渀攀䰀愀ⴀ娀ⴀ䈀漀礀䐀爀椀瘀攀簀䴀漀渀爀漀攀Ⰰ 䴀椀挀栀椀最愀渀㐀㠀㄀㘀㈀簀氀愀ⴀ稀ⴀ戀漀礀⸀挀漀洀㰀栀琀琀瀀㨀⼀⼀眀眀眀⸀氀愀ⴀ稀ⴀ戀漀礀⸀挀漀洀⼀㸀ഀ਀ 昀愀挀攀戀漀漀欀⸀挀漀洀⼀氀愀稀戀漀礀㰀栀琀琀瀀㨀⼀⼀昀愀挀攀戀漀漀欀⸀挀漀洀⼀氀愀稀戀漀礀㸀簀椀渀猀琀愀最爀愀洀⸀挀漀洀⼀氀愀稀戀漀礀㰀栀琀琀瀀猀㨀⼀⼀椀渀猀琀愀最爀愀洀⸀挀漀洀⼀氀愀稀戀漀礀㸀簀礀漀甀琀甀戀攀⸀挀漀洀⼀氀愀稀戀漀礀㰀栀琀琀瀀㨀⼀⼀礀漀甀琀甀戀攀⸀挀漀洀⼀氀愀稀戀漀礀㸀ഀ਀ഀ਀ഀ਀嬀挀椀搀㨀氀愀稀戀漀礀开㈀　㈀㐀开椀渀挀开渀愀瘀礀开瀀洀猀㈀㄀㠀㤀开爀最戀开爀攀瘀㈀　㈀㔀开攀㐀　愀㤀㐀昀㈀ⴀ攀㌀㐀㐀ⴀ㐀愀㐀愀ⴀ愀　㈀挀ⴀ昀挀㌀㄀㤀㤀㘀攀㄀㈀㜀挀⸀瀀渀最崀ഀ਀ഀ਀吀栀椀猀洀攀猀猀愀最攀椀猀椀渀琀攀渀搀攀搀漀渀氀礀昀漀爀琀栀攀椀渀搀椀瘀椀搀甀愀氀漀爀攀渀琀椀琀礀琀漀眀栀椀挀栀椀琀椀猀愀搀搀爀攀猀猀攀搀⸀ 䤀琀洀愀礀挀漀渀琀愀椀渀瀀爀椀瘀椀氀攀最攀搀Ⰰ 挀漀渀昀椀搀攀渀琀椀愀氀椀渀昀漀爀洀愀琀椀漀渀眀栀椀挀栀椀猀攀砀攀洀瀀琀昀爀漀洀搀椀猀挀氀漀猀甀爀攀甀渀搀攀爀愀瀀瀀氀椀挀愀戀氀攀氀愀眀猀⸀ 䤀昀礀漀甀愀爀攀渀漀琀琀栀攀椀渀琀攀渀搀攀搀爀攀挀椀瀀椀攀渀琀Ⰰ 礀漀甀愀爀攀猀琀爀椀挀琀氀礀瀀爀漀栀椀戀椀琀攀搀昀爀漀洀搀椀猀猀攀洀椀渀愀琀椀渀最漀爀搀椀猀琀爀椀戀甀琀椀渀最琀栀椀猀椀渀昀漀爀洀愀琀椀漀渀 ⠀漀琀栀攀爀琀栀愀渀琀漀琀栀攀椀渀琀攀渀搀攀搀爀攀挀椀瀀椀攀渀琀⤀ 漀爀挀漀瀀礀椀渀最琀栀椀猀椀渀昀漀爀洀愀琀椀漀渀⸀ 䤀昀礀漀甀栀愀瘀攀爀攀挀攀椀瘀攀搀琀栀椀猀挀漀洀洀甀渀椀挀愀琀椀漀渀椀渀攀爀爀漀爀Ⰰ 瀀氀攀愀猀攀渀漀琀椀昀礀甀猀椀洀洀攀搀椀愀琀攀氀礀戀礀攀ⴀ洀愀椀氀漀爀戀礀琀攀氀攀瀀栀漀渀攀愀琀琀栀攀愀戀漀瘀攀渀甀洀戀攀爀⸀ 吀栀愀渀欀礀漀甀⸀ഀ਀

Alban Hertroys

haramrae@gmail.com

10 months ago

In reply to: Scot Kreienkamp (#3)

Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On 19 Aug 2025, at 22:25, Scot Kreienkamp <Scot.Kreienkamp@la-z-boy.com> wrote:

匀礀渀挀栀爀漀渀漀甀猀挀漀洀洀椀琀㼀伀爀愀猀礀渀挀栀爀漀渀漀甀猀㼀ഀ਀ഀ਀ഀ਀匀挀漀琀䬀爀攀椀攀渀欀愀洀瀀簀䄀瀀瀀氀椀挀愀琀椀漀渀猀䤀渀昀爀愀猀琀爀甀挀琀甀爀攀䄀爀挀栀椀琀攀挀琀簀䰀愀ⴀ娀ⴀ䈀漀礀䌀漀爀瀀漀爀愀琀攀ഀ਀⠀㜀㌀㐀⤀ ㌀㠀㐀ⴀ㘀㐀　㌀簀㄀ⴀ㜀㌀㐀ⴀ㤀㄀㔀ⴀ㄀㐀㐀㐀簀匀挀漀琀⸀䬀

This looks like a bug in Outlook (at work) that I ran into over 3 years ago after enabling beta-support for UTF-8 encoding… The fun part was that the messages looked perfectly fine on my end, but were gibberish on the receiving ends only. Is that bug still present perhaps?

Or is it just my end that turns Scott's (and only Scott's) messages into gibberish Chinese?

Alban Hertroys
--
There is always an exception to always.

Ron

ronljohnsonjr@gmail.com

10 months ago

In reply to: Alban Hertroys (#4)

Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

Scot's email wasn't gibberish to me (reading from chrome). Spaces between
every character, and 99.9689% signature block (I did the math!), but all
English (and a giant png file).

On Wed, Aug 20, 2025 at 5:41 AM Alban Hertroys <haramrae@gmail.com> wrote:

On 19 Aug 2025, at 22:25, Scot Kreienkamp <Scot.Kreienkamp@la-z-boy.com>

wrote:

匀礀渀挀栀爀漀渀漀甀猀挀漀洀洀椀琀㼀伀爀愀猀礀渀挀栀爀漀渀漀甀猀㼀ഀ਀ഀ਀ഀ਀匀挀漀琀䬀爀攀椀攀渀欀愀洀瀀簀䄀瀀瀀氀椀挀愀琀椀漀渀猀

䤀渀昀爀愀猀琀爀甀挀琀甀爀攀䄀爀挀栀椀琀攀挀琀簀䰀愀ⴀ娀ⴀ䈀漀礀䌀漀爀瀀漀爀愀琀攀ഀ਀⠀㜀㌀㐀⤀ ㌀㠀㐀ⴀ㘀㐀㌀簀
㄀ⴀ㜀㌀㐀ⴀ㤀㄀㔀ⴀ㄀㐀㐀㐀簀匀挀漀琀⸀䬀

This looks like a bug in Outlook (at work) that I ran into over 3 years
ago after enabling beta-support for UTF-8 encoding… The fun part was that
the messages looked perfectly fine on my end, but were gibberish on the
receiving ends only. Is that bug still present perhaps?

Or is it just my end that turns Scott's (and only Scott's) messages into
gibberish Chinese?

Alban Hertroys
--
There is always an exception to always.

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

hubert depesz lubaczewski

depesz@depesz.com

10 months ago

In reply to: Adrian Klaver (#2)

Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On Tue, Aug 19, 2025 at 11:39:03AM -0700, Adrian Klaver wrote:

Every now and then (usually every 3-5 minutes, but not through the whole
day), we see situations where every query suddently takes ~ 1 second.

Given the subject line, what you are reporting is happening on the replica,
correct?

Yes.

If so where is the replica relative to the primary, in terms of network
distance?

=$ ping -c 10 primary
reports:
10 packets transmitted, 10 received, 0% packet loss, time 9181ms
rtt min/avg/max/mdev = 0.942/0.956/0.991/0.012 ms

Also what are the 'hardware' specifications on the replica instance?

c8g.48xlarge ec2 instance. It is arm64, 192 cores, with 384 gb of ram.

As for storage, this is relatitvely slow, because this db is rather
small:
gp3 500gb volume, with 6000 iops. At no point is IO in any way close to
limits, the whole db fits easily in ram.

Best regards,

depesz

hubert depesz lubaczewski

depesz@depesz.com

10 months ago

In reply to: Scot Kreienkamp (#3)

Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On Tue, Aug 19, 2025 at 08:25:26PM +0000, Scot Kreienkamp wrote:

Synchronous commit? Or asynchronous?

Isn't this potential problem on primary, and not replica? Anyway,
synchronous commit is set to "on", and "synchronous_standby_names" is
empty.

Best regards,

depesz

Pawel Veselov

pawel.veselov@gmail.com

10 months ago

In reply to: Alban Hertroys (#4)

Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

匀礀渀挀栀爀漀渀漀甀猀挀漀洀洀椀琀㼀伀爀愀猀礀渀挀栀爀漀渀漀甀猀㼀ഀ਀ഀ਀ഀ਀匀挀漀琀䬀爀攀椀攀渀欀愀洀瀀簀䄀瀀瀀氀椀挀愀琀椀漀渀猀䤀渀昀爀愀猀琀爀甀挀琀甀爀攀䄀爀挀栀椀琀攀挀琀簀䰀愀ⴀ娀ⴀ䈀漀礀䌀漀爀瀀漀爀愀琀攀ഀ਀⠀㜀㌀㐀⤀ ㌀㠀㐀ⴀ㘀㐀　㌀簀㄀ⴀ㜀㌀㐀ⴀ㤀㄀㔀ⴀ㄀㐀㐀㐀簀匀挀漀琀⸀䬀

This looks like a bug in Outlook (at work) that I ran into over 3 years ago after enabling beta-support for UTF-8 encoding… The fun part was that the messages looked perfectly fine on my end, but were gibberish on the receiving ends only. Is that bug still present perhaps?

Interesting.
Your email client probably incorrectly interprets UTF-16 encoding.
That's what the content charset is in Scott's email.
The first text character in Scott's email is "S" (U0053), but when
sending(?) it translates to 匀 (U5300).

Gmail web also is struggling with it, judging by the spaces I see
between the characters in the Scott's original. Those spaces are
replaced '0x00' chars from the first byte of each UTF-16 char.

Or is it just my end that turns Scott's (and only Scott's) messages into gibberish Chinese?

Alban Hertroys
--
There is always an exception to always.

--
With best of best regards
Pawel S. Veselov

Tom Lane

tgl@sss.pgh.pa.us

10 months ago

In reply to: Alban Hertroys (#4)

Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

Alban Hertroys <haramrae@gmail.com> writes:

Or is it just my end that turns Scott's (and only Scott's) messages into gibberish Chinese?

What it looks like from here is that the message body is in UTF16:
every other byte is zero. That gives my email client indigestion,
too.

regards, tom lane

#10

Scot Kreienkamp

Scot.Kreienkamp@la-z-boy.com

10 months ago

In reply to: Alban Hertroys (#4)

RE: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

䄀瀀漀氀漀最椀攀猀昀漀爀琀栀攀最椀戀戀攀爀椀猀栀攀瘀攀爀礀漀渀攀⸀ 伀甀琀氀漀漀欀搀攀昀愀甀氀琀猀琀漀䠀吀䴀䰀Ⰰ 䤀栀愀瘀攀渀✀琀瀀漀猀琀攀搀椀渀瀀爀漀戀愀戀氀礀洀漀爀攀琀栀愀渀愀礀攀愀爀愀渀搀昀漀爀最漀琀琀漀猀攀琀椀琀琀漀瀀氀愀椀渀琀攀砀琀⸀ 吀栀攀猀椀最渀愀琀甀爀攀䤀挀愀渀✀琀搀漀愀渀礀琀栀椀渀最愀戀漀甀琀Ⰰ 琀栀愀琀✀猀愀甀琀漀洀愀琀椀挀愀氀氀礀愀搀搀攀搀愀昀琀攀爀䤀栀椀琀猀攀渀搀⸀ഀ਀ഀ਀ഀ਀ഀ਀ഀ਀匀挀漀琀䬀爀攀椀攀渀欀愀洀瀀簀䄀瀀瀀氀椀挀愀琀椀漀渀猀䤀渀昀爀愀猀琀爀甀挀琀甀爀攀䄀爀挀栀椀琀攀挀琀簀䰀愀ⴀ娀ⴀ䈀漀礀䌀漀爀瀀漀爀愀琀攀ഀ਀伀渀攀䰀愀ⴀ娀ⴀ䈀漀礀䐀爀椀瘀攀簀䴀漀渀爀漀攀Ⰰ 䴀椀挀栀椀最愀渀㐀㠀㄀㘀㈀簀 ∠ ⠀㜀㌀㐀⤀ ㌀㠀㐀ⴀ㘀㐀　㌀簀簀 ∠ ㄀ⴀ㜀㌀㐀ⴀ㤀㄀㔀ⴀ㄀㐀㐀㐀簀䔀洀愀椀氀㨀匀挀漀琀⸀䬀爀攀椀攀渀欀愀洀瀀䀀氀愀ⴀ稀ⴀ戀漀礀⸀挀漀洀ഀ਀ഀ਀ഀ਀ഀ਀吀栀椀猀洀攀猀猀愀最攀椀猀椀渀琀攀渀搀攀搀漀渀氀礀昀漀爀琀栀攀椀渀搀椀瘀椀搀甀愀氀漀爀攀渀琀椀琀礀琀漀眀栀椀挀栀椀琀椀猀愀搀搀爀攀猀猀攀搀⸀ 䤀琀洀愀礀挀漀渀琀愀椀渀瀀爀椀瘀椀氀攀最攀搀Ⰰ 挀漀渀昀椀搀攀渀琀椀愀氀椀渀昀漀爀洀愀琀椀漀渀眀栀椀挀栀椀猀攀砀攀洀瀀琀昀爀漀洀搀椀猀挀氀漀猀甀爀攀甀渀搀攀爀愀瀀瀀氀椀挀愀戀氀攀氀愀眀猀⸀ 䤀昀礀漀甀愀爀攀渀漀琀琀栀攀椀渀琀攀渀搀攀搀爀攀挀椀瀀椀攀渀琀Ⰰ 礀漀甀愀爀攀猀琀爀椀挀琀氀礀瀀爀漀栀椀戀椀琀攀搀昀爀漀洀搀椀猀猀攀洀椀渀愀琀椀渀最漀爀搀椀猀琀爀椀戀甀琀椀渀最琀栀椀猀椀渀昀漀爀洀愀琀椀漀渀 ⠀漀琀栀攀爀琀栀愀渀琀漀琀栀攀椀渀琀攀渀搀攀搀爀攀挀椀瀀椀攀渀琀⤀ 漀爀挀漀瀀礀椀渀最琀栀椀猀椀渀昀漀爀洀愀琀椀漀渀⸀ 䤀昀礀漀甀栀愀瘀攀爀攀挀攀椀瘀攀搀琀栀椀猀挀漀洀洀甀渀椀挀愀琀椀漀渀椀渀攀爀爀漀爀Ⰰ 瀀氀攀愀猀攀渀漀琀椀昀礀甀猀椀洀洀攀搀椀愀琀攀氀礀戀礀攀ⴀ洀愀椀氀漀爀戀礀琀攀氀攀瀀栀漀渀攀愀琀琀栀攀愀戀漀瘀攀渀甀洀戀攀爀⸀ 吀栀愀渀欀礀漀甀⸀ഀ਀

#11

Adrian Klaver

adrian.klaver@aklaver.com

10 months ago

In reply to: hubert depesz lubaczewski (#6)

Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On 8/20/25 04:32, hubert depesz lubaczewski wrote:

On Tue, Aug 19, 2025 at 11:39:03AM -0700, Adrian Klaver wrote:

Every now and then (usually every 3-5 minutes, but not through the whole
day), we see situations where every query suddently takes ~ 1 second.

Given the subject line, what you are reporting is happening on the replica,
correct?

Yes.

If so where is the replica relative to the primary, in terms of network
distance?

=$ ping -c 10 primary
reports:
10 packets transmitted, 10 received, 0% packet loss, time 9181ms
rtt min/avg/max/mdev = 0.942/0.956/0.991/0.012 ms

Also what are the 'hardware' specifications on the replica instance?

c8g.48xlarge ec2 instance. It is arm64, 192 cores, with 384 gb of ram.

As for storage, this is relatitvely slow, because this db is rather
small:
gp3 500gb volume, with 6000 iops. At no point is IO in any way close to
limits, the whole db fits easily in ram.

Hmm.

From initial post:

"For ~ 1 second there are no logs going to log (we usually have at 5-20
messages logged per second), no connection, nothing. And then we get
bunch (30+) messages with the same milisecond time."

Are the 30+ messages all coming in on one connection or multiple
connections?

Also to be clear these are statements that are being run on the replica
locally, correct?

Does the AWS monitoring indicate any issues?

Best regards,

depesz

--
Adrian Klaver
adrian.klaver@aklaver.com

#12

hubert depesz lubaczewski

depesz@depesz.com

10 months ago

In reply to: Adrian Klaver (#11)

Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On Wed, Aug 20, 2025 at 08:14:47AM -0700, Adrian Klaver wrote:

Hmm.

From initial post:

"For ~ 1 second there are no logs going to log (we usually have at 5-20
messages logged per second), no connection, nothing. And then we get
bunch (30+) messages with the same milisecond time."

Are the 30+ messages all coming in on one connection or multiple
connections?

Multiple connections.

Also to be clear these are statements that are being run on the replica
locally, correct?

What do you mean locally?

Application servers are *not* on the same server as the db. So no, they
are not local.

Does the AWS monitoring indicate any issues?

Nope. All looks "fine".

Same for our monitoring - CPU usage, iops, load.

Best regards,

depesz

#13

Rob Sargent

robjsargent@gmail.com

10 months ago

In reply to: hubert depesz lubaczewski (#12)

Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On Aug 20, 2025, at 10:08 AM, hubert depesz lubaczewski <depesz@depesz.com> wrote:

On Wed, Aug 20, 2025 at 08:14:47AM -0700, Adrian Klaver wrote:

Hmm.

From initial post:

"For ~ 1 second there are no logs going to log (we usually have at 5-20
messages logged per second), no connection, nothing. And then we get
bunch (30+) messages with the same milisecond time."

Is that logger ms or db server ms? The latter seems unlikely to me.

#14

Thom Brown

thom@linux.com

10 months ago

In reply to: hubert depesz lubaczewski (#12)

Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On Wed, 20 Aug 2025 at 17:08, hubert depesz lubaczewski
<depesz@depesz.com> wrote:

On Wed, Aug 20, 2025 at 08:14:47AM -0700, Adrian Klaver wrote:

Hmm.

From initial post:

"For ~ 1 second there are no logs going to log (we usually have at 5-20
messages logged per second), no connection, nothing. And then we get
bunch (30+) messages with the same milisecond time."

Are the 30+ messages all coming in on one connection or multiple
connections?

Multiple connections.

Also to be clear these are statements that are being run on the replica
locally, correct?

What do you mean locally?

Application servers are *not* on the same server as the db. So no, they
are not local.

Does the AWS monitoring indicate any issues?

Nope. All looks "fine".

Same for our monitoring - CPU usage, iops, load.

Best regards,

depesz

Do you have THP enabled? Can you use mpstat and see what %steal shows as?

Thom

#15

Adrian Klaver

adrian.klaver@aklaver.com

10 months ago

In reply to: hubert depesz lubaczewski (#12)

Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On 8/20/25 09:08, hubert depesz lubaczewski wrote:

On Wed, Aug 20, 2025 at 08:14:47AM -0700, Adrian Klaver wrote:

Hmm.

From initial post:

"For ~ 1 second there are no logs going to log (we usually have at 5-20
messages logged per second), no connection, nothing. And then we get
bunch (30+) messages with the same milisecond time."

Are the 30+ messages all coming in on one connection or multiple
connections?

Multiple connections.

Also to be clear these are statements that are being run on the replica
locally, correct?

What do you mean locally?

I should have been clearer. Are the queries being run against the
replica or the primary?

Application servers are *not* on the same server as the db. So no, they
are not local.

Where are the application servers relative to the replica server?

How many applications servers are hitting the database?

Does the AWS monitoring indicate any issues?

Nope. All looks "fine".

Same for our monitoring - CPU usage, iops, load.

Best regards,

depesz

--
Adrian Klaver
adrian.klaver@aklaver.com

#16

Alban Hertroys

haramrae@gmail.com

10 months ago

In reply to: Tom Lane (#9)

Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On 20 Aug 2025, at 14:03, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alban Hertroys <haramrae@gmail.com> writes:

Or is it just my end that turns Scott's (and only Scott's) messages into gibberish Chinese?

What it looks like from here is that the message body is in UTF16:
every other byte is zero. That gives my email client indigestion,
too.

Actually, I think it’s misreporting the message content as UTF-16, while it is in fact encoded as UTF-8. So the error appears to be on the sender side.

Apparently, Apple Mail.app is rather strict about following the reported encoding, while Gmail for example does some attempt to guess. That would explain the less terrible results with Gmail.

I can base64 decode the message just fine on the command line, which runs in a terminal emulator (Apple’s Terminal.app) that does handle UTF-8, but not UTF-16 by default.

Alban Hertroys
--
Illustratief voor het verschil tussen steek- en snijwapens:
Met mensen die een punthoofd hebben,
kun je goed de draak steken.
Met mensen die scherp zijn,
lukt dat juist slecht.

#17

hubert depesz lubaczewski

depesz@depesz.com

10 months ago

In reply to: Rob Sargent (#13)

Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On Wed, Aug 20, 2025 at 11:24:29AM -0600, Rob Sargent wrote:

On Wed, Aug 20, 2025 at 08:14:47AM -0700, Adrian Klaver wrote:

Hmm.

From initial post:

"For ~ 1 second there are no logs going to log (we usually have at 5-20
messages logged per second), no connection, nothing. And then we get
bunch (30+) messages with the same milisecond time."

Is that logger ms or db server ms? The latter seems unlikely to me.

Sorry, I don't understand your question.

I know it's unlikely, so it seems that something "froze" pg, for ~ 1s.
The question is how to diagnose/debug what it is.

Best regards,

depesz

#18

hubert depesz lubaczewski

depesz@depesz.com

10 months ago

In reply to: Thom Brown (#14)

Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On Wed, Aug 20, 2025 at 06:30:00PM +0100, Thom Brown wrote:

Do you have THP enabled? Can you use mpstat and see what %steal shows as?

Sorry, what is "THP"? I tried searching for "what is thp", and most
common search results are related to some chemical compound.

mpstat, yes, we have it. Let me quickly show what I get:

Two days ago, at ~ 10:10pm UTC I saw this in Pg logs:

# grep -oP '^2025-08-19 22:09:2\d\.\d+ UTC' postgresql-2025-08-19_220000.csv | uniq -c | grep -C3 -P '^\s*\d\d'
2 2025-08-19 22:09:29.084 UTC
1 2025-08-19 22:09:29.094 UTC
2 2025-08-19 22:09:29.097 UTC
70 2025-08-19 22:09:29.109 UTC
90 2025-08-19 22:09:29.110 UTC
6 2025-08-19 22:09:29.111 UTC
1 2025-08-19 22:09:29.153 UTC
1 2025-08-19 22:09:29.555 UTC

As you can see we have 70, and then 90 messages all logged with the same
timestamp. All of them (160) were "duration:x" lines.

At the time, mpstat (called using `mpstat 5`) looked like this:

CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
22:09:04 all 2.25 0.00 0.27 0.24 0.00 0.10 0.00 0.00 0.00 97.14
22:09:09 all 2.70 0.00 0.28 0.27 0.00 0.12 0.00 0.00 0.00 96.63
22:09:14 all 2.83 0.00 0.63 0.23 0.00 0.09 0.00 0.00 0.00 96.22
22:09:19 all 2.66 0.00 0.28 0.23 0.00 0.10 0.00 0.00 0.00 96.74
22:09:24 all 2.36 0.00 0.27 0.21 0.00 0.10 0.00 0.00 0.00 97.06
22:09:29 all 2.49 0.00 0.40 0.40 0.00 0.11 0.00 0.00 0.00 96.60
22:09:34 all 2.65 0.00 0.28 0.23 0.00 0.11 0.00 0.00 0.00 96.73
22:09:39 all 2.26 0.00 0.26 0.21 0.00 0.09 0.00 0.00 0.00 97.19
22:09:44 all 2.31 0.00 0.25 0.22 0.00 0.09 0.00 0.00 0.00 97.13
22:09:49 all 2.44 0.00 0.22 0.24 0.00 0.08 0.00 0.00 0.00 97.02
22:09:54 all 2.23 0.00 0.23 0.28 0.00 0.10 0.00 0.00 0.00 97.16
22:09:59 all 2.48 0.00 0.26 0.30 0.00 0.11 0.00 0.00 0.00 96.85
22:10:04 all 2.29 0.00 0.28 0.26 0.00 0.12 0.00 0.00 0.00 97.06
22:10:09 all 2.31 0.00 0.22 0.21 0.00 0.11 0.00 0.00 0.00 97.14
22:10:14 all 2.38 0.00 0.25 0.21 0.00 0.10 0.00 0.00 0.00 97.05
22:10:19 all 2.54 0.00 0.24 0.20 0.00 0.11 0.00 0.00 0.00 96.91
22:10:24 all 2.26 0.00 0.25 0.22 0.00 0.10 0.00 0.00 0.00 97.17
22:10:29 all 2.12 0.00 0.27 0.24 0.00 0.09 0.00 0.00 0.00 97.27
22:10:34 all 2.56 0.00 0.26 0.26 0.00 0.12 0.00 0.00 0.00 96.80
22:10:39 all 2.16 0.00 0.23 0.24 0.00 0.10 0.00 0.00 0.00 97.27
22:10:44 all 2.37 0.00 0.26 0.22 0.00 0.10 0.00 0.00 0.00 97.05
22:10:49 all 2.25 0.00 0.25 0.61 0.00 0.09 0.00 0.00 0.00 96.80
22:10:54 all 2.41 0.00 0.28 0.22 0.00 0.10 0.00 0.00 0.00 96.99
22:10:59 all 2.83 0.00 0.29 0.19 0.00 0.12 0.00 0.00 0.00 96.57

Best regards,

depesz

#19

hubert depesz lubaczewski

depesz@depesz.com

10 months ago

In reply to: Adrian Klaver (#15)

Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On Wed, Aug 20, 2025 at 10:45:13AM -0700, Adrian Klaver wrote:

On 8/20/25 09:08, hubert depesz lubaczewski wrote:

On Wed, Aug 20, 2025 at 08:14:47AM -0700, Adrian Klaver wrote:

Hmm.

From initial post:

"For ~ 1 second there are no logs going to log (we usually have at 5-20
messages logged per second), no connection, nothing. And then we get
bunch (30+) messages with the same milisecond time."
Are the 30+ messages all coming in on one connection or multiple
connections?

Multiple connections.

Also to be clear these are statements that are being run on the replica
locally, correct?

What do you mean locally?

I should have been clearer. Are the queries being run against the replica or
the primary?

All to replica. Primary has its own work, of course, but the problem
we're experiencing is on replicas.

Application servers are *not* on the same server as the db. So no, they
are not local.

Where are the application servers relative to the replica server?

Well, there is a lot of them. All are in AWS EC2, same region, various
availability zones. Generally we use 3 AZs, with more or less equal
split, so I'd say ~ 33% of app servers is within the same AZ.

How many applications servers are hitting the database?

To be honest, I'm not sure. I have visibility into dbs, and bouncers,
not really into Apps. I know that these are automatically dynamically
scaled, so number of app server is very varying.

I'd say anything from 40 to 200 app servers hit first layer of bouncers,
which we usually have 6-9 (2-3 per az).

These go to 2nd layer of bouncers, on the db server itself.

Best regards,

depesz

#20

Thom Brown

thom@linux.com

10 months ago

In reply to: hubert depesz lubaczewski (#18)

Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

On Thu, 21 Aug 2025 at 11:03, hubert depesz lubaczewski
<depesz@depesz.com> wrote:

On Wed, Aug 20, 2025 at 06:30:00PM +0100, Thom Brown wrote:

Do you have THP enabled? Can you use mpstat and see what %steal shows as?

Sorry, what is "THP"? I tried searching for "what is thp", and most
common search results are related to some chemical compound.

Ah, yeah I meant transparent hugepage:

cat /sys/kernel/mm/transparent_hugepage/enabled

This should show it being set as "never".

mpstat, yes, we have it. Let me quickly show what I get:

Two days ago, at ~ 10:10pm UTC I saw this in Pg logs:

# grep -oP '^2025-08-19 22:09:2\d\.\d+ UTC' postgresql-2025-08-19_220000.csv | uniq -c | grep -C3 -P '^\s*\d\d'
2 2025-08-19 22:09:29.084 UTC
1 2025-08-19 22:09:29.094 UTC
2 2025-08-19 22:09:29.097 UTC
70 2025-08-19 22:09:29.109 UTC
90 2025-08-19 22:09:29.110 UTC
6 2025-08-19 22:09:29.111 UTC
1 2025-08-19 22:09:29.153 UTC
1 2025-08-19 22:09:29.555 UTC

As you can see we have 70, and then 90 messages all logged with the same
timestamp. All of them (160) were "duration:x" lines.

At the time, mpstat (called using `mpstat 5`) looked like this:

CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
22:09:04 all 2.25 0.00 0.27 0.24 0.00 0.10 0.00 0.00 0.00 97.14
22:09:09 all 2.70 0.00 0.28 0.27 0.00 0.12 0.00 0.00 0.00 96.63
22:09:14 all 2.83 0.00 0.63 0.23 0.00 0.09 0.00 0.00 0.00 96.22
22:09:19 all 2.66 0.00 0.28 0.23 0.00 0.10 0.00 0.00 0.00 96.74
22:09:24 all 2.36 0.00 0.27 0.21 0.00 0.10 0.00 0.00 0.00 97.06
22:09:29 all 2.49 0.00 0.40 0.40 0.00 0.11 0.00 0.00 0.00 96.60
22:09:34 all 2.65 0.00 0.28 0.23 0.00 0.11 0.00 0.00 0.00 96.73
22:09:39 all 2.26 0.00 0.26 0.21 0.00 0.09 0.00 0.00 0.00 97.19
22:09:44 all 2.31 0.00 0.25 0.22 0.00 0.09 0.00 0.00 0.00 97.13
22:09:49 all 2.44 0.00 0.22 0.24 0.00 0.08 0.00 0.00 0.00 97.02
22:09:54 all 2.23 0.00 0.23 0.28 0.00 0.10 0.00 0.00 0.00 97.16
22:09:59 all 2.48 0.00 0.26 0.30 0.00 0.11 0.00 0.00 0.00 96.85
22:10:04 all 2.29 0.00 0.28 0.26 0.00 0.12 0.00 0.00 0.00 97.06
22:10:09 all 2.31 0.00 0.22 0.21 0.00 0.11 0.00 0.00 0.00 97.14
22:10:14 all 2.38 0.00 0.25 0.21 0.00 0.10 0.00 0.00 0.00 97.05
22:10:19 all 2.54 0.00 0.24 0.20 0.00 0.11 0.00 0.00 0.00 96.91
22:10:24 all 2.26 0.00 0.25 0.22 0.00 0.10 0.00 0.00 0.00 97.17
22:10:29 all 2.12 0.00 0.27 0.24 0.00 0.09 0.00 0.00 0.00 97.27
22:10:34 all 2.56 0.00 0.26 0.26 0.00 0.12 0.00 0.00 0.00 96.80
22:10:39 all 2.16 0.00 0.23 0.24 0.00 0.10 0.00 0.00 0.00 97.27
22:10:44 all 2.37 0.00 0.26 0.22 0.00 0.10 0.00 0.00 0.00 97.05
22:10:49 all 2.25 0.00 0.25 0.61 0.00 0.09 0.00 0.00 0.00 96.80
22:10:54 all 2.41 0.00 0.28 0.22 0.00 0.10 0.00 0.00 0.00 96.99
22:10:59 all 2.83 0.00 0.29 0.19 0.00 0.12 0.00 0.00 0.00 96.57

This output looks fine, so it doesn't show anything concerning, so
suggests the issue is somehow on the Postgres side.

Did you happen to poll pg_stat_activity at the time to see whether you
had lots of IPC waits? I'm wondering whether the storage layer is
freezing up for a moment.

Thom

#21

hubert depesz lubaczewski

depesz@depesz.com

10 months ago

In reply to: Thom Brown (#20)

#22

Adrian Klaver

adrian.klaver@aklaver.com

10 months ago

In reply to: hubert depesz lubaczewski (#19)

#23