pg_stat_io not tracking smgrwriteback() is confusing

jkatz@postgresql.org

over 2 years ago

In reply to: Andres Freund (#1)

Re: pg_stat_io not tracking smgrwriteback() is confusing

On 4/19/23 1:23 PM, Andres Freund wrote:

Hi,

I noticed that the numbers in pg_stat_io dont't quite add up to what I
expected in write heavy workloads. Particularly for checkpointer, the numbers
for "write" in log_checkpoints output are larger than what is visible in
pg_stat_io.

That partially is because log_checkpoints' "write" covers way too many things,
but there's an issue with pg_stat_io as well:

Checkpoints, and some other sources of writes, will often end up doing a lot
of smgrwriteback() calls - which pg_stat_io doesn't track. Nor do any
pre-existing forms of IO statistics.

It seems pretty clear that we should track writeback as well. I wonder if it's
worth doing so for 16? It'd give a more complete picture that way. The
counter-argument I see is that we didn't track the time for it in existing
stats either, and that nobody complained - but I suspect that's mostly because
nobody knew to look.

[RMT hat]

(sorry for slow reply on this, I've been out for a few days).

It does sound generally helpful to track writeback to ensure anyone
building around pg_stat_io can see tthe more granular picture. How big
of an effort is this? Do you think this helps to complete the feature
for v16?

Thanks,

Jonathan

Amit Kapila

amit.kapila16@gmail.com

over 2 years ago

In reply to: Jonathan S. Katz (#2)

Re: pg_stat_io not tracking smgrwriteback() is confusing

On Sun, Apr 23, 2023 at 12:55 AM Jonathan S. Katz <jkatz@postgresql.org> wrote:

On 4/19/23 1:23 PM, Andres Freund wrote:

Hi,

I noticed that the numbers in pg_stat_io dont't quite add up to what I
expected in write heavy workloads. Particularly for checkpointer, the numbers
for "write" in log_checkpoints output are larger than what is visible in
pg_stat_io.

That partially is because log_checkpoints' "write" covers way too many things,
but there's an issue with pg_stat_io as well:

Checkpoints, and some other sources of writes, will often end up doing a lot
of smgrwriteback() calls - which pg_stat_io doesn't track. Nor do any
pre-existing forms of IO statistics.

It seems pretty clear that we should track writeback as well.

Agreed. +1.

I wonder if it's
worth doing so for 16? It'd give a more complete picture that way. The
counter-argument I see is that we didn't track the time for it in existing
stats either, and that nobody complained - but I suspect that's mostly because
nobody knew to look.

[RMT hat]

(sorry for slow reply on this, I've been out for a few days).

It does sound generally helpful to track writeback to ensure anyone
building around pg_stat_io can see tthe more granular picture. How big
of an effort is this?

Right, I think this is the key factor to decide whether we can get
this in PG16 or not. If this is just adding a new column and a few
existing stats update calls then it should be okay to get in but if
this requires some more complex work then we can probably update the
docs.

--
With Regards,
Amit Kapila.

melanieplageman@gmail.com

over 2 years ago

In reply to: Andres Freund (#1)

Re: pg_stat_io not tracking smgrwriteback() is confusing

On Wed, Apr 19, 2023 at 10:23:26AM -0700, Andres Freund wrote:

Hi,

I noticed that the numbers in pg_stat_io dont't quite add up to what I
expected in write heavy workloads. Particularly for checkpointer, the numbers
for "write" in log_checkpoints output are larger than what is visible in
pg_stat_io.

That partially is because log_checkpoints' "write" covers way too many things,
but there's an issue with pg_stat_io as well:

Checkpoints, and some other sources of writes, will often end up doing a lot
of smgrwriteback() calls - which pg_stat_io doesn't track. Nor do any
pre-existing forms of IO statistics.

It seems pretty clear that we should track writeback as well. I wonder if it's
worth doing so for 16? It'd give a more complete picture that way. The
counter-argument I see is that we didn't track the time for it in existing
stats either, and that nobody complained - but I suspect that's mostly because
nobody knew to look.

Not complaining about making pg_stat_io more accurate, but what exactly
would we be tracking for smgrwriteback()? I assume you are talking about
IO timing. AFAICT, on Linux, it does sync_file_range() with
SYNC_FILE_RANGE_WRITE, which is asynchronous. Wouldn't we just be
tracking the system call overhead time?

- Melanie

andres@anarazel.de

over 2 years ago

In reply to: Melanie Plageman (#4)

Re: pg_stat_io not tracking smgrwriteback() is confusing

Hi,

On 2023-04-24 16:39:36 -0400, Melanie Plageman wrote:

On Wed, Apr 19, 2023 at 10:23:26AM -0700, Andres Freund wrote:

Hi,

I noticed that the numbers in pg_stat_io dont't quite add up to what I
expected in write heavy workloads. Particularly for checkpointer, the numbers
for "write" in log_checkpoints output are larger than what is visible in
pg_stat_io.

That partially is because log_checkpoints' "write" covers way too many things,
but there's an issue with pg_stat_io as well:

Checkpoints, and some other sources of writes, will often end up doing a lot
of smgrwriteback() calls - which pg_stat_io doesn't track. Nor do any
pre-existing forms of IO statistics.

It seems pretty clear that we should track writeback as well. I wonder if it's
worth doing so for 16? It'd give a more complete picture that way. The
counter-argument I see is that we didn't track the time for it in existing
stats either, and that nobody complained - but I suspect that's mostly because
nobody knew to look.

Not complaining about making pg_stat_io more accurate, but what exactly
would we be tracking for smgrwriteback()? I assume you are talking about
IO timing. AFAICT, on Linux, it does sync_file_range() with
SYNC_FILE_RANGE_WRITE, which is asynchronous. Wouldn't we just be
tracking the system call overhead time?

It starts blocking once "enough" IO is in flight. For things like an immediate
checkpoint, that can happen fairly quickly, unless you have a very fast IO
subsystem. So often it'll not matter whether we track smgrwriteback(), but
when it matter, it can matter a lot.

As an example, I inited' a pgbench w/ scale 1000, on a decent but not great
NVMe SSD. Created dirty data with:

c=96;/srv/dev/build/m-opt/src/bin/pgbench/pgbench --random-seed=0 -n -M prepared -c$c -j$c -T30 -P1
and then measured the checkpoint:
perf trace -s -p $pid_of_checkpointer psql -x -c "SELECT pg_stat_reset_shared('io')" -c "checkpoint"

postgres (367444), 1891280 events, 100.0%

syscall calls errors total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- ------ -------- --------- --------- --------- ------
sync_file_range 359176 0 4560.670 0.002 0.013 238.955 10.36%
pwrite64 582964 0 2874.634 0.003 0.005 0.156 0.06%
fsync 242 0 251.631 0.001 1.040 42.166 18.81%
openat 317 65 2.171 0.002 0.007 0.068 5.69%
rename 69 0 1.973 0.012 0.029 0.084 5.81%
fdatasync 1 0 1.543 1.543 1.543 1.543 0.00%
unlink 150 137 1.278 0.002 0.009 0.062 10.69%
close 250 0 0.694 0.001 0.003 0.007 3.14%
newfstatat 140 68 0.667 0.001 0.005 0.022 7.26%
write 5 0 0.067 0.005 0.013 0.025 24.55%
lseek 14 0 0.050 0.001 0.004 0.018 33.87%
getdents64 8 0 0.047 0.002 0.006 0.022 39.51%
kill 3 0 0.029 0.008 0.010 0.011 10.18%
epoll_wait 2 1 0.006 0.000 0.003 0.006 100.00%
read 1 0 0.004 0.004 0.004 0.004 0.00%

Log output:

2023-04-24 14:11:59.234 PDT [367444][checkpointer][:0][] LOG: checkpoint starting: immediate force wait
2023-04-24 14:12:09.236 PDT [367444][checkpointer][:0][] LOG: checkpoint complete: wrote 595974 buffers (28.4%); 0 WAL file(s) added, 0 removed, 68 recycled; write=9.740 s, sync=0.057 s, total=10.002 s; sync files=27, longest=0.043 s, average=0.003 s; distance=4467386 kB, estimate=4467386 kB; lsn=6/E5D33F98, redo lsn=6/E5D33F28

# SELECT writes, write_time, fsyncs, fsync_time FROM pg_stat_io WHERE backend_type = 'checkpointer';
┌────────┬────────────────────┬────────┬────────────┐
│ writes │ write_time │ fsyncs │ fsync_time │
├────────┼────────────────────┼────────┼────────────┤
│ 595914 │ 4002.1730000000002 │ 24 │ 46.359 │
└────────┴────────────────────┴────────┴────────────┘

Greetings,

Andres Freund

melanieplageman@gmail.com

over 2 years ago

In reply to: Andres Freund (#5)

Re: pg_stat_io not tracking smgrwriteback() is confusing

On Mon, Apr 24, 2023 at 02:14:32PM -0700, Andres Freund wrote:

Hi,

On 2023-04-24 16:39:36 -0400, Melanie Plageman wrote:

On Wed, Apr 19, 2023 at 10:23:26AM -0700, Andres Freund wrote:

Hi,

I noticed that the numbers in pg_stat_io dont't quite add up to what I
expected in write heavy workloads. Particularly for checkpointer, the numbers
for "write" in log_checkpoints output are larger than what is visible in
pg_stat_io.

That partially is because log_checkpoints' "write" covers way too many things,
but there's an issue with pg_stat_io as well:

Checkpoints, and some other sources of writes, will often end up doing a lot
of smgrwriteback() calls - which pg_stat_io doesn't track. Nor do any
pre-existing forms of IO statistics.

It seems pretty clear that we should track writeback as well. I wonder if it's
worth doing so for 16? It'd give a more complete picture that way. The
counter-argument I see is that we didn't track the time for it in existing
stats either, and that nobody complained - but I suspect that's mostly because
nobody knew to look.

Not complaining about making pg_stat_io more accurate, but what exactly
would we be tracking for smgrwriteback()? I assume you are talking about
IO timing. AFAICT, on Linux, it does sync_file_range() with
SYNC_FILE_RANGE_WRITE, which is asynchronous. Wouldn't we just be
tracking the system call overhead time?

It starts blocking once "enough" IO is in flight. For things like an immediate
checkpoint, that can happen fairly quickly, unless you have a very fast IO
subsystem. So often it'll not matter whether we track smgrwriteback(), but
when it matter, it can matter a lot.

I see. So, it sounds like this is most likely to happen for checkpointer
and not likely to happen for other backends who call
ScheduleBufferTagForWriteback(). Also, it seems like this (given the
current code) is only reachable for permanent relations (i.e. not for IO
object temp relation). If other backend types than checkpointer may call
smgrwriteback(), we likely have to consider the IO context. I would
imagine that we want to smgrwriteback() timing to writes/write time for
the relevant IO context and backend type.

- Melanie

andres@anarazel.de

over 2 years ago

In reply to: Melanie Plageman (#6)

Re: pg_stat_io not tracking smgrwriteback() is confusing

Hi,

On 2023-04-24 17:37:48 -0400, Melanie Plageman wrote:

On Mon, Apr 24, 2023 at 02:14:32PM -0700, Andres Freund wrote:

It starts blocking once "enough" IO is in flight. For things like an immediate
checkpoint, that can happen fairly quickly, unless you have a very fast IO
subsystem. So often it'll not matter whether we track smgrwriteback(), but
when it matter, it can matter a lot.

I see. So, it sounds like this is most likely to happen for checkpointer
and not likely to happen for other backends who call
ScheduleBufferTagForWriteback().

It's more likely, but once the IO subsystem is busy, it'll also happen for
other users ScheduleBufferTagForWriteback().

Also, it seems like this (given the current code) is only reachable for
permanent relations (i.e. not for IO object temp relation). If other backend
types than checkpointer may call smgrwriteback(), we likely have to consider
the IO context.

I think we should take it into account - it'd e.g. interesting to see a COPY
is bottlenecked on smgrwriteback() rather than just writing the data.

I would imagine that we want to smgrwriteback() timing to writes/write time
for the relevant IO context and backend type.

Yes.

Greetings,

Andres Freund

andres@anarazel.de

over 2 years ago

In reply to: Amit Kapila (#3)

Re: pg_stat_io not tracking smgrwriteback() is confusing

Hi,

On 2023-04-24 10:52:15 +0530, Amit Kapila wrote:

On Sun, Apr 23, 2023 at 12:55 AM Jonathan S. Katz <jkatz@postgresql.org> wrote:

I wonder if it's
worth doing so for 16? It'd give a more complete picture that way. The
counter-argument I see is that we didn't track the time for it in existing
stats either, and that nobody complained - but I suspect that's mostly because
nobody knew to look.

[RMT hat]

(sorry for slow reply on this, I've been out for a few days).

It does sound generally helpful to track writeback to ensure anyone
building around pg_stat_io can see tthe more granular picture. How big
of an effort is this?

Right, I think this is the key factor to decide whether we can get
this in PG16 or not. If this is just adding a new column and a few
existing stats update calls then it should be okay to get in but if
this requires some more complex work then we can probably update the
docs.

I suspect it should really just be adding a few stats calls. The only possible
complication that I can see is that we might need to pass a bit more context
down in a place or two.

Greetings,

Andres Freund

melanieplageman@gmail.com

over 2 years ago

In reply to: Andres Freund (#7)

1 attachment(s)

Re: pg_stat_io not tracking smgrwriteback() is confusing

On Mon, Apr 24, 2023 at 6:13 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2023-04-24 17:37:48 -0400, Melanie Plageman wrote:

On Mon, Apr 24, 2023 at 02:14:32PM -0700, Andres Freund wrote:

It starts blocking once "enough" IO is in flight. For things like an

immediate

checkpoint, that can happen fairly quickly, unless you have a very

fast IO

subsystem. So often it'll not matter whether we track smgrwriteback(),

but

when it matter, it can matter a lot.

I see. So, it sounds like this is most likely to happen for checkpointer
and not likely to happen for other backends who call
ScheduleBufferTagForWriteback().

It's more likely, but once the IO subsystem is busy, it'll also happen for
other users ScheduleBufferTagForWriteback().

Also, it seems like this (given the current code) is only reachable for
permanent relations (i.e. not for IO object temp relation). If other

backend

types than checkpointer may call smgrwriteback(), we likely have to

consider

the IO context.

I think we should take it into account - it'd e.g. interesting to see a
COPY
is bottlenecked on smgrwriteback() rather than just writing the data.

With the quick and dirty attached patch and using your example but with a
pgbench -T200 on my rather fast local NVMe SSD, you can still see quite
a difference.
This is with a stats reset before the checkpoint.

unpatched:

patched:

I did compare client backend stats before and after pgbench and it made
basically no difference. I'll do a COPY example like you mentioned.

Patch needs cleanup/comments and a bit more work, but I could do with
a sanity check review on the approach.

- Melanie

Attachments:

v1-0001-Add-writeback-to-pg_stat_io-writes.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Add-writeback-to-pg_stat_io-writes.patchDownload

From 6c5661ee5f0efbda9d246184f40cd799c21b5d2a Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 24 Apr 2023 18:21:54 -0400
Subject: [PATCH v1] Add writeback to pg_stat_io writes.

---
 src/backend/postmaster/bgwriter.c     |  4 ++--
 src/backend/storage/buffer/buf_init.c |  2 +-
 src/backend/storage/buffer/bufmgr.c   | 12 ++++++++++--
 src/include/storage/buf_internals.h   |  4 +++-
 4 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index caad642ec9..4c3540d6e1 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -129,7 +129,7 @@ BackgroundWriterMain(void)
 											 ALLOCSET_DEFAULT_SIZES);
 	MemoryContextSwitchTo(bgwriter_context);
 
-	WritebackContextInit(&wb_context, &bgwriter_flush_after);
+	WritebackContextInit(&wb_context, IOCONTEXT_NORMAL, &bgwriter_flush_after);
 
 	/*
 	 * If an exception is encountered, processing resumes here.
@@ -185,7 +185,7 @@ BackgroundWriterMain(void)
 		MemoryContextResetAndDeleteChildren(bgwriter_context);
 
 		/* re-initialize to avoid repeated errors causing problems */
-		WritebackContextInit(&wb_context, &bgwriter_flush_after);
+		WritebackContextInit(&wb_context, IOCONTEXT_NORMAL, &bgwriter_flush_after);
 
 		/* Now we can allow interrupts again */
 		RESUME_INTERRUPTS();
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 0057443f0c..1d80532f3b 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -146,7 +146,7 @@ InitBufferPool(void)
 	StrategyInitialize(!foundDescs);
 
 	/* Initialize per-backend file flush context */
-	WritebackContextInit(&BackendWritebackContext,
+	WritebackContextInit(&BackendWritebackContext, IOCONTEXT_NORMAL,
 						 &backend_flush_after);
 }
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1fa689052e..774dbadd08 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1685,6 +1685,7 @@ again:
 		FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
 		LWLockRelease(content_lock);
 
+		BackendWritebackContext.io_context = io_context;
 		ScheduleBufferTagForWriteback(&BackendWritebackContext,
 									  &buf_hdr->tag);
 	}
@@ -2555,7 +2556,7 @@ BufferSync(int flags)
 	if (num_to_scan == 0)
 		return;					/* nothing to do */
 
-	WritebackContextInit(&wb_context, &checkpoint_flush_after);
+	WritebackContextInit(&wb_context, IOCONTEXT_NORMAL, &checkpoint_flush_after);
 
 	TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_scan);
 
@@ -5433,10 +5434,11 @@ ts_ckpt_progress_comparator(Datum a, Datum b, void *arg)
  * writeback control will be performed.
  */
 void
-WritebackContextInit(WritebackContext *context, int *max_pending)
+WritebackContextInit(WritebackContext *context, IOContext io_context, int *max_pending)
 {
 	Assert(*max_pending <= WRITEBACK_MAX_PENDING_FLUSHES);
 
+	context->io_context = io_context;
 	context->max_pending = max_pending;
 	context->nr_pending = 0;
 }
@@ -5491,6 +5493,7 @@ ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag)
 void
 IssuePendingWritebacks(WritebackContext *context)
 {
+	instr_time	io_start;
 	int			i;
 
 	if (context->nr_pending == 0)
@@ -5502,6 +5505,8 @@ IssuePendingWritebacks(WritebackContext *context)
 	 */
 	sort_pending_writebacks(context->pending_writebacks, context->nr_pending);
 
+	io_start = pgstat_prepare_io_time();
+
 	/*
 	 * Coalesce neighbouring writes, but nothing else. For that we iterate
 	 * through the, now sorted, array of pending flushes, and look forward to
@@ -5555,6 +5560,9 @@ IssuePendingWritebacks(WritebackContext *context)
 		smgrwriteback(reln, BufTagGetForkNum(&tag), tag.blockNum, nblocks);
 	}
 
+	pgstat_count_io_op_time(IOOBJECT_RELATION, context->io_context, IOOP_WRITE,
+			io_start, context->nr_pending);
+
 	context->nr_pending = 0;
 }
 
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 136cf8fbaf..0d6e93ffad 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -295,6 +295,8 @@ typedef struct PendingWriteback
 /* struct forward declared in bufmgr.h */
 typedef struct WritebackContext
 {
+	IOContext io_context;
+
 	/* pointer to the max number of writeback requests to coalesce */
 	int		   *max_pending;
 
@@ -387,7 +389,7 @@ extern PGDLLIMPORT CkptSortItem *CkptBufferIds;
  * Internal buffer management routines
  */
 /* bufmgr.c */
-extern void WritebackContextInit(WritebackContext *context, int *max_pending);
+extern void WritebackContextInit(WritebackContext *context, IOContext io_context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *context);
 extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag);
 
-- 
2.37.2

#10

andres@anarazel.de

over 2 years ago

In reply to: Melanie Plageman (#9)

Re: pg_stat_io not tracking smgrwriteback() is confusing

Hi,

On 2023-04-24 18:36:24 -0400, Melanie Plageman wrote:

On Mon, Apr 24, 2023 at 6:13 PM Andres Freund <andres@anarazel.de> wrote:

Also, it seems like this (given the current code) is only reachable for
permanent relations (i.e. not for IO object temp relation). If other

backend

types than checkpointer may call smgrwriteback(), we likely have to

consider

the IO context.

I think we should take it into account - it'd e.g. interesting to see a
COPY
is bottlenecked on smgrwriteback() rather than just writing the data.

With the quick and dirty attached patch and using your example but with a
pgbench -T200 on my rather fast local NVMe SSD, you can still see quite
a difference.

Quite a difference between what?

What scale of pgbench did you use?

-T200 is likely not a good idea, because a timed checkpoint might "interfere",
unless you use a non-default checkpoint_timeout. A timed checkpoint won't show
the issue as easily, because checkpointer spend most of the time sleeping.

This is with a stats reset before the checkpoint.

unpatched:

backend_type | object | context | writes | write_time |
fsyncs | fsync_time
---------------------+---------------+-----------+---------+------------+---------+------------
background writer | relation | normal | 443 | 1.408 |
0 | 0
checkpointer | relation | normal | 187804 | 396.829 |
47 | 254.226

patched:

backend_type | object | context | writes | write_time
| fsyncs | fsync_time
---------------------+---------------+-----------+---------+--------------------+--------+------------
background writer | relation | normal | 917 |
4.4670000000000005 | 0 | 0
checkpointer | relation | normal | 375798 |
977.354 | 48 | 202.514

I did compare client backend stats before and after pgbench and it made
basically no difference. I'll do a COPY example like you mentioned.

Patch needs cleanup/comments and a bit more work, but I could do with
a sanity check review on the approach.

I was thinking we'd track writeback separately from the write, rather than
attributing the writeback to "write". Otherwise it looks good, based on a
quick skim.

Greetings,

Andres Freund

#11

melanieplageman@gmail.com

over 2 years ago

In reply to: Andres Freund (#10)

Re: pg_stat_io not tracking smgrwriteback() is confusing

On Mon, Apr 24, 2023 at 03:56:54PM -0700, Andres Freund wrote:

Hi,

On 2023-04-24 18:36:24 -0400, Melanie Plageman wrote:

On Mon, Apr 24, 2023 at 6:13 PM Andres Freund <andres@anarazel.de> wrote:

Also, it seems like this (given the current code) is only reachable for
permanent relations (i.e. not for IO object temp relation). If other

backend

types than checkpointer may call smgrwriteback(), we likely have to

consider

the IO context.

I think we should take it into account - it'd e.g. interesting to see a
COPY
is bottlenecked on smgrwriteback() rather than just writing the data.

With the quick and dirty attached patch and using your example but with a
pgbench -T200 on my rather fast local NVMe SSD, you can still see quite
a difference.

Quite a difference between what?

With and without the patch. Meaning: clearly tracking writeback is a good idea.

What scale of pgbench did you use?

1000, as you did

-T200 is likely not a good idea, because a timed checkpoint might "interfere",
unless you use a non-default checkpoint_timeout. A timed checkpoint won't show
the issue as easily, because checkpointer spend most of the time sleeping.

Ah, I see. I did not use a non-default checkpoint timeout.

Patch needs cleanup/comments and a bit more work, but I could do with
a sanity check review on the approach.

I was thinking we'd track writeback separately from the write, rather than
attributing the writeback to "write". Otherwise it looks good, based on a
quick skim.

Like you want a separate IOOp IOOP_WRITEBACK? Interesting. Okay.

- Melanie

#12

jkatz@postgresql.org

over 2 years ago

In reply to: Andres Freund (#8)

Re: pg_stat_io not tracking smgrwriteback() is confusing

On 4/24/23 6:14 PM, Andres Freund wrote:

Hi,

On 2023-04-24 10:52:15 +0530, Amit Kapila wrote:

On Sun, Apr 23, 2023 at 12:55 AM Jonathan S. Katz <jkatz@postgresql.org> wrote:

I wonder if it's
worth doing so for 16? It'd give a more complete picture that way. The
counter-argument I see is that we didn't track the time for it in existing
stats either, and that nobody complained - but I suspect that's mostly because
nobody knew to look.

[RMT hat]

(sorry for slow reply on this, I've been out for a few days).

It does sound generally helpful to track writeback to ensure anyone
building around pg_stat_io can see tthe more granular picture. How big
of an effort is this?

Right, I think this is the key factor to decide whether we can get
this in PG16 or not. If this is just adding a new column and a few
existing stats update calls then it should be okay to get in but if
this requires some more complex work then we can probably update the
docs.

I suspect it should really just be adding a few stats calls. The only possible
complication that I can see is that we might need to pass a bit more context
down in a place or two.

OK. So far it sounds reasonable to include. I think we should add this
as an open item. I don't know if we need to set a deadline just yet, but
we should try to keep go/nogo to earlier in the beta cycle.

Thanks,

Jonathan

#13

melanieplageman@gmail.com

over 2 years ago

In reply to: Melanie Plageman (#11)

1 attachment(s)

Re: pg_stat_io not tracking smgrwriteback() is confusing

On Mon, Apr 24, 2023 at 7:02 PM Melanie Plageman <melanieplageman@gmail.com>
wrote:

On Mon, Apr 24, 2023 at 03:56:54PM -0700, Andres Freund wrote:

I was thinking we'd track writeback separately from the write, rather

than

attributing the writeback to "write". Otherwise it looks good, based on

a

quick skim.

Like you want a separate IOOp IOOP_WRITEBACK? Interesting. Okay.

Okay, attached v2 does this (adds IOOP_WRITEBACK).

With my patch applied and the same pgbench setup as you (for -T30):

After pgbench:

backend_type | object | context | writes | write_time |
writebacks | writeback_time | fsyncs | fsync_time |
---------------------+---------------+-----------+----------+------------+------------+----------------+---------+------------+
background writer | relation | normal | 5581 | 23.416 |
5568 | 32.33 | 0 | 0 |
checkpointer | relation | normal | 89116 | 295.576 |
89106 | 416.5 | 84 | 5242.764 |

and then after a stats reset followed by an explicit checkpoint:

I've yet to cook up a client backend test case (e.g. with COPY). I've taken
that as a todo.

I have a few outstanding questions:

1) Does it make sense for writebacks to count the number of blocks for
which writeback was requested or the number of calls to smgrwriteback() or
the number of actual syscalls made? We don't actually know from outside
of mdwriteback() how many FileWriteback() calls we will make.

2) I'm a little nervous about not including IOObject in the writeback
context. Technically, there is nothing stopping local buffer code from
calling IssuePendingWritebacks(). Right now, local buffer code doesn't
do ScheduleBufferTagForWriteback(). But it doesn't seem quite right to
hardcode in IOOBJECT_RELATION when there is nothing wrong with
requesting writeback of local buffers (AFAIK). What do you think?

3) Should any restrictions be added to pgstat_tracks_io_object() or
pgstat_tracks_io_op()? I couldn't think of any backend types or IO
contexts which would not do writeback as a rule. Also, though we don't
do writeback for temp tables now, it isn't nonsensical to do so. In
this version, I didn't add any restrictions.

Docs need work. I added a placeholder for the new columns. I'll update it
once we decide what writebacks should actually count. And, I don't think
we can do any kind of ongoing test.

- Melanie

Attachments:

v2-0001-Add-writeback-to-pg_stat_io.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Add-writeback-to-pg_stat_io.patchDownload

From b20765526e5c4bb5a82734ad9977552ebf9dad2f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 24 Apr 2023 18:21:54 -0400
Subject: [PATCH v2] Add writeback to pg_stat_io

---
 doc/src/sgml/monitoring.sgml          | 24 ++++++++++++++++++++++++
 src/backend/catalog/system_views.sql  |  2 ++
 src/backend/postmaster/bgwriter.c     |  4 ++--
 src/backend/storage/buffer/buf_init.c |  2 +-
 src/backend/storage/buffer/bufmgr.c   | 12 ++++++++++--
 src/backend/utils/adt/pgstatfuncs.c   |  5 +++++
 src/include/catalog/pg_proc.dat       |  6 +++---
 src/include/pgstat.h                  |  3 ++-
 src/include/storage/buf_internals.h   |  4 +++-
 src/test/regress/expected/rules.out   |  4 +++-
 10 files changed, 55 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 99f7f95c39..62572327cf 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3867,6 +3867,30 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>writebacks</structfield> <type>bigint</type>
+       </para>
+       <para>
+        Number of writeback operations, each of the size specified in
+        <varname>op_bytes</varname>.
+       </para>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>writeback_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in writeback operations in milliseconds (if
+        <xref linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 48aacf66ee..d0c932ad0e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1131,6 +1131,8 @@ SELECT
        b.read_time,
        b.writes,
        b.write_time,
+       b.writebacks,
+       b.writeback_time,
        b.extends,
        b.extend_time,
        b.op_bytes,
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index caad642ec9..4c3540d6e1 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -129,7 +129,7 @@ BackgroundWriterMain(void)
 											 ALLOCSET_DEFAULT_SIZES);
 	MemoryContextSwitchTo(bgwriter_context);
 
-	WritebackContextInit(&wb_context, &bgwriter_flush_after);
+	WritebackContextInit(&wb_context, IOCONTEXT_NORMAL, &bgwriter_flush_after);
 
 	/*
 	 * If an exception is encountered, processing resumes here.
@@ -185,7 +185,7 @@ BackgroundWriterMain(void)
 		MemoryContextResetAndDeleteChildren(bgwriter_context);
 
 		/* re-initialize to avoid repeated errors causing problems */
-		WritebackContextInit(&wb_context, &bgwriter_flush_after);
+		WritebackContextInit(&wb_context, IOCONTEXT_NORMAL, &bgwriter_flush_after);
 
 		/* Now we can allow interrupts again */
 		RESUME_INTERRUPTS();
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 0057443f0c..1d80532f3b 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -146,7 +146,7 @@ InitBufferPool(void)
 	StrategyInitialize(!foundDescs);
 
 	/* Initialize per-backend file flush context */
-	WritebackContextInit(&BackendWritebackContext,
+	WritebackContextInit(&BackendWritebackContext, IOCONTEXT_NORMAL,
 						 &backend_flush_after);
 }
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1fa689052e..188e135feb 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1685,6 +1685,7 @@ again:
 		FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
 		LWLockRelease(content_lock);
 
+		BackendWritebackContext.io_context = io_context;
 		ScheduleBufferTagForWriteback(&BackendWritebackContext,
 									  &buf_hdr->tag);
 	}
@@ -2555,7 +2556,7 @@ BufferSync(int flags)
 	if (num_to_scan == 0)
 		return;					/* nothing to do */
 
-	WritebackContextInit(&wb_context, &checkpoint_flush_after);
+	WritebackContextInit(&wb_context, IOCONTEXT_NORMAL, &checkpoint_flush_after);
 
 	TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_scan);
 
@@ -5433,10 +5434,11 @@ ts_ckpt_progress_comparator(Datum a, Datum b, void *arg)
  * writeback control will be performed.
  */
 void
-WritebackContextInit(WritebackContext *context, int *max_pending)
+WritebackContextInit(WritebackContext *context, IOContext io_context, int *max_pending)
 {
 	Assert(*max_pending <= WRITEBACK_MAX_PENDING_FLUSHES);
 
+	context->io_context = io_context;
 	context->max_pending = max_pending;
 	context->nr_pending = 0;
 }
@@ -5491,6 +5493,7 @@ ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag)
 void
 IssuePendingWritebacks(WritebackContext *context)
 {
+	instr_time	io_start;
 	int			i;
 
 	if (context->nr_pending == 0)
@@ -5502,6 +5505,8 @@ IssuePendingWritebacks(WritebackContext *context)
 	 */
 	sort_pending_writebacks(context->pending_writebacks, context->nr_pending);
 
+	io_start = pgstat_prepare_io_time();
+
 	/*
 	 * Coalesce neighbouring writes, but nothing else. For that we iterate
 	 * through the, now sorted, array of pending flushes, and look forward to
@@ -5555,6 +5560,9 @@ IssuePendingWritebacks(WritebackContext *context)
 		smgrwriteback(reln, BufTagGetForkNum(&tag), tag.blockNum, nblocks);
 	}
 
+	pgstat_count_io_op_time(IOOBJECT_RELATION, context->io_context, IOOP_WRITEBACK,
+							io_start, context->nr_pending);
+
 	context->nr_pending = 0;
 }
 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 0d57b6a7c0..70da0a2de1 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1268,6 +1268,8 @@ typedef enum io_stat_col
 	IO_COL_READ_TIME,
 	IO_COL_WRITES,
 	IO_COL_WRITE_TIME,
+	IO_COL_WRITEBACKS,
+	IO_COL_WRITEBACK_TIME,
 	IO_COL_EXTENDS,
 	IO_COL_EXTEND_TIME,
 	IO_COL_CONVERSION,
@@ -1303,6 +1305,8 @@ pgstat_get_io_op_index(IOOp io_op)
 			return IO_COL_REUSES;
 		case IOOP_WRITE:
 			return IO_COL_WRITES;
+		case IOOP_WRITEBACK:
+			return IO_COL_WRITEBACKS;
 	}
 
 	elog(ERROR, "unrecognized IOOp value: %d", io_op);
@@ -1322,6 +1326,7 @@ pgstat_get_io_time_index(IOOp io_op)
 	{
 		case IOOP_READ:
 		case IOOP_WRITE:
+		case IOOP_WRITEBACK:
 		case IOOP_EXTEND:
 		case IOOP_FSYNC:
 			return pgstat_get_io_op_index(io_op) + 1;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b2bc81b15f..e766b9b600 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5760,9 +5760,9 @@
   proname => 'pg_stat_get_io', provolatile => 'v',
   prorows => '30', proretset => 't',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{text,text,text,int8,float8,int8,float8,int8,float8,int8,int8,int8,int8,int8,float8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{backend_type,object,context,reads,read_time,writes,write_time,extends,extend_time,op_bytes,hits,evictions,reuses,fsyncs,fsync_time,stats_reset}',
+  proallargtypes => '{text,text,text,int8,float8,int8,float8,int8,float8,int8,float8,int8,int8,int8,int8,int8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{backend_type,object,context,reads,read_time,writes,write_time,writebacks,writeback_time,extends,extend_time,op_bytes,hits,evictions,reuses,fsyncs,fsync_time,stats_reset}',
   prosrc => 'pg_stat_get_io' },
 
 { oid => '1136', descr => 'statistics: information about WAL activity',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index fff4ad5b6d..57a2c0866a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -300,9 +300,10 @@ typedef enum IOOp
 	IOOP_READ,
 	IOOP_REUSE,
 	IOOP_WRITE,
+	IOOP_WRITEBACK,
 } IOOp;
 
-#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+#define IOOP_NUM_TYPES (IOOP_WRITEBACK + 1)
 
 typedef struct PgStat_BktypeIO
 {
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 136cf8fbaf..331ca26ce3 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -295,6 +295,8 @@ typedef struct PendingWriteback
 /* struct forward declared in bufmgr.h */
 typedef struct WritebackContext
 {
+	IOContext	io_context;
+
 	/* pointer to the max number of writeback requests to coalesce */
 	int		   *max_pending;
 
@@ -387,7 +389,7 @@ extern PGDLLIMPORT CkptSortItem *CkptBufferIds;
  * Internal buffer management routines
  */
 /* bufmgr.c */
-extern void WritebackContextInit(WritebackContext *context, int *max_pending);
+extern void WritebackContextInit(WritebackContext *context, IOContext io_context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *context);
 extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag);
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 69957687fe..0bd6db2bf8 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1887,6 +1887,8 @@ pg_stat_io| SELECT backend_type,
     read_time,
     writes,
     write_time,
+    writebacks,
+    writeback_time,
     extends,
     extend_time,
     op_bytes,
@@ -1896,7 +1898,7 @@ pg_stat_io| SELECT backend_type,
     fsyncs,
     fsync_time,
     stats_reset
-   FROM pg_stat_get_io() b(backend_type, object, context, reads, read_time, writes, write_time, extends, extend_time, op_bytes, hits, evictions, reuses, fsyncs, fsync_time, stats_reset);
+   FROM pg_stat_get_io() b(backend_type, object, context, reads, read_time, writes, write_time, writebacks, writeback_time, extends, extend_time, op_bytes, hits, evictions, reuses, fsyncs, fsync_time, stats_reset);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
-- 
2.37.2

#14

melanieplageman@gmail.com

over 2 years ago

In reply to: Melanie Plageman (#13)

1 attachment(s)

Re: pg_stat_io not tracking smgrwriteback() is confusing

On Mon, Apr 24, 2023 at 9:29 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I've yet to cook up a client backend test case (e.g. with COPY). I've taken
that as a todo.

It was trivial to see client backend writebacks in almost any scenario
once I set backend_flush_after. I wonder if it is worth mentioning the
various "*flush_after" gucs in the docs?

I have a few outstanding questions:

1) Does it make sense for writebacks to count the number of blocks for
which writeback was requested or the number of calls to smgrwriteback() or
the number of actual syscalls made? We don't actually know from outside
of mdwriteback() how many FileWriteback() calls we will make.

So, in the attached v3, I've kept the first method: writebacks are the
number of blocks which the backend has requested writeback of. I'd like
it to be clear in the docs exactly what writebacks are (so that people
know not to add them together with writes or something like that). I
made an effort but could use further docs review.

2) I'm a little nervous about not including IOObject in the writeback
context. Technically, there is nothing stopping local buffer code from
calling IssuePendingWritebacks(). Right now, local buffer code doesn't
do ScheduleBufferTagForWriteback(). But it doesn't seem quite right to
hardcode in IOOBJECT_RELATION when there is nothing wrong with
requesting writeback of local buffers (AFAIK). What do you think?

I've gone ahead and added IOObject to the WritebackContext.

- Melanie

Attachments:

v3-0001-Add-writeback-to-pg_stat_io.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Add-writeback-to-pg_stat_io.patchDownload

From 8db3a8319553777024379add76527b8a6908028b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 24 Apr 2023 18:21:54 -0400
Subject: [PATCH v3] Add writeback to pg_stat_io

28e626bde00 added the notion of IOOps but neglected to include
writeback. With the addition of IO timing to pg_stat_io in ac8d53dae5,
the omission of writeback caused some confusion. Checkpointer write
timing in pg_stat_io often differed greatly from the write timing
written to the log. To fix this, add IOOp IOOP_WRITEBACK and track
writebacks and writeback timing in pg_stat_io.

Discussion: https://postgr.es/m/20230419172326.dhgyo4wrrhulovt6%40awork3.anarazel.de
---
 doc/src/sgml/monitoring.sgml          | 27 +++++++++++++++++++++++++++
 src/backend/catalog/system_views.sql  |  2 ++
 src/backend/postmaster/bgwriter.c     | 10 ++++++++--
 src/backend/storage/buffer/buf_init.c | 11 ++++++++---
 src/backend/storage/buffer/bufmgr.c   | 20 ++++++++++++++++++--
 src/backend/utils/adt/pgstatfuncs.c   |  5 +++++
 src/include/catalog/pg_proc.dat       |  6 +++---
 src/include/pgstat.h                  |  3 ++-
 src/include/storage/buf_internals.h   |  6 +++++-
 src/test/regress/expected/rules.out   |  4 +++-
 src/test/regress/expected/stats.out   |  4 ++--
 src/test/regress/sql/stats.sql        |  4 ++--
 12 files changed, 85 insertions(+), 17 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 99f7f95c39..d6062b3d89 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3867,6 +3867,33 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>writebacks</structfield> <type>bigint</type>
+       </para>
+       <para>
+        Number of units of size <varname>op_bytes</varname> which the backend
+        requested the kernel write out to permanent storage.
+       </para>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>writeback_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in writeback operations in milliseconds (if
+        <xref linkend="guc-track-io-timing"/> is enabled, otherwise zero). This
+        does not necessarily count the time spent by the kernel writing the
+        data out. The backend will initiate write-out of the dirty pages and
+        wait only if the request queue is full.
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 48aacf66ee..d0c932ad0e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1131,6 +1131,8 @@ SELECT
        b.read_time,
        b.writes,
        b.write_time,
+       b.writebacks,
+       b.writeback_time,
        b.extends,
        b.extend_time,
        b.op_bytes,
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index caad642ec9..dbf866f262 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -129,7 +129,12 @@ BackgroundWriterMain(void)
 											 ALLOCSET_DEFAULT_SIZES);
 	MemoryContextSwitchTo(bgwriter_context);
 
-	WritebackContextInit(&wb_context, &bgwriter_flush_after);
+	/*
+	 * bgwriter will only request writeback for permanent relations in
+	 * IOCONTEXT_NORMAL.
+	 */
+	WritebackContextInit(&wb_context, IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+						 &bgwriter_flush_after);
 
 	/*
 	 * If an exception is encountered, processing resumes here.
@@ -185,7 +190,8 @@ BackgroundWriterMain(void)
 		MemoryContextResetAndDeleteChildren(bgwriter_context);
 
 		/* re-initialize to avoid repeated errors causing problems */
-		WritebackContextInit(&wb_context, &bgwriter_flush_after);
+		WritebackContextInit(&wb_context, IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+							 &bgwriter_flush_after);
 
 		/* Now we can allow interrupts again */
 		RESUME_INTERRUPTS();
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 0057443f0c..b050e3c2aa 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -145,9 +145,14 @@ InitBufferPool(void)
 	/* Init other shared buffer-management stuff */
 	StrategyInitialize(!foundDescs);
 
-	/* Initialize per-backend file flush context */
-	WritebackContextInit(&BackendWritebackContext,
-						 &backend_flush_after);
+	/*
+	 * Initialize per-backend file flush context. IOObject is initialized to
+	 * IOOBJECT_RELATION and IOContext to IOCONTEXT_NORMAL since these are the
+	 * most likely targets for writeback. The backend can overwrite these as
+	 * appropriate.
+	 */
+	WritebackContextInit(&BackendWritebackContext, IOOBJECT_RELATION,
+						 IOCONTEXT_NORMAL, &backend_flush_after);
 }
 
 /*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1fa689052e..116910cdfe 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1685,6 +1685,8 @@ again:
 		FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
 		LWLockRelease(content_lock);
 
+		BackendWritebackContext.io_object = IOOBJECT_RELATION;
+		BackendWritebackContext.io_context = io_context;
 		ScheduleBufferTagForWriteback(&BackendWritebackContext,
 									  &buf_hdr->tag);
 	}
@@ -2555,7 +2557,12 @@ BufferSync(int flags)
 	if (num_to_scan == 0)
 		return;					/* nothing to do */
 
-	WritebackContextInit(&wb_context, &checkpoint_flush_after);
+	/*
+	 * Checkpointer will only request writeback of permanent relations in
+	 * IOCONTEXT_NORMAL.
+	 */
+	WritebackContextInit(&wb_context, IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+						 &checkpoint_flush_after);
 
 	TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_scan);
 
@@ -5433,10 +5440,13 @@ ts_ckpt_progress_comparator(Datum a, Datum b, void *arg)
  * writeback control will be performed.
  */
 void
-WritebackContextInit(WritebackContext *context, int *max_pending)
+WritebackContextInit(WritebackContext *context, IOObject io_object,
+					 IOContext io_context, int *max_pending)
 {
 	Assert(*max_pending <= WRITEBACK_MAX_PENDING_FLUSHES);
 
+	context->io_object = io_object;
+	context->io_context = io_context;
 	context->max_pending = max_pending;
 	context->nr_pending = 0;
 }
@@ -5491,6 +5501,7 @@ ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag)
 void
 IssuePendingWritebacks(WritebackContext *context)
 {
+	instr_time	io_start;
 	int			i;
 
 	if (context->nr_pending == 0)
@@ -5502,6 +5513,8 @@ IssuePendingWritebacks(WritebackContext *context)
 	 */
 	sort_pending_writebacks(context->pending_writebacks, context->nr_pending);
 
+	io_start = pgstat_prepare_io_time();
+
 	/*
 	 * Coalesce neighbouring writes, but nothing else. For that we iterate
 	 * through the, now sorted, array of pending flushes, and look forward to
@@ -5555,6 +5568,9 @@ IssuePendingWritebacks(WritebackContext *context)
 		smgrwriteback(reln, BufTagGetForkNum(&tag), tag.blockNum, nblocks);
 	}
 
+	pgstat_count_io_op_time(context->io_object, context->io_context,
+							IOOP_WRITEBACK, io_start, context->nr_pending);
+
 	context->nr_pending = 0;
 }
 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 0d57b6a7c0..70da0a2de1 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1268,6 +1268,8 @@ typedef enum io_stat_col
 	IO_COL_READ_TIME,
 	IO_COL_WRITES,
 	IO_COL_WRITE_TIME,
+	IO_COL_WRITEBACKS,
+	IO_COL_WRITEBACK_TIME,
 	IO_COL_EXTENDS,
 	IO_COL_EXTEND_TIME,
 	IO_COL_CONVERSION,
@@ -1303,6 +1305,8 @@ pgstat_get_io_op_index(IOOp io_op)
 			return IO_COL_REUSES;
 		case IOOP_WRITE:
 			return IO_COL_WRITES;
+		case IOOP_WRITEBACK:
+			return IO_COL_WRITEBACKS;
 	}
 
 	elog(ERROR, "unrecognized IOOp value: %d", io_op);
@@ -1322,6 +1326,7 @@ pgstat_get_io_time_index(IOOp io_op)
 	{
 		case IOOP_READ:
 		case IOOP_WRITE:
+		case IOOP_WRITEBACK:
 		case IOOP_EXTEND:
 		case IOOP_FSYNC:
 			return pgstat_get_io_op_index(io_op) + 1;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b2bc81b15f..e766b9b600 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5760,9 +5760,9 @@
   proname => 'pg_stat_get_io', provolatile => 'v',
   prorows => '30', proretset => 't',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{text,text,text,int8,float8,int8,float8,int8,float8,int8,int8,int8,int8,int8,float8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{backend_type,object,context,reads,read_time,writes,write_time,extends,extend_time,op_bytes,hits,evictions,reuses,fsyncs,fsync_time,stats_reset}',
+  proallargtypes => '{text,text,text,int8,float8,int8,float8,int8,float8,int8,float8,int8,int8,int8,int8,int8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{backend_type,object,context,reads,read_time,writes,write_time,writebacks,writeback_time,extends,extend_time,op_bytes,hits,evictions,reuses,fsyncs,fsync_time,stats_reset}',
   prosrc => 'pg_stat_get_io' },
 
 { oid => '1136', descr => 'statistics: information about WAL activity',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index fff4ad5b6d..57a2c0866a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -300,9 +300,10 @@ typedef enum IOOp
 	IOOP_READ,
 	IOOP_REUSE,
 	IOOP_WRITE,
+	IOOP_WRITEBACK,
 } IOOp;
 
-#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+#define IOOP_NUM_TYPES (IOOP_WRITEBACK + 1)
 
 typedef struct PgStat_BktypeIO
 {
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 136cf8fbaf..0921710605 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -295,6 +295,9 @@ typedef struct PendingWriteback
 /* struct forward declared in bufmgr.h */
 typedef struct WritebackContext
 {
+	IOObject	io_object;
+	IOContext	io_context;
+
 	/* pointer to the max number of writeback requests to coalesce */
 	int		   *max_pending;
 
@@ -387,7 +390,8 @@ extern PGDLLIMPORT CkptSortItem *CkptBufferIds;
  * Internal buffer management routines
  */
 /* bufmgr.c */
-extern void WritebackContextInit(WritebackContext *context, int *max_pending);
+extern void WritebackContextInit(WritebackContext *context, IOObject io_object,
+								 IOContext io_context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *context);
 extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag);
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 69957687fe..0bd6db2bf8 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1887,6 +1887,8 @@ pg_stat_io| SELECT backend_type,
     read_time,
     writes,
     write_time,
+    writebacks,
+    writeback_time,
     extends,
     extend_time,
     op_bytes,
@@ -1896,7 +1898,7 @@ pg_stat_io| SELECT backend_type,
     fsyncs,
     fsync_time,
     stats_reset
-   FROM pg_stat_get_io() b(backend_type, object, context, reads, read_time, writes, write_time, extends, extend_time, op_bytes, hits, evictions, reuses, fsyncs, fsync_time, stats_reset);
+   FROM pg_stat_get_io() b(backend_type, object, context, reads, read_time, writes, write_time, writebacks, writeback_time, extends, extend_time, op_bytes, hits, evictions, reuses, fsyncs, fsync_time, stats_reset);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 813d6d39ea..26ff27911a 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1384,7 +1384,7 @@ SELECT pg_stat_have_stats('io', 0, 0);
  t
 (1 row)
 
-SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(hits) AS io_stats_pre_reset
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_pre_reset
   FROM pg_stat_io \gset
 SELECT pg_stat_reset_shared('io');
  pg_stat_reset_shared 
@@ -1392,7 +1392,7 @@ SELECT pg_stat_reset_shared('io');
  
 (1 row)
 
-SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(hits) AS io_stats_post_reset
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_post_reset
   FROM pg_stat_io \gset
 SELECT :io_stats_post_reset < :io_stats_pre_reset;
  ?column? 
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 99a28bb79c..257e91d10d 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -694,10 +694,10 @@ SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_ext
 
 -- Test IO stats reset
 SELECT pg_stat_have_stats('io', 0, 0);
-SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(hits) AS io_stats_pre_reset
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_pre_reset
   FROM pg_stat_io \gset
 SELECT pg_stat_reset_shared('io');
-SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(hits) AS io_stats_post_reset
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_post_reset
   FROM pg_stat_io \gset
 SELECT :io_stats_post_reset < :io_stats_pre_reset;
 
-- 
2.37.2

#15

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 2 years ago

In reply to: Melanie Plageman (#14)

Re: pg_stat_io not tracking smgrwriteback() is confusing

At Wed, 26 Apr 2023 17:08:14 -0400, Melanie Plageman <melanieplageman@gmail.com> wrote in

On Mon, Apr 24, 2023 at 9:29 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I've yet to cook up a client backend test case (e.g. with COPY). I've taken
that as a todo.

It was trivial to see client backend writebacks in almost any scenario
once I set backend_flush_after. I wonder if it is worth mentioning the
various "*flush_after" gucs in the docs?

I have a few outstanding questions:

1) Does it make sense for writebacks to count the number of blocks for
which writeback was requested or the number of calls to smgrwriteback() or
the number of actual syscalls made? We don't actually know from outside
of mdwriteback() how many FileWriteback() calls we will make.

So, in the attached v3, I've kept the first method: writebacks are the
number of blocks which the backend has requested writeback of. I'd like
it to be clear in the docs exactly what writebacks are (so that people
know not to add them together with writes or something like that). I
made an effort but could use further docs review.

+        Number of units of size <varname>op_bytes</varname> which the backend
+        requested the kernel write out to permanent storage.

I just want to mention that it is not necessarily the same as the
number of system calls, but I'm not sure what others think about that.

+        Time spent in writeback operations in milliseconds (if
+        <xref linkend="guc-track-io-timing"/> is enabled, otherwise zero). This
+        does not necessarily count the time spent by the kernel writing the
+        data out. The backend will initiate write-out of the dirty pages and
+        wait only if the request queue is full.

The last sentence looks like it's taken from the sync_file_range() man
page, but I think it's a bit too detailed. We could just say, "The
time usually only includes the time it takes to queue write-out
requests.", bit I'm not sure wh...

2) I'm a little nervous about not including IOObject in the writeback
context. Technically, there is nothing stopping local buffer code from
calling IssuePendingWritebacks(). Right now, local buffer code doesn't
do ScheduleBufferTagForWriteback(). But it doesn't seem quite right to
hardcode in IOOBJECT_RELATION when there is nothing wrong with
requesting writeback of local buffers (AFAIK). What do you think?

I've gone ahead and added IOObject to the WritebackContext.

The smgropen call in IssuePendingWritebacks below clearly shows that
the function only deals with shared buffers.

/* and finally tell the kernel to write the data to storage */
reln = smgropen(currlocator, InvalidBackendId);
smgrwriteback(reln, BufTagGetForkNum(&tag), tag.blockNum, nblocks);

The callback-related code fully depends on callers following its
expectation. So we can rewrite the following comment added to
InitBufferPoll with a more confident tone.

+	 * Initialize per-backend file flush context. IOObject is initialized to
+	 * IOOBJECT_RELATION and IOContext to IOCONTEXT_NORMAL since these are the
+	 * most likely targets for writeback. The backend can overwrite these as
+	 * appropriate.

Or I actually think we might not even need to pass around the io_*
parameters and could just pass immediate values to the
pgstat_count_io_op_time call. If we ever start using shared buffers
for thing other than relation files (for example SLRU?), we'll have to
consider the target individually for each buffer block. That being
said, I'm fine with how it is either.

Regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#16

melanieplageman@gmail.com

over 2 years ago

In reply to: Kyotaro Horiguchi (#15)

1 attachment(s)

Re: pg_stat_io not tracking smgrwriteback() is confusing

Thanks for the review!

On Wed, Apr 26, 2023 at 10:22 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Wed, 26 Apr 2023 17:08:14 -0400, Melanie Plageman <melanieplageman@gmail.com> wrote in

On Mon, Apr 24, 2023 at 9:29 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I've yet to cook up a client backend test case (e.g. with COPY). I've taken
that as a todo.

It was trivial to see client backend writebacks in almost any scenario
once I set backend_flush_after. I wonder if it is worth mentioning the
various "*flush_after" gucs in the docs?

I have a few outstanding questions:

1) Does it make sense for writebacks to count the number of blocks for
which writeback was requested or the number of calls to smgrwriteback() or
the number of actual syscalls made? We don't actually know from outside
of mdwriteback() how many FileWriteback() calls we will make.

So, in the attached v3, I've kept the first method: writebacks are the
number of blocks which the backend has requested writeback of. I'd like
it to be clear in the docs exactly what writebacks are (so that people
know not to add them together with writes or something like that). I
made an effort but could use further docs review.
+        Number of units of size <varname>op_bytes</varname> which the backend
+        requested the kernel write out to permanent storage.
I just want to mention that it is not necessarily the same as the
number of system calls, but I'm not sure what others think about that.

My thinking is that some other IO operations, for example, extends,
count the number of blocks extended and not the number of syscalls.

+        Time spent in writeback operations in milliseconds (if
+        <xref linkend="guc-track-io-timing"/> is enabled, otherwise zero). This
+        does not necessarily count the time spent by the kernel writing the
+        data out. The backend will initiate write-out of the dirty pages and
+        wait only if the request queue is full.
The last sentence looks like it's taken from the sync_file_range() man
page, but I think it's a bit too detailed. We could just say, "The
time usually only includes the time it takes to queue write-out
requests.", bit I'm not sure wh...

Ah, yes, I indeed took heavy inspiration from the sync_file_range()
man page :) I've modified this comment in the attached v4. I didn't want
to say "usually" since I imagine it is quite workload and configuration
dependent.

2) I'm a little nervous about not including IOObject in the writeback
context. Technically, there is nothing stopping local buffer code from
calling IssuePendingWritebacks(). Right now, local buffer code doesn't
do ScheduleBufferTagForWriteback(). But it doesn't seem quite right to
hardcode in IOOBJECT_RELATION when there is nothing wrong with
requesting writeback of local buffers (AFAIK). What do you think?

I've gone ahead and added IOObject to the WritebackContext.

The smgropen call in IssuePendingWritebacks below clearly shows that
the function only deals with shared buffers.

/* and finally tell the kernel to write the data to storage */
reln = smgropen(currlocator, InvalidBackendId);
smgrwriteback(reln, BufTagGetForkNum(&tag), tag.blockNum, nblocks);

Yes, as it is currently, IssuePendingWritebacks() is only used for shared
buffers. My rationale for including IOObject is that localbuf.c calls
smgr* functions and there isn't anything stopping it from calling
smgrwriteback() or using WritebackContexts (AFAICT).

The callback-related code fully depends on callers following its
expectation. So we can rewrite the following comment added to
InitBufferPoll with a more confident tone.
+        * Initialize per-backend file flush context. IOObject is initialized to
+        * IOOBJECT_RELATION and IOContext to IOCONTEXT_NORMAL since these are the
+        * most likely targets for writeback. The backend can overwrite these as
+        * appropriate.

I have updated this comment to be more confident and specific.

Or I actually think we might not even need to pass around the io_*
parameters and could just pass immediate values to the
pgstat_count_io_op_time call. If we ever start using shared buffers
for thing other than relation files (for example SLRU?), we'll have to
consider the target individually for each buffer block. That being
said, I'm fine with how it is either.

In IssuePendingWritebacks() we don't actually know which IOContext we
are issuing writebacks for when we call pgstat_count_io_op_time() (we do
issue pending writebacks for other IOContexts than IOCONTEXT_NORMAL). I
agree IOObject is not strictly necessary right now. I've kept IOObject a
member of WritebackContext for the reasons I mention above, however, I
am open to removing it if it adds confusion.

- Melanie

Attachments:

v4-0001-Add-writeback-to-pg_stat_io.patchtext/x-patch; charset=US-ASCII; name=v4-0001-Add-writeback-to-pg_stat_io.patchDownload

From 7cdd6fc78ed82180a705ab9667714f80d08c5f7d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 24 Apr 2023 18:21:54 -0400
Subject: [PATCH v4] Add writeback to pg_stat_io

28e626bde00 added the notion of IOOps but neglected to include
writeback. With the addition of IO timing to pg_stat_io in ac8d53dae5,
the omission of writeback caused some confusion. Checkpointer write
timing in pg_stat_io often differed greatly from the write timing
written to the log. To fix this, add IOOp IOOP_WRITEBACK and track
writebacks and writeback timing in pg_stat_io.

Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20230419172326.dhgyo4wrrhulovt6%40awork3.anarazel.de
---
 doc/src/sgml/monitoring.sgml          | 26 ++++++++++++++++++++++++++
 src/backend/catalog/system_views.sql  |  2 ++
 src/backend/postmaster/bgwriter.c     | 10 ++++++++--
 src/backend/storage/buffer/buf_init.c | 12 +++++++++---
 src/backend/storage/buffer/bufmgr.c   | 20 ++++++++++++++++++--
 src/backend/utils/adt/pgstatfuncs.c   |  5 +++++
 src/include/catalog/pg_proc.dat       |  6 +++---
 src/include/pgstat.h                  |  3 ++-
 src/include/storage/buf_internals.h   |  6 +++++-
 src/test/regress/expected/rules.out   |  4 +++-
 src/test/regress/expected/stats.out   |  4 ++--
 src/test/regress/sql/stats.sql        |  4 ++--
 12 files changed, 85 insertions(+), 17 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 99f7f95c39..27b6f1a0a0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3867,6 +3867,32 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>writebacks</structfield> <type>bigint</type>
+       </para>
+       <para>
+        Number of units of size <varname>op_bytes</varname> which the backend
+        requested the kernel write out to permanent storage.
+       </para>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>writeback_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in writeback operations in milliseconds (if
+        <xref linkend="guc-track-io-timing"/> is enabled, otherwise zero). This
+        includes the time spent queueing write-out requests and, potentially,
+        the time spent to write out the dirty data.
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 48aacf66ee..d0c932ad0e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1131,6 +1131,8 @@ SELECT
        b.read_time,
        b.writes,
        b.write_time,
+       b.writebacks,
+       b.writeback_time,
        b.extends,
        b.extend_time,
        b.op_bytes,
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index caad642ec9..dbf866f262 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -129,7 +129,12 @@ BackgroundWriterMain(void)
 											 ALLOCSET_DEFAULT_SIZES);
 	MemoryContextSwitchTo(bgwriter_context);
 
-	WritebackContextInit(&wb_context, &bgwriter_flush_after);
+	/*
+	 * bgwriter will only request writeback for permanent relations in
+	 * IOCONTEXT_NORMAL.
+	 */
+	WritebackContextInit(&wb_context, IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+						 &bgwriter_flush_after);
 
 	/*
 	 * If an exception is encountered, processing resumes here.
@@ -185,7 +190,8 @@ BackgroundWriterMain(void)
 		MemoryContextResetAndDeleteChildren(bgwriter_context);
 
 		/* re-initialize to avoid repeated errors causing problems */
-		WritebackContextInit(&wb_context, &bgwriter_flush_after);
+		WritebackContextInit(&wb_context, IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+							 &bgwriter_flush_after);
 
 		/* Now we can allow interrupts again */
 		RESUME_INTERRUPTS();
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 0057443f0c..a7182fe95a 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -145,9 +145,15 @@ InitBufferPool(void)
 	/* Init other shared buffer-management stuff */
 	StrategyInitialize(!foundDescs);
 
-	/* Initialize per-backend file flush context */
-	WritebackContextInit(&BackendWritebackContext,
-						 &backend_flush_after);
+	/*
+	 * Initialize per-backend file flush context. IOContext is initialized to
+	 * IOCONTEXT_NORMAL because this is the most common context. IOObject is
+	 * initialized to IOOBJECT_RELATION because writeback is currently only
+	 * requested for permanent relations in shared buffers. The backend can
+	 * overwrite these as appropriate.
+	 */
+	WritebackContextInit(&BackendWritebackContext, IOOBJECT_RELATION,
+						 IOCONTEXT_NORMAL, &backend_flush_after);
 }
 
 /*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1fa689052e..116910cdfe 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1685,6 +1685,8 @@ again:
 		FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
 		LWLockRelease(content_lock);
 
+		BackendWritebackContext.io_object = IOOBJECT_RELATION;
+		BackendWritebackContext.io_context = io_context;
 		ScheduleBufferTagForWriteback(&BackendWritebackContext,
 									  &buf_hdr->tag);
 	}
@@ -2555,7 +2557,12 @@ BufferSync(int flags)
 	if (num_to_scan == 0)
 		return;					/* nothing to do */
 
-	WritebackContextInit(&wb_context, &checkpoint_flush_after);
+	/*
+	 * Checkpointer will only request writeback of permanent relations in
+	 * IOCONTEXT_NORMAL.
+	 */
+	WritebackContextInit(&wb_context, IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+						 &checkpoint_flush_after);
 
 	TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_scan);
 
@@ -5433,10 +5440,13 @@ ts_ckpt_progress_comparator(Datum a, Datum b, void *arg)
  * writeback control will be performed.
  */
 void
-WritebackContextInit(WritebackContext *context, int *max_pending)
+WritebackContextInit(WritebackContext *context, IOObject io_object,
+					 IOContext io_context, int *max_pending)
 {
 	Assert(*max_pending <= WRITEBACK_MAX_PENDING_FLUSHES);
 
+	context->io_object = io_object;
+	context->io_context = io_context;
 	context->max_pending = max_pending;
 	context->nr_pending = 0;
 }
@@ -5491,6 +5501,7 @@ ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag)
 void
 IssuePendingWritebacks(WritebackContext *context)
 {
+	instr_time	io_start;
 	int			i;
 
 	if (context->nr_pending == 0)
@@ -5502,6 +5513,8 @@ IssuePendingWritebacks(WritebackContext *context)
 	 */
 	sort_pending_writebacks(context->pending_writebacks, context->nr_pending);
 
+	io_start = pgstat_prepare_io_time();
+
 	/*
 	 * Coalesce neighbouring writes, but nothing else. For that we iterate
 	 * through the, now sorted, array of pending flushes, and look forward to
@@ -5555,6 +5568,9 @@ IssuePendingWritebacks(WritebackContext *context)
 		smgrwriteback(reln, BufTagGetForkNum(&tag), tag.blockNum, nblocks);
 	}
 
+	pgstat_count_io_op_time(context->io_object, context->io_context,
+							IOOP_WRITEBACK, io_start, context->nr_pending);
+
 	context->nr_pending = 0;
 }
 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 0d57b6a7c0..70da0a2de1 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1268,6 +1268,8 @@ typedef enum io_stat_col
 	IO_COL_READ_TIME,
 	IO_COL_WRITES,
 	IO_COL_WRITE_TIME,
+	IO_COL_WRITEBACKS,
+	IO_COL_WRITEBACK_TIME,
 	IO_COL_EXTENDS,
 	IO_COL_EXTEND_TIME,
 	IO_COL_CONVERSION,
@@ -1303,6 +1305,8 @@ pgstat_get_io_op_index(IOOp io_op)
 			return IO_COL_REUSES;
 		case IOOP_WRITE:
 			return IO_COL_WRITES;
+		case IOOP_WRITEBACK:
+			return IO_COL_WRITEBACKS;
 	}
 
 	elog(ERROR, "unrecognized IOOp value: %d", io_op);
@@ -1322,6 +1326,7 @@ pgstat_get_io_time_index(IOOp io_op)
 	{
 		case IOOP_READ:
 		case IOOP_WRITE:
+		case IOOP_WRITEBACK:
 		case IOOP_EXTEND:
 		case IOOP_FSYNC:
 			return pgstat_get_io_op_index(io_op) + 1;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b2bc81b15f..e766b9b600 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5760,9 +5760,9 @@
   proname => 'pg_stat_get_io', provolatile => 'v',
   prorows => '30', proretset => 't',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{text,text,text,int8,float8,int8,float8,int8,float8,int8,int8,int8,int8,int8,float8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{backend_type,object,context,reads,read_time,writes,write_time,extends,extend_time,op_bytes,hits,evictions,reuses,fsyncs,fsync_time,stats_reset}',
+  proallargtypes => '{text,text,text,int8,float8,int8,float8,int8,float8,int8,float8,int8,int8,int8,int8,int8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{backend_type,object,context,reads,read_time,writes,write_time,writebacks,writeback_time,extends,extend_time,op_bytes,hits,evictions,reuses,fsyncs,fsync_time,stats_reset}',
   prosrc => 'pg_stat_get_io' },
 
 { oid => '1136', descr => 'statistics: information about WAL activity',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index fff4ad5b6d..57a2c0866a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -300,9 +300,10 @@ typedef enum IOOp
 	IOOP_READ,
 	IOOP_REUSE,
 	IOOP_WRITE,
+	IOOP_WRITEBACK,
 } IOOp;
 
-#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+#define IOOP_NUM_TYPES (IOOP_WRITEBACK + 1)
 
 typedef struct PgStat_BktypeIO
 {
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 136cf8fbaf..0921710605 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -295,6 +295,9 @@ typedef struct PendingWriteback
 /* struct forward declared in bufmgr.h */
 typedef struct WritebackContext
 {
+	IOObject	io_object;
+	IOContext	io_context;
+
 	/* pointer to the max number of writeback requests to coalesce */
 	int		   *max_pending;
 
@@ -387,7 +390,8 @@ extern PGDLLIMPORT CkptSortItem *CkptBufferIds;
  * Internal buffer management routines
  */
 /* bufmgr.c */
-extern void WritebackContextInit(WritebackContext *context, int *max_pending);
+extern void WritebackContextInit(WritebackContext *context, IOObject io_object,
+								 IOContext io_context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *context);
 extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag);
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 69957687fe..0bd6db2bf8 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1887,6 +1887,8 @@ pg_stat_io| SELECT backend_type,
     read_time,
     writes,
     write_time,
+    writebacks,
+    writeback_time,
     extends,
     extend_time,
     op_bytes,
@@ -1896,7 +1898,7 @@ pg_stat_io| SELECT backend_type,
     fsyncs,
     fsync_time,
     stats_reset
-   FROM pg_stat_get_io() b(backend_type, object, context, reads, read_time, writes, write_time, extends, extend_time, op_bytes, hits, evictions, reuses, fsyncs, fsync_time, stats_reset);
+   FROM pg_stat_get_io() b(backend_type, object, context, reads, read_time, writes, write_time, writebacks, writeback_time, extends, extend_time, op_bytes, hits, evictions, reuses, fsyncs, fsync_time, stats_reset);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 813d6d39ea..26ff27911a 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1384,7 +1384,7 @@ SELECT pg_stat_have_stats('io', 0, 0);
  t
 (1 row)
 
-SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(hits) AS io_stats_pre_reset
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_pre_reset
   FROM pg_stat_io \gset
 SELECT pg_stat_reset_shared('io');
  pg_stat_reset_shared 
@@ -1392,7 +1392,7 @@ SELECT pg_stat_reset_shared('io');
  
 (1 row)
 
-SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(hits) AS io_stats_post_reset
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_post_reset
   FROM pg_stat_io \gset
 SELECT :io_stats_post_reset < :io_stats_pre_reset;
  ?column? 
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 99a28bb79c..257e91d10d 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -694,10 +694,10 @@ SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_ext
 
 -- Test IO stats reset
 SELECT pg_stat_have_stats('io', 0, 0);
-SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(hits) AS io_stats_pre_reset
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_pre_reset
   FROM pg_stat_io \gset
 SELECT pg_stat_reset_shared('io');
-SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(hits) AS io_stats_post_reset
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_post_reset
   FROM pg_stat_io \gset
 SELECT :io_stats_post_reset < :io_stats_pre_reset;
 
-- 
2.37.2

#17

jkatz@postgresql.org

over 2 years ago

In reply to: Melanie Plageman (#16)

Re: pg_stat_io not tracking smgrwriteback() is confusing

On 4/27/23 11:36 AM, Melanie Plageman wrote:

Thanks for the review!

On Wed, Apr 26, 2023 at 10:22 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Wed, 26 Apr 2023 17:08:14 -0400, Melanie Plageman <melanieplageman@gmail.com> wrote in

On Mon, Apr 24, 2023 at 9:29 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I've yet to cook up a client backend test case (e.g. with COPY). I've taken
that as a todo.

It was trivial to see client backend writebacks in almost any scenario
once I set backend_flush_after. I wonder if it is worth mentioning the
various "*flush_after" gucs in the docs?

I have a few outstanding questions:

1) Does it make sense for writebacks to count the number of blocks for
which writeback was requested or the number of calls to smgrwriteback() or
the number of actual syscalls made? We don't actually know from outside
of mdwriteback() how many FileWriteback() calls we will make.

So, in the attached v3, I've kept the first method: writebacks are the
number of blocks which the backend has requested writeback of. I'd like
it to be clear in the docs exactly what writebacks are (so that people
know not to add them together with writes or something like that). I
made an effort but could use further docs review.
+        Number of units of size <varname>op_bytes</varname> which the backend
+        requested the kernel write out to permanent storage.
I just want to mention that it is not necessarily the same as the
number of system calls, but I'm not sure what others think about that.
My thinking is that some other IO operations, for example, extends,
count the number of blocks extended and not the number of syscalls.
+        Time spent in writeback operations in milliseconds (if
+        <xref linkend="guc-track-io-timing"/> is enabled, otherwise zero). This
+        does not necessarily count the time spent by the kernel writing the
+        data out. The backend will initiate write-out of the dirty pages and
+        wait only if the request queue is full.
The last sentence looks like it's taken from the sync_file_range() man
page, but I think it's a bit too detailed. We could just say, "The
time usually only includes the time it takes to queue write-out
requests.", bit I'm not sure wh...
Ah, yes, I indeed took heavy inspiration from the sync_file_range()
man page :) I've modified this comment in the attached v4. I didn't want
to say "usually" since I imagine it is quite workload and configuration
dependent.

2) I'm a little nervous about not including IOObject in the writeback
context. Technically, there is nothing stopping local buffer code from
calling IssuePendingWritebacks(). Right now, local buffer code doesn't
do ScheduleBufferTagForWriteback(). But it doesn't seem quite right to
hardcode in IOOBJECT_RELATION when there is nothing wrong with
requesting writeback of local buffers (AFAIK). What do you think?

I've gone ahead and added IOObject to the WritebackContext.

The smgropen call in IssuePendingWritebacks below clearly shows that
the function only deals with shared buffers.

/* and finally tell the kernel to write the data to storage */
reln = smgropen(currlocator, InvalidBackendId);
smgrwriteback(reln, BufTagGetForkNum(&tag), tag.blockNum, nblocks);

Yes, as it is currently, IssuePendingWritebacks() is only used for shared
buffers. My rationale for including IOObject is that localbuf.c calls
smgr* functions and there isn't anything stopping it from calling
smgrwriteback() or using WritebackContexts (AFAICT).
The callback-related code fully depends on callers following its
expectation. So we can rewrite the following comment added to
InitBufferPoll with a more confident tone.
+        * Initialize per-backend file flush context. IOObject is initialized to
+        * IOOBJECT_RELATION and IOContext to IOCONTEXT_NORMAL since these are the
+        * most likely targets for writeback. The backend can overwrite these as
+        * appropriate.
I have updated this comment to be more confident and specific.

Or I actually think we might not even need to pass around the io_*
parameters and could just pass immediate values to the
pgstat_count_io_op_time call. If we ever start using shared buffers
for thing other than relation files (for example SLRU?), we'll have to
consider the target individually for each buffer block. That being
said, I'm fine with how it is either.

In IssuePendingWritebacks() we don't actually know which IOContext we
are issuing writebacks for when we call pgstat_count_io_op_time() (we do
issue pending writebacks for other IOContexts than IOCONTEXT_NORMAL). I
agree IOObject is not strictly necessary right now. I've kept IOObject a
member of WritebackContext for the reasons I mention above, however, I
am open to removing it if it adds confusion.

[RMT hat]

Horiguchi-san: do the changes that Melanie made address your feedback?

It'd be good if we can get this into Beta 1 if everyone is comfortable
with the patch.

Thanks,

Jonathan

#18

andres@anarazel.de

over 2 years ago

In reply to: Melanie Plageman (#16)

Re: pg_stat_io not tracking smgrwriteback() is confusing

Hi,

On 2023-04-27 11:36:49 -0400, Melanie Plageman wrote:

/* and finally tell the kernel to write the data to storage */
reln = smgropen(currlocator, InvalidBackendId);
smgrwriteback(reln, BufTagGetForkNum(&tag), tag.blockNum, nblocks);

Yes, as it is currently, IssuePendingWritebacks() is only used for shared
buffers. My rationale for including IOObject is that localbuf.c calls
smgr* functions and there isn't anything stopping it from calling
smgrwriteback() or using WritebackContexts (AFAICT).

I think it's extremely unlikely that we'll ever do that, because it's very
common to have temp tables that are bigger than temp_buffers. We basically
hope that the kernel can do good caching for us there.

Or I actually think we might not even need to pass around the io_*
parameters and could just pass immediate values to the
pgstat_count_io_op_time call. If we ever start using shared buffers
for thing other than relation files (for example SLRU?), we'll have to
consider the target individually for each buffer block. That being
said, I'm fine with how it is either.

In IssuePendingWritebacks() we don't actually know which IOContext we
are issuing writebacks for when we call pgstat_count_io_op_time() (we do
issue pending writebacks for other IOContexts than IOCONTEXT_NORMAL). I
agree IOObject is not strictly necessary right now. I've kept IOObject a
member of WritebackContext for the reasons I mention above, however, I
am open to removing it if it adds confusion.

I don't think it's really worth adding struct members for potential future
safety. We can just add them later if we end up needing them.

From 7cdd6fc78ed82180a705ab9667714f80d08c5f7d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 24 Apr 2023 18:21:54 -0400
Subject: [PATCH v4] Add writeback to pg_stat_io

28e626bde00 added the notion of IOOps but neglected to include
writeback. With the addition of IO timing to pg_stat_io in ac8d53dae5,
the omission of writeback caused some confusion. Checkpointer write
timing in pg_stat_io often differed greatly from the write timing
written to the log. To fix this, add IOOp IOOP_WRITEBACK and track
writebacks and writeback timing in pg_stat_io.

For the future: It'd be good to note that catversion needs to be increased.

Off-topic: I've been wondering about computing a "catversion hash" based on
all the files going into initdb. At least to have some tooling to detect
missing catversion increases...

index 99f7f95c39..27b6f1a0a0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3867,6 +3867,32 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</entry>
</row>

+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>writebacks</structfield> <type>bigint</type>
+       </para>
+       <para>
+        Number of units of size <varname>op_bytes</varname> which the backend
+        requested the kernel write out to permanent storage.
+       </para>
+      </entry>
+     </row>

I think the reference to "backend" here is somewhat misplaced - it could be
checkpointer or bgwriter as well. We don't reference the backend in other
comparable columns of pgsio either...

diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 0057443f0c..a7182fe95a 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -145,9 +145,15 @@ InitBufferPool(void)
/* Init other shared buffer-management stuff */
StrategyInitialize(!foundDescs);

-	/* Initialize per-backend file flush context */
-	WritebackContextInit(&BackendWritebackContext,
-						 &backend_flush_after);
+	/*
+	 * Initialize per-backend file flush context. IOContext is initialized to
+	 * IOCONTEXT_NORMAL because this is the most common context. IOObject is
+	 * initialized to IOOBJECT_RELATION because writeback is currently only
+	 * requested for permanent relations in shared buffers. The backend can
+	 * overwrite these as appropriate.
+	 */
+	WritebackContextInit(&BackendWritebackContext, IOOBJECT_RELATION,
+						 IOCONTEXT_NORMAL, &backend_flush_after);
}

This seems somewhat icky.

/*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1fa689052e..116910cdfe 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1685,6 +1685,8 @@ again:
FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
LWLockRelease(content_lock);

+		BackendWritebackContext.io_object = IOOBJECT_RELATION;
+		BackendWritebackContext.io_context = io_context;
ScheduleBufferTagForWriteback(&BackendWritebackContext,
&buf_hdr->tag);
}

What about passing the io_context to ScheduleBufferTagForWriteback instead?

--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql

Hm. Could we add a test for this? While it's not implemented everywhere, we
still issue the smgrwriteback() afaics. The default for the _flush_after GUCs
changes, but backend_flush_after is USERSET, so we could just change it for a
single command.

Greetings,

Andres Freund

#19

andres@anarazel.de

over 2 years ago

In reply to: Jonathan S. Katz (#17)

Re: pg_stat_io not tracking smgrwriteback() is confusing

Hi,

On 2023-05-03 11:36:10 -0400, Jonathan S. Katz wrote:

It'd be good if we can get this into Beta 1 if everyone is comfortable with
the patch.

I think we need one more iteration, then I think it can be committed. The
changes are docs phrasing and polishing the API a bit, which shouldn't be too
hard. I'll try to look more tomorrow.

Greetings,

Andres Freund

#20

andres@anarazel.de

over 2 years ago

In reply to: Melanie Plageman (#13)

Re: pg_stat_io not tracking smgrwriteback() is confusing

Hi,

On 2023-04-24 21:29:48 -0400, Melanie Plageman wrote:

1) Does it make sense for writebacks to count the number of blocks for
which writeback was requested or the number of calls to smgrwriteback() or
the number of actual syscalls made? We don't actually know from outside
of mdwriteback() how many FileWriteback() calls we will make.

The number of blocks is the right thing IMO. The rest is just some
micro-optimization.

2) I'm a little nervous about not including IOObject in the writeback
context. Technically, there is nothing stopping local buffer code from
calling IssuePendingWritebacks(). Right now, local buffer code doesn't
do ScheduleBufferTagForWriteback(). But it doesn't seem quite right to
hardcode in IOOBJECT_RELATION when there is nothing wrong with
requesting writeback of local buffers (AFAIK). What do you think?

I think it'd be wrong on performance grounds ;). We could add an assertion to
ScheduleBufferTagForWriteback(), I guess, to document that fact?

3) Should any restrictions be added to pgstat_tracks_io_object() or
pgstat_tracks_io_op()? I couldn't think of any backend types or IO
contexts which would not do writeback as a rule. Also, though we don't
do writeback for temp tables now, it isn't nonsensical to do so. In
this version, I didn't add any restrictions.

I think the temp table restriction could be encoded for now, I don't forsee
that changing anytime soon.

About other restrictions: Anything that triggers catalog access can trigger
buffers to be written back. Checkpointer and bgwriter don't do catalog access,
but have explicit writeback calls. WAL receiver, WAL writer, syslogger and
archiver are excluded on a backend type basis. I think the startup process
will use the normal backend logic, and thus also trigger writebacks?

Greetings,

Andres Freund

#21

jkatz@postgresql.org

over 2 years ago

In reply to: Andres Freund (#19)

Re: pg_stat_io not tracking smgrwriteback() is confusing

On 5/4/23 12:46 PM, Andres Freund wrote:

Hi,

On 2023-05-03 11:36:10 -0400, Jonathan S. Katz wrote:

It'd be good if we can get this into Beta 1 if everyone is comfortable with
the patch.

I think we need one more iteration, then I think it can be committed. The
changes are docs phrasing and polishing the API a bit, which shouldn't be too
hard. I'll try to look more tomorrow.

[RMT hat, personal opinion]

Great to hear. From my skim of the patch, I had thought the conclusion
would be something similar, but did want to hear from you & Melanie on that.

Thanks,

Jonathan

#22

melanieplageman@gmail.com

over 2 years ago

In reply to: Andres Freund (#20)

3 attachment(s)

Re: pg_stat_io not tracking smgrwriteback() is confusing

v5 attached.

On Thu, May 4, 2023 at 12:44 PM Andres Freund <andres@anarazel.de> wrote:

On 2023-04-27 11:36:49 -0400, Melanie Plageman wrote:

/* and finally tell the kernel to write the data to storage */
reln = smgropen(currlocator, InvalidBackendId);
smgrwriteback(reln, BufTagGetForkNum(&tag), tag.blockNum, nblocks);

Yes, as it is currently, IssuePendingWritebacks() is only used for shared
buffers. My rationale for including IOObject is that localbuf.c calls
smgr* functions and there isn't anything stopping it from calling
smgrwriteback() or using WritebackContexts (AFAICT).

I think it's extremely unlikely that we'll ever do that, because it's very
common to have temp tables that are bigger than temp_buffers. We basically
hope that the kernel can do good caching for us there.

Or I actually think we might not even need to pass around the io_*
parameters and could just pass immediate values to the
pgstat_count_io_op_time call. If we ever start using shared buffers
for thing other than relation files (for example SLRU?), we'll have to
consider the target individually for each buffer block. That being
said, I'm fine with how it is either.

In IssuePendingWritebacks() we don't actually know which IOContext we
are issuing writebacks for when we call pgstat_count_io_op_time() (we do
issue pending writebacks for other IOContexts than IOCONTEXT_NORMAL). I
agree IOObject is not strictly necessary right now. I've kept IOObject a
member of WritebackContext for the reasons I mention above, however, I
am open to removing it if it adds confusion.

I don't think it's really worth adding struct members for potential future
safety. We can just add them later if we end up needing them.

I've removed both members of WritebackContext and hard-coded
IOOBJECT_RELATION in the call to pgstat_count_io_op_time().

From 7cdd6fc78ed82180a705ab9667714f80d08c5f7d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 24 Apr 2023 18:21:54 -0400
Subject: [PATCH v4] Add writeback to pg_stat_io

28e626bde00 added the notion of IOOps but neglected to include
writeback. With the addition of IO timing to pg_stat_io in ac8d53dae5,
the omission of writeback caused some confusion. Checkpointer write
timing in pg_stat_io often differed greatly from the write timing
written to the log. To fix this, add IOOp IOOP_WRITEBACK and track
writebacks and writeback timing in pg_stat_io.

For the future: It'd be good to note that catversion needs to be increased.

Noted. I've added it to the commit message since I did a new version
anyway.

index 99f7f95c39..27b6f1a0a0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3867,6 +3867,32 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</entry>
</row>

+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>writebacks</structfield> <type>bigint</type>
+       </para>
+       <para>
+        Number of units of size <varname>op_bytes</varname> which the backend
+        requested the kernel write out to permanent storage.
+       </para>
+      </entry>
+     </row>

I think the reference to "backend" here is somewhat misplaced - it could be
checkpointer or bgwriter as well. We don't reference the backend in other
comparable columns of pgsio either...

So, I tried to come up with something that doesn't make reference to
any "requester" of the writeback and the best I could do was:

"Number of units of size op_bytes requested that the kernel write out."

This is awfully awkward sounding.

"backend_type" is the name of the column in pg_stat_io. Client backends
are always referred to as such in the pg_stat_io documentation. Thus, I
think it is reasonable to use the word "backend" and assume people
understand it could be any type of backend.

However, since the existing docs for pg_stat_bgwriter use "backend" to
mean "client backend", and I see a few uses of the word "process" in the
stats docs, I've changed my use of the word "backend" to "process".

diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 0057443f0c..a7182fe95a 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -145,9 +145,15 @@ InitBufferPool(void)
/* Init other shared buffer-management stuff */
StrategyInitialize(!foundDescs);

-     /* Initialize per-backend file flush context */
-     WritebackContextInit(&BackendWritebackContext,
-                                              &backend_flush_after);
+     /*
+      * Initialize per-backend file flush context. IOContext is initialized to
+      * IOCONTEXT_NORMAL because this is the most common context. IOObject is
+      * initialized to IOOBJECT_RELATION because writeback is currently only
+      * requested for permanent relations in shared buffers. The backend can
+      * overwrite these as appropriate.
+      */
+     WritebackContextInit(&BackendWritebackContext, IOOBJECT_RELATION,
+                                              IOCONTEXT_NORMAL, &backend_flush_after);
}

This seems somewhat icky.

I've removed both IOObject and IOContext from WritebackContext.

/*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1fa689052e..116910cdfe 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1685,6 +1685,8 @@ again:
FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
LWLockRelease(content_lock);

+             BackendWritebackContext.io_object = IOOBJECT_RELATION;
+             BackendWritebackContext.io_context = io_context;
ScheduleBufferTagForWriteback(&BackendWritebackContext,
&buf_hdr->tag);
}

What about passing the io_context to ScheduleBufferTagForWriteback instead?

I assume we don't want to include the time spent in
sort_pending_writebacks(), so I've added io_context as a parameter to
ScheduleBufferTagForWriteback and threaded it through
IssuePendingWritebacks() as well.

Because WritebackContext was just called "context" as a function
parameter to these functions and that was easy to confuse with
"io_context", I've changed the name of the WritebackContext function
parameter to "wb_context". I separated this rename into its own commit so
that the diff for the commit adding writeback is more clear. I assume
the committer will squash those commits.

--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
Hm. Could we add a test for this? While it's not implemented everywhere, we
still issue the smgrwriteback() afaics. The default for the _flush_after GUCs
changes, but backend_flush_after is USERSET, so we could just change it for a
single command.

I couldn't come up with a way to write a test for this.
GetVictimBuffer() is only called when flushing a dirty buffer. I tried
adding backend_flush_after = 1 and sum(writebacks) to the test for
reuses with vacuum strategy, but there were never more writebacks (in
any context) after doing the full table rewrite with VACUUM. I presume
this is because checkpointer or bgwriter is writing out the dirty
buffers before our client backend gets to reusing them. And, since
bgwriter/checkpointer_flush_after are not USERSET, I don't think we can
guarantee this will cause writeback operations. Just anecdotally, I
increased size of the table to exceed checkpoint_flush_after on my
Postgres instance and I could get the test to cause writeback, but that
doesn't work for a portable test. This was the same reason we couldn't
test writes for VACUUM strategy.

I did notice while working on this that, with the addition of the VACUUM
parameter BUFFER_USAGE_LIMIT, we could decrease the size of the table in
the vacuum strategy reuses test. Not sure if this is legit to commit now
since it isn't required for the writeback patch set, but I included a
patch for it in this patchset.

On Thu, May 4, 2023 at 12:57 PM Andres Freund <andres@anarazel.de> wrote:

On 2023-04-24 21:29:48 -0400, Melanie Plageman wrote:

2) I'm a little nervous about not including IOObject in the writeback
context. Technically, there is nothing stopping local buffer code from
calling IssuePendingWritebacks(). Right now, local buffer code doesn't
do ScheduleBufferTagForWriteback(). But it doesn't seem quite right to
hardcode in IOOBJECT_RELATION when there is nothing wrong with
requesting writeback of local buffers (AFAIK). What do you think?

I think it'd be wrong on performance grounds ;). We could add an assertion to
ScheduleBufferTagForWriteback(), I guess, to document that fact?

Now that it doesn't have access to IOObject, no need. I've added
comments elsewhere.

3) Should any restrictions be added to pgstat_tracks_io_object() or
pgstat_tracks_io_op()? I couldn't think of any backend types or IO
contexts which would not do writeback as a rule. Also, though we don't
do writeback for temp tables now, it isn't nonsensical to do so. In
this version, I didn't add any restrictions.

I think the temp table restriction could be encoded for now, I don't forsee
that changing anytime soon.

I've done that in the attached v5.

- Melanie

Attachments:

v5-0001-BUFFER_USAGE_LIMIT-reduces-needed-test-table-size.patchtext/x-patch; charset=US-ASCII; name=v5-0001-BUFFER_USAGE_LIMIT-reduces-needed-test-table-size.patchDownload

From 0ab9af06934ab551f2224493e913ae26000b3f87 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sat, 6 May 2023 12:24:28 -0400
Subject: [PATCH v5 1/3] BUFFER_USAGE_LIMIT reduces needed test table size

Using the minimum BUFFER_USAGE_LIMIT value, the VACUUM ring buffer is
smaller and we can make one of the pg_stat_io test tables smaller while
still causing reuses.
---
 src/test/regress/expected/stats.out | 6 ++++--
 src/test/regress/sql/stats.sql      | 6 ++++--
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 813d6d39ea..d4e4889e7b 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1337,11 +1337,13 @@ SET wal_skip_threshold = '1 kB';
 SELECT sum(reuses) AS reuses, sum(reads) AS reads
   FROM pg_stat_io WHERE context = 'vacuum' \gset io_sum_vac_strategy_before_
 CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
-INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 4500)i;
 -- Ensure that the next VACUUM will need to perform IO by rewriting the table
 -- first with VACUUM (FULL).
 VACUUM (FULL) test_io_vac_strategy;
-VACUUM (PARALLEL 0) test_io_vac_strategy;
+-- Use the minimum BUFFER_USAGE_LIMIT to cause reuses with the smallest table
+-- possible.
+VACUUM (PARALLEL 0, BUFFER_USAGE_LIMIT 128) test_io_vac_strategy;
 SELECT pg_stat_force_next_flush();
  pg_stat_force_next_flush 
 --------------------------
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 99a28bb79c..7ba341314f 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -670,11 +670,13 @@ SET wal_skip_threshold = '1 kB';
 SELECT sum(reuses) AS reuses, sum(reads) AS reads
   FROM pg_stat_io WHERE context = 'vacuum' \gset io_sum_vac_strategy_before_
 CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
-INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 4500)i;
 -- Ensure that the next VACUUM will need to perform IO by rewriting the table
 -- first with VACUUM (FULL).
 VACUUM (FULL) test_io_vac_strategy;
-VACUUM (PARALLEL 0) test_io_vac_strategy;
+-- Use the minimum BUFFER_USAGE_LIMIT to cause reuses with the smallest table
+-- possible.
+VACUUM (PARALLEL 0, BUFFER_USAGE_LIMIT 128) test_io_vac_strategy;
 SELECT pg_stat_force_next_flush();
 SELECT sum(reuses) AS reuses, sum(reads) AS reads
   FROM pg_stat_io WHERE context = 'vacuum' \gset io_sum_vac_strategy_after_
-- 
2.37.2

v5-0002-Update-parameter-name-context-to-wb_context.patchtext/x-patch; charset=US-ASCII; name=v5-0002-Update-parameter-name-context-to-wb_context.patchDownload

From 87a9fa0da3895083d24bb2ee9a744a5afa11e5af Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sat, 6 May 2023 11:11:04 -0400
Subject: [PATCH v5 2/3] Update parameter name context to wb_context

For clarity of review, renaming the function parameter "context" in
ScheduleBufferTagForWriteback() and IssuePendingWritebacks() to
"wb_context" is a separate commit. The next commit adds an "io_context"
parameter and "wb_context" makes it more clear which is which.
---
 src/backend/storage/buffer/bufmgr.c | 28 ++++++++++++++--------------
 src/include/storage/buf_internals.h |  4 ++--
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1fa689052e..041d5dc129 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5445,7 +5445,7 @@ WritebackContextInit(WritebackContext *context, int *max_pending)
  * Add buffer to list of pending writeback requests.
  */
 void
-ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag)
+ScheduleBufferTagForWriteback(WritebackContext *wb_context, BufferTag *tag)
 {
 	PendingWriteback *pending;
 
@@ -5456,11 +5456,11 @@ ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag)
 	 * Add buffer to the pending writeback array, unless writeback control is
 	 * disabled.
 	 */
-	if (*context->max_pending > 0)
+	if (*wb_context->max_pending > 0)
 	{
-		Assert(*context->max_pending <= WRITEBACK_MAX_PENDING_FLUSHES);
+		Assert(*wb_context->max_pending <= WRITEBACK_MAX_PENDING_FLUSHES);
 
-		pending = &context->pending_writebacks[context->nr_pending++];
+		pending = &wb_context->pending_writebacks[wb_context->nr_pending++];
 
 		pending->tag = *tag;
 	}
@@ -5470,8 +5470,8 @@ ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag)
 	 * includes the case where previously an item has been added, but control
 	 * is now disabled.
 	 */
-	if (context->nr_pending >= *context->max_pending)
-		IssuePendingWritebacks(context);
+	if (wb_context->nr_pending >= *wb_context->max_pending)
+		IssuePendingWritebacks(wb_context);
 }
 
 #define ST_SORT sort_pending_writebacks
@@ -5489,25 +5489,25 @@ ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag)
  * error out - it's just a hint.
  */
 void
-IssuePendingWritebacks(WritebackContext *context)
+IssuePendingWritebacks(WritebackContext *wb_context)
 {
 	int			i;
 
-	if (context->nr_pending == 0)
+	if (wb_context->nr_pending == 0)
 		return;
 
 	/*
 	 * Executing the writes in-order can make them a lot faster, and allows to
 	 * merge writeback requests to consecutive blocks into larger writebacks.
 	 */
-	sort_pending_writebacks(context->pending_writebacks, context->nr_pending);
+	sort_pending_writebacks(wb_context->pending_writebacks, wb_context->nr_pending);
 
 	/*
 	 * Coalesce neighbouring writes, but nothing else. For that we iterate
 	 * through the, now sorted, array of pending flushes, and look forward to
 	 * find all neighbouring (or identical) writes.
 	 */
-	for (i = 0; i < context->nr_pending; i++)
+	for (i = 0; i < wb_context->nr_pending; i++)
 	{
 		PendingWriteback *cur;
 		PendingWriteback *next;
@@ -5517,7 +5517,7 @@ IssuePendingWritebacks(WritebackContext *context)
 		RelFileLocator currlocator;
 		Size		nblocks = 1;
 
-		cur = &context->pending_writebacks[i];
+		cur = &wb_context->pending_writebacks[i];
 		tag = cur->tag;
 		currlocator = BufTagGetRelFileLocator(&tag);
 
@@ -5525,10 +5525,10 @@ IssuePendingWritebacks(WritebackContext *context)
 		 * Peek ahead, into following writeback requests, to see if they can
 		 * be combined with the current one.
 		 */
-		for (ahead = 0; i + ahead + 1 < context->nr_pending; ahead++)
+		for (ahead = 0; i + ahead + 1 < wb_context->nr_pending; ahead++)
 		{
 
-			next = &context->pending_writebacks[i + ahead + 1];
+			next = &wb_context->pending_writebacks[i + ahead + 1];
 
 			/* different file, stop */
 			if (!RelFileLocatorEquals(currlocator,
@@ -5555,7 +5555,7 @@ IssuePendingWritebacks(WritebackContext *context)
 		smgrwriteback(reln, BufTagGetForkNum(&tag), tag.blockNum, nblocks);
 	}
 
-	context->nr_pending = 0;
+	wb_context->nr_pending = 0;
 }
 
 
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 136cf8fbaf..f8ac811379 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -388,8 +388,8 @@ extern PGDLLIMPORT CkptSortItem *CkptBufferIds;
  */
 /* bufmgr.c */
 extern void WritebackContextInit(WritebackContext *context, int *max_pending);
-extern void IssuePendingWritebacks(WritebackContext *context);
-extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag);
+extern void IssuePendingWritebacks(WritebackContext *wb_context);
+extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context, BufferTag *tag);
 
 /* freelist.c */
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
-- 
2.37.2

v5-0003-Add-writeback-to-pg_stat_io.patchtext/x-patch; charset=US-ASCII; name=v5-0003-Add-writeback-to-pg_stat_io.patchDownload

From 8420944a3b57a292ce56f34a985498844e8dd3c1 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sat, 6 May 2023 11:16:58 -0400
Subject: [PATCH v5 3/3] Add writeback to pg_stat_io

28e626bde00 added the concept of IOOps but neglected to include
writeback operations. ac8d53dae5 added time spent doing these I/O
operations. Without counting writeback, checkpointer write timing in the
log often differed substantially from that in pg_stat_io. To fix this,
add IOOp IOOP_WRITEBACK and track writeback in pg_stat_io.

NOTE: Catalog version must be increased.

Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20230419172326.dhgyo4wrrhulovt6%40awork3.anarazel.de
---
 doc/src/sgml/monitoring.sgml           | 26 +++++++++++++++++++++
 src/backend/catalog/system_views.sql   |  2 ++
 src/backend/storage/buffer/bufmgr.c    | 32 ++++++++++++++++++++------
 src/backend/utils/activity/pgstat_io.c | 14 ++++++-----
 src/backend/utils/adt/pgstatfuncs.c    |  5 ++++
 src/include/catalog/pg_proc.dat        |  6 ++---
 src/include/pgstat.h                   |  3 ++-
 src/include/storage/buf_internals.h    |  5 ++--
 src/test/regress/expected/rules.out    |  4 +++-
 src/test/regress/expected/stats.out    |  4 ++--
 src/test/regress/sql/stats.sql         |  4 ++--
 11 files changed, 81 insertions(+), 24 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 99f7f95c39..cf13ade205 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3867,6 +3867,32 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>writebacks</structfield> <type>bigint</type>
+       </para>
+       <para>
+        Number of units of size <varname>op_bytes</varname> which the process
+        requested the kernel write out to permanent storage.
+       </para>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>writeback_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in writeback operations in milliseconds (if
+        <xref linkend="guc-track-io-timing"/> is enabled, otherwise zero). This
+        includes the time spent queueing write-out requests and, potentially,
+        the time spent to write out the dirty data.
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 48aacf66ee..d0c932ad0e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1131,6 +1131,8 @@ SELECT
        b.read_time,
        b.writes,
        b.write_time,
+       b.writebacks,
+       b.writeback_time,
        b.extends,
        b.extend_time,
        b.op_bytes,
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 041d5dc129..3952476d15 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1685,7 +1685,7 @@ again:
 		FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
 		LWLockRelease(content_lock);
 
-		ScheduleBufferTagForWriteback(&BackendWritebackContext,
+		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
 									  &buf_hdr->tag);
 	}
 
@@ -2725,8 +2725,11 @@ BufferSync(int flags)
 		CheckpointWriteDelay(flags, (double) num_processed / num_to_scan);
 	}
 
-	/* issue all pending flushes */
-	IssuePendingWritebacks(&wb_context);
+	/*
+	 * Issue all pending flushes. Only checkpointer calls BufferSync(), so
+	 * IOContext will always be IOCONTEXT_NORMAL.
+	 */
+	IssuePendingWritebacks(&wb_context, IOCONTEXT_NORMAL);
 
 	pfree(per_ts_stat);
 	per_ts_stat = NULL;
@@ -3110,7 +3113,11 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
 
 	UnpinBuffer(bufHdr);
 
-	ScheduleBufferTagForWriteback(wb_context, &tag);
+	/*
+	 * SyncOneBuffer() is only called by checkpointer and bgwriter, so
+	 * IOContext will always be IOCONTEXT_NORMAL.
+	 */
+	ScheduleBufferTagForWriteback(wb_context, IOCONTEXT_NORMAL, &tag);
 
 	return result | BUF_WRITTEN;
 }
@@ -5445,7 +5452,8 @@ WritebackContextInit(WritebackContext *context, int *max_pending)
  * Add buffer to list of pending writeback requests.
  */
 void
-ScheduleBufferTagForWriteback(WritebackContext *wb_context, BufferTag *tag)
+ScheduleBufferTagForWriteback(WritebackContext *wb_context, IOContext io_context,
+							  BufferTag *tag)
 {
 	PendingWriteback *pending;
 
@@ -5471,7 +5479,7 @@ ScheduleBufferTagForWriteback(WritebackContext *wb_context, BufferTag *tag)
 	 * is now disabled.
 	 */
 	if (wb_context->nr_pending >= *wb_context->max_pending)
-		IssuePendingWritebacks(wb_context);
+		IssuePendingWritebacks(wb_context, io_context);
 }
 
 #define ST_SORT sort_pending_writebacks
@@ -5489,8 +5497,9 @@ ScheduleBufferTagForWriteback(WritebackContext *wb_context, BufferTag *tag)
  * error out - it's just a hint.
  */
 void
-IssuePendingWritebacks(WritebackContext *wb_context)
+IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context)
 {
+	instr_time	io_start;
 	int			i;
 
 	if (wb_context->nr_pending == 0)
@@ -5502,6 +5511,8 @@ IssuePendingWritebacks(WritebackContext *wb_context)
 	 */
 	sort_pending_writebacks(wb_context->pending_writebacks, wb_context->nr_pending);
 
+	io_start = pgstat_prepare_io_time();
+
 	/*
 	 * Coalesce neighbouring writes, but nothing else. For that we iterate
 	 * through the, now sorted, array of pending flushes, and look forward to
@@ -5555,6 +5566,13 @@ IssuePendingWritebacks(WritebackContext *wb_context)
 		smgrwriteback(reln, BufTagGetForkNum(&tag), tag.blockNum, nblocks);
 	}
 
+	/*
+	 * Assume that writeback requests are only issued for buffers containing
+	 * blocks of permanent relations.
+	 */
+	pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
+							IOOP_WRITEBACK, io_start, wb_context->nr_pending);
+
 	wb_context->nr_pending = 0;
 }
 
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 25735eb6c0..eb7d35d422 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -424,6 +424,14 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 		 bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
 		return false;
 
+	/*
+	 * Temporary tables are not logged and thus do not require fsync'ing.
+	 * Writeback is not requested for temporary tables.
+	 */
+	if (io_object == IOOBJECT_TEMP_RELATION &&
+		(io_op == IOOP_FSYNC || io_op == IOOP_WRITEBACK))
+		return false;
+
 	/*
 	 * Some IOOps are not valid in certain IOContexts and some IOOps are only
 	 * valid in certain contexts.
@@ -448,12 +456,6 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 	if (strategy_io_context && io_op == IOOP_FSYNC)
 		return false;
 
-	/*
-	 * Temporary tables are not logged and thus do not require fsync'ing.
-	 */
-	if (io_context == IOCONTEXT_NORMAL &&
-		io_object == IOOBJECT_TEMP_RELATION && io_op == IOOP_FSYNC)
-		return false;
 
 	return true;
 }
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 0d57b6a7c0..70da0a2de1 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1268,6 +1268,8 @@ typedef enum io_stat_col
 	IO_COL_READ_TIME,
 	IO_COL_WRITES,
 	IO_COL_WRITE_TIME,
+	IO_COL_WRITEBACKS,
+	IO_COL_WRITEBACK_TIME,
 	IO_COL_EXTENDS,
 	IO_COL_EXTEND_TIME,
 	IO_COL_CONVERSION,
@@ -1303,6 +1305,8 @@ pgstat_get_io_op_index(IOOp io_op)
 			return IO_COL_REUSES;
 		case IOOP_WRITE:
 			return IO_COL_WRITES;
+		case IOOP_WRITEBACK:
+			return IO_COL_WRITEBACKS;
 	}
 
 	elog(ERROR, "unrecognized IOOp value: %d", io_op);
@@ -1322,6 +1326,7 @@ pgstat_get_io_time_index(IOOp io_op)
 	{
 		case IOOP_READ:
 		case IOOP_WRITE:
+		case IOOP_WRITEBACK:
 		case IOOP_EXTEND:
 		case IOOP_FSYNC:
 			return pgstat_get_io_op_index(io_op) + 1;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b2bc81b15f..e766b9b600 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5760,9 +5760,9 @@
   proname => 'pg_stat_get_io', provolatile => 'v',
   prorows => '30', proretset => 't',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{text,text,text,int8,float8,int8,float8,int8,float8,int8,int8,int8,int8,int8,float8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{backend_type,object,context,reads,read_time,writes,write_time,extends,extend_time,op_bytes,hits,evictions,reuses,fsyncs,fsync_time,stats_reset}',
+  proallargtypes => '{text,text,text,int8,float8,int8,float8,int8,float8,int8,float8,int8,int8,int8,int8,int8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{backend_type,object,context,reads,read_time,writes,write_time,writebacks,writeback_time,extends,extend_time,op_bytes,hits,evictions,reuses,fsyncs,fsync_time,stats_reset}',
   prosrc => 'pg_stat_get_io' },
 
 { oid => '1136', descr => 'statistics: information about WAL activity',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index fff4ad5b6d..57a2c0866a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -300,9 +300,10 @@ typedef enum IOOp
 	IOOP_READ,
 	IOOP_REUSE,
 	IOOP_WRITE,
+	IOOP_WRITEBACK,
 } IOOp;
 
-#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+#define IOOP_NUM_TYPES (IOOP_WRITEBACK + 1)
 
 typedef struct PgStat_BktypeIO
 {
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index f8ac811379..30807d5d97 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -388,8 +388,9 @@ extern PGDLLIMPORT CkptSortItem *CkptBufferIds;
  */
 /* bufmgr.c */
 extern void WritebackContextInit(WritebackContext *context, int *max_pending);
-extern void IssuePendingWritebacks(WritebackContext *wb_context);
-extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context, BufferTag *tag);
+extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
+extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
+										  IOContext io_context, BufferTag *tag);
 
 /* freelist.c */
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 69957687fe..0bd6db2bf8 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1887,6 +1887,8 @@ pg_stat_io| SELECT backend_type,
     read_time,
     writes,
     write_time,
+    writebacks,
+    writeback_time,
     extends,
     extend_time,
     op_bytes,
@@ -1896,7 +1898,7 @@ pg_stat_io| SELECT backend_type,
     fsyncs,
     fsync_time,
     stats_reset
-   FROM pg_stat_get_io() b(backend_type, object, context, reads, read_time, writes, write_time, extends, extend_time, op_bytes, hits, evictions, reuses, fsyncs, fsync_time, stats_reset);
+   FROM pg_stat_get_io() b(backend_type, object, context, reads, read_time, writes, write_time, writebacks, writeback_time, extends, extend_time, op_bytes, hits, evictions, reuses, fsyncs, fsync_time, stats_reset);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index d4e4889e7b..b836d70775 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1386,7 +1386,7 @@ SELECT pg_stat_have_stats('io', 0, 0);
  t
 (1 row)
 
-SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(hits) AS io_stats_pre_reset
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_pre_reset
   FROM pg_stat_io \gset
 SELECT pg_stat_reset_shared('io');
  pg_stat_reset_shared 
@@ -1394,7 +1394,7 @@ SELECT pg_stat_reset_shared('io');
  
 (1 row)
 
-SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(hits) AS io_stats_post_reset
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_post_reset
   FROM pg_stat_io \gset
 SELECT :io_stats_post_reset < :io_stats_pre_reset;
  ?column? 
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 7ba341314f..ec0b98687e 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -696,10 +696,10 @@ SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_ext
 
 -- Test IO stats reset
 SELECT pg_stat_have_stats('io', 0, 0);
-SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(hits) AS io_stats_pre_reset
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_pre_reset
   FROM pg_stat_io \gset
 SELECT pg_stat_reset_shared('io');
-SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(hits) AS io_stats_post_reset
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_post_reset
   FROM pg_stat_io \gset
 SELECT :io_stats_post_reset < :io_stats_pre_reset;
 
-- 
2.37.2

#23

jkatz@postgresql.org

over 2 years ago

In reply to: Melanie Plageman (#22)

Re: pg_stat_io not tracking smgrwriteback() is confusing

On 5/6/23 1:30 PM, Melanie Plageman wrote:

I've done that in the attached v5.

[RMT hat]

RMT nudge on this thread, as we're approaching the Beta 1 cutoff. From
the above discussion, it sounds like it's pretty close to being ready.

Thanks,

Jonathan

#24

andres@anarazel.de

over 2 years ago

In reply to: Jonathan S. Katz (#23)

Re: pg_stat_io not tracking smgrwriteback() is confusing

Hi,

On 2023-05-16 10:30:27 -0400, Jonathan S. Katz wrote:

On 5/6/23 1:30 PM, Melanie Plageman wrote:

I've done that in the attached v5.

[RMT hat]

RMT nudge on this thread, as we're approaching the Beta 1 cutoff. From the
above discussion, it sounds like it's pretty close to being ready.

Thanks for the nudge. I just pushed the changes, with some very minor changes
(a newline, slight changes in commit messages).

I'll go and mark the item as closed.

Greetings,

Andres Freund

#25