EXPLAIN BUFFERS and I/O timing accounting questions

Started by Maciek Sakrejda · about 6 years ago · 6 messages
#1 Maciek Sakrejda
m.sakrejda@gmail.com

Hello,

I ran across an EXPLAIN plan and had some questions about some of its
details. The BUFFERS docs say

The number of blocks shown for an upper-level node includes those used by all its child nodes.

I initially assumed this would be cumulative, but I realized it's probably
not because some of the blocks affected by each child will actually
overlap. But this particular plan has a Shared Hit Blocks at the root (an
Aggregate) that is smaller than some of its children (three ModifyTables
and a CTE Scan). This seems to contradict the documentation (since if
children overlap fully in their buffer usage, the parent should still have
a cost equal to the costliest child)--any idea what's up? I can send the
whole plan (attached? inline? it's ~15kb) if that helps.

I also noticed the I/O Read Time (from track_io_timing) of two children in
this plan is equal to the I/O Read Time in the root. Is I/O time
potentially fully parallelized across children? There are no parallel
workers according to the plan, so I'm surprised at this and would like to
understand better.
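
(For context, the I/O timings here come from running EXPLAIN with track_io_timing enabled; a minimal, made-up example that produces them looks roughly like this - the timings only show up as nonzero when blocks are actually read rather than found in shared buffers:)

SET track_io_timing = on;

-- made-up table, only for illustration
CREATE TABLE io_demo AS SELECT g AS id FROM generate_series(1, 1000000) g;

EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) SELECT count(*) FROM io_demo;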

Also, a tangential question: why is the top-level structure of a JSON plan
an array? I've only ever seen one root node with a Plan key there.

Thanks,
Maciek

#2 Maciek Sakrejda
m.sakrejda@gmail.com
In reply to: Maciek Sakrejda (#1)
Re: EXPLAIN BUFFERS and I/O timing accounting questions

Also, I noticed that in this plan, the root (again, an Aggregate) has 0
Temp Read Blocks, but two of its children (two of the ModifyTable nodes)
have non-zero Temp Read Blocks. Again, this contradicts the documentation,
as these costs are stated to be cumulative. Any ideas?

Thanks,
Maciek

#3 Andres Freund
andres@anarazel.de
In reply to: Maciek Sakrejda (#1)
Re: EXPLAIN BUFFERS and I/O timing accounting questions

Hi,

On 2019-10-21 23:18:32 -0700, Maciek Sakrejda wrote:

I ran across an EXPLAIN plan and had some questions about some of its
details. The BUFFERS docs say

The number of blocks shown for an upper-level node includes those used by all its child nodes.

I initially assumed this would be cumulative, but I realized it's probably
not because some of the blocks affected by each child will actually
overlap.

Note that the buffer access stats do *not* count the number of distinct
buffers accessed; they purely count the number of buffer accesses.

It'd be really expensive to count the number of distinct buffers
accessed, although I guess one could make it less expensive by using
something like HyperLogLog (although that will still be hard, due to
buffer replacement etc.).

But this particular plan has a Shared Hit Blocks at the root (an
Aggregate) that is smaller than some of its children (three ModifyTables
and a CTE Scan).

Do you have an example? I assume what's going on is that the cost of
the CTE is actually attributed (in equal parts or something like that)
to all places using the CTE. Do the numbers add up if you just exclude
the CTE?

This seems to contradict the documentation (since if
children overlap fully in their buffer usage, the parent should still have
a cost equal to the costliest child)--any idea what's up? I can send the
whole plan (attached? inline? it's ~15kb) if that helps.

Or just relevant top-level excerpts.

Also, a tangential question: why is the top-level structure of a JSON plan
an array? I've only ever seen one root node with a Plan key there.

IIRC one can get multiple plans when there's a DO ALSO rule. There might
be other ways to get there too.
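
A quick, untested sketch of that case (table and rule names made up): an ON INSERT rule with DO ALSO rewrites the statement into two queries, so the top-level JSON array should contain two plan entries.

CREATE TABLE t_main (id int);
CREATE TABLE t_audit (id int);

CREATE RULE t_main_audit AS ON INSERT TO t_main
    DO ALSO INSERT INTO t_audit VALUES (NEW.id);

-- One element per rewritten query in the top-level JSON array
EXPLAIN (FORMAT JSON) INSERT INTO t_main VALUES (1);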

Greetings,

Andres Freund

#4 Maciek Sakrejda
m.sakrejda@gmail.com
In reply to: Andres Freund (#3)
Re: EXPLAIN BUFFERS and I/O timing accounting questions

On Thu, Oct 24, 2019 at 2:25 PM Andres Freund <andres@anarazel.de> wrote:

Note that the buffer access stats do *not* count the number of distinct
buffers accessed; they purely count the number of buffer accesses.

You mean, even within a single node? That is, if a node hits a block ten
times, that counts as ten blocks hit? And if it reads a block and then
needs it three more times, that's one read plus three hits?

Do you have an example?

Sure, here's the "abridged" plan:

[{ "Plan": {
"Node Type": "Aggregate",
"Plan Rows": 1,
"Plan Width": 8,
"Total Cost": 26761745.14,
"Actual Rows": 1,
"I/O Read Time": 234129.299,
"I/O Write Time": 0,
"Local Hit Blocks": 0,
"Local Read Blocks": 0,
"Local Dirtied Blocks": 0,
"Local Written Blocks": 0,
"Shared Hit Blocks": 4847762,
"Shared Read Blocks": 1626312,
"Shared Dirtied Blocks": 541014,
"Shared Written Blocks": 0,
"Temp Read Blocks": 0,
"Temp Written Blocks": 4786,
"Plans": [
{
"Node Type": "ModifyTable",
"Operation": "Delete",
"Parent Relationship": "InitPlan",
"Plan Rows": 13943446,
"Plan Width": 6,
"Total Cost": 25774594.63,
"Actual Rows": 2178416,
"I/O Read Time": 234129.299,
"I/O Write Time": 0,
"Local Hit Blocks": 0,
"Local Read Blocks": 0,
"Local Dirtied Blocks": 0,
"Local Written Blocks": 0,
"Shared Hit Blocks": 4847762,
"Shared Read Blocks": 1626312,
"Shared Dirtied Blocks": 541014,
"Shared Written Blocks": 0,
"Temp Read Blocks": 0,
"Temp Written Blocks": 0,
"Plans": "<elided>"
},
{
"Node Type": "ModifyTable",
"Operation": "Delete",
"Parent Relationship": "InitPlan",
"Plan Rows": 63897788,
"Plan Width": 38,
"Total Cost": 315448.53,
"Actual Rows": 0,
"I/O Read Time": 30529.231,
"I/O Write Time": 0,
"Local Hit Blocks": 0,
"Local Read Blocks": 0,
"Local Dirtied Blocks": 0,
"Local Written Blocks": 0,
"Shared Hit Blocks": 12964205,
"Shared Read Blocks": 83260,
"Shared Dirtied Blocks": 48256,
"Shared Written Blocks": 0,
"Temp Read Blocks": 4788,
"Temp Written Blocks": 0,
"Plans": "<elided>"
},
{
"Node Type": "ModifyTable",
"Operation": "Delete",
"Parent Relationship": "InitPlan",
"Plan Rows": 45657680,
"Plan Width": 38,
"Total Cost": 357974.43,
"Actual Rows": 0,
"I/O Read Time": 24260.512,
"I/O Write Time": 0,
"Local Hit Blocks": 0,
"Local Read Blocks": 0,
"Local Dirtied Blocks": 0,
"Local Written Blocks": 0,
"Shared Hit Blocks": 10521264,
"Shared Read Blocks": 64450,
"Shared Dirtied Blocks": 36822,
"Shared Written Blocks": 0,
"Temp Read Blocks": 4788,
"Temp Written Blocks": 1,
"Plans": "<elided>"
},
{
"Node Type": "CTE Scan",
"Parent Relationship": "Outer",
"Plan Rows": 13943446,
"Plan Width": 8,
"Total Cost": 278868.92,
"Actual Rows": 2178416,
"I/O Read Time": 234129.299,
"I/O Write Time": 0,
"Local Hit Blocks": 0,
"Local Read Blocks": 0,
"Local Dirtied Blocks": 0,
"Local Written Blocks": 0,
"Shared Hit Blocks": 4847762,
"Shared Read Blocks": 1626312,
"Shared Dirtied Blocks": 541014,
"Shared Written Blocks": 0,
"Temp Read Blocks": 0,
"Temp Written Blocks": 4786
}
]
}}]

Let me know if I removed anything I shouldn't have and I can follow up with
extra info.

I assume what's going on is that the cost of
the CTE is actually attributed (in equal parts or something like that)
to all places using the CTE. Do the numbers add up if you just exclude
the CTE?

Not really--it looks like the full Shared Hit Blocks count in the root is
the same as the CTE by itself. This is from playing around with the plan in
a Node.js console:

> p[0].Plan['Shared Hit Blocks']
4847762

> p[0].Plan.Plans.map(p => p['Node Type'])
[ 'ModifyTable', 'ModifyTable', 'ModifyTable', 'CTE Scan' ]

> p[0].Plan.Plans.map(p => p['Shared Hit Blocks'])
[ 4847762, 12964205, 10521264, 4847762 ]

IIRC one can get multiple plans when there's a DO ALSO rule. There might
be other ways to get there too.

Thanks, good to know.

#5 Andres Freund
andres@anarazel.de
In reply to: Maciek Sakrejda (#4)
Re: EXPLAIN BUFFERS and I/O timing accounting questions

Hi,

On 2019-10-24 16:31:39 -0700, Maciek Sakrejda wrote:

On Thu, Oct 24, 2019 at 2:25 PM Andres Freund <andres@anarazel.de> wrote:

Note that the buffer access stats do *not* count the number of distinct
buffers accessed; they purely count the number of buffer accesses.

You mean, even within a single node? That is, if a node hits a block ten
times, that counts as ten blocks hit? And if it reads a block and then
needs it three more times, that's one read plus three hits?

Correct. It's basically the number of lookups in the buffer pool. There
are some nodes that will kind of repeatedly use the same buffer without
increasing the count, e.g. a seqscan will keep the current page pinned
until all the tuples on the page have been returned.

Consider e.g. a nested loop indexscan - how would we determine whether
we've previously looked at a buffer within the indexscan, without
drastically increasing the resources required?
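
As a rough illustration of the per-access counting (table name made up; the exact plan shape depends on the planner), repeatedly probing a tiny table yields a shared hit count far larger than the number of blocks the table occupies:

CREATE TABLE lookup (id int PRIMARY KEY, val text);
INSERT INTO lookup SELECT g, 'x' FROM generate_series(1, 100) g;
ANALYZE lookup;

-- The correlated subquery runs once per outer row, so the same handful of
-- index and heap blocks is counted again on every probe.
EXPLAIN (ANALYZE, BUFFERS)
SELECT sum((SELECT id FROM lookup WHERE id = (g % 100) + 1))
FROM generate_series(1, 10000) g;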

Do you have an example?

Sure, here's the "abridged" plan:

[{ "Plan": {
"Node Type": "Aggregate",
"Plan Rows": 1,
"Plan Width": 8,
"Total Cost": 26761745.14,
"Actual Rows": 1,
"I/O Read Time": 234129.299,
"I/O Write Time": 0,
"Local Hit Blocks": 0,
"Local Read Blocks": 0,
"Local Dirtied Blocks": 0,
"Local Written Blocks": 0,
"Shared Hit Blocks": 4847762,
"Shared Read Blocks": 1626312,
"Shared Dirtied Blocks": 541014,
"Shared Written Blocks": 0,
"Temp Read Blocks": 0,
"Temp Written Blocks": 4786,
"Plans": [
{
"Node Type": "ModifyTable",
"Operation": "Delete",
"Parent Relationship": "InitPlan",
"Plan Rows": 13943446,
"Plan Width": 6,
"Total Cost": 25774594.63,
"Actual Rows": 2178416,
"I/O Read Time": 234129.299,
"I/O Write Time": 0,
"Local Hit Blocks": 0,
"Local Read Blocks": 0,
"Local Dirtied Blocks": 0,
"Local Written Blocks": 0,
"Shared Hit Blocks": 4847762,
"Shared Read Blocks": 1626312,
"Shared Dirtied Blocks": 541014,
"Shared Written Blocks": 0,
"Temp Read Blocks": 0,
"Temp Written Blocks": 0,
"Plans": "<elided>"
},

...

I think this may be partially confusing due to the way the JSON output
looks, which is so bad that it's imo fair to call it a bug. Here's the text
output for a similar-ish query:

Aggregate (cost=112.50..112.51 rows=1 width=8) (actual time=35.893..35.894 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=6015 dirtied=15
  CTE foo
    ->  Delete on public.p (cost=0.00..45.00 rows=3000 width=6) (actual time=0.235..28.239 rows=3000 loops=1)
          Output: p.data
          Delete on public.p
          Delete on public.c1
          Delete on public.c2
          Buffers: shared hit=6015 dirtied=15
          ->  Seq Scan on public.p (cost=0.00..15.00 rows=1000 width=6) (actual time=0.161..1.375 rows=1000 loops=1)
                Output: p.ctid
                Buffers: shared hit=5 dirtied=5
          ->  Seq Scan on public.c1 (cost=0.00..15.00 rows=1000 width=6) (actual time=0.147..1.314 rows=1000 loops=1)
                Output: c1.ctid
                Buffers: shared hit=5 dirtied=5
          ->  Seq Scan on public.c2 (cost=0.00..15.00 rows=1000 width=6) (actual time=0.145..1.170 rows=1000 loops=1)
                Output: c2.ctid
                Buffers: shared hit=5 dirtied=5
  ->  CTE Scan on foo (cost=0.00..60.00 rows=3000 width=0) (actual time=0.243..34.083 rows=3000 loops=1)
        Output: foo.data
        Buffers: shared hit=6015 dirtied=15
Planning Time: 0.508 ms
Execution Time: 36.512 ms
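
(For reference, a setup along these lines should produce a plan of roughly this shape - the actual tables aren't shown here, so the names and row counts are guesses: an inheritance parent p with children c1 and c2, so that a single DELETE fans out to three result relations.)

CREATE TABLE p (data text);
CREATE TABLE c1 () INHERITS (p);
CREATE TABLE c2 () INHERITS (p);
INSERT INTO p  SELECT 'x' FROM generate_series(1, 1000);
INSERT INTO c1 SELECT 'x' FROM generate_series(1, 1000);
INSERT INTO c2 SELECT 'x' FROM generate_series(1, 1000);

EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
WITH foo AS (DELETE FROM p RETURNING data)
SELECT count(*) FROM foo;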

Note that the node below the Aggregate is actually the CTE, and that the
DELETEs are below that. But the JSON, slightly abbreviated, looks like:

[
  {
    "Plan": {
      "Node Type": "Aggregate",
      "Strategy": "Plain",
      "Shared Hit Blocks": 6015,
      "Shared Read Blocks": 0,
      "Shared Dirtied Blocks": 15,
      "Shared Written Blocks": 0,
      "Plans": [
        {
          "Node Type": "ModifyTable",
          "Operation": "Delete",
          "Parent Relationship": "InitPlan",
          "Subplan Name": "CTE foo",
          "Output": ["p.data"],
          "Target Tables": [
            {
              "Relation Name": "p",
              "Schema": "public",
              "Alias": "p"
            },
            {
              "Relation Name": "c1",
              "Schema": "public",
              "Alias": "c1"
            },
            {
              "Relation Name": "c2",
              "Schema": "public",
              "Alias": "c2"
            }
          ],
          "Shared Hit Blocks": 6015,
          "Shared Read Blocks": 0,
          "Shared Dirtied Blocks": 15,
          "Shared Written Blocks": 0,
          "Plans": [
            {
              "Node Type": "Seq Scan",
              "Parent Relationship": "Member",
              "Output": ["p.ctid"],
              "Shared Hit Blocks": 5,
              "Shared Read Blocks": 0,
              "Shared Dirtied Blocks": 5
            },
            {
              "Node Type": "Seq Scan",
              "Parent Relationship": "Member",
              "Parallel Aware": false,
              "Relation Name": "c1",
              "Schema": "public",
              "Shared Hit Blocks": 5,
              "Shared Read Blocks": 0,
              "Shared Dirtied Blocks": 5,
              "Shared Written Blocks": 0,
              "Local Hit Blocks": 0
            },
            {
              "Node Type": "Seq Scan",
              "Parent Relationship": "Member",
              "Shared Hit Blocks": 5,
              "Shared Read Blocks": 0,
              "Shared Dirtied Blocks": 5,
              "Shared Written Blocks": 0
            }
          ]
        },
        {
          "Node Type": "CTE Scan",
          "Parent Relationship": "Outer",
          "Parallel Aware": false,
          "CTE Name": "foo",
          "Alias": "foo",
          "Startup Cost": 0.00,
          "Total Cost": 60.00,
          "Plan Rows": 3000,
          "Plan Width": 0,
          "Actual Startup Time": 0.258,
          "Actual Total Time": 12.737,
          "Actual Rows": 3000,
          "Actual Loops": 1,
          "Output": ["foo.data"],
          "Shared Hit Blocks": 6015,
          "Shared Read Blocks": 0,
          "Shared Dirtied Blocks": 15,
          "Shared Written Blocks": 0
        }
      ]
    }
  }
]

But I still don't quite get how the IO adds up in your case.

Perhaps you could send me the full plan and query privately? And, if you
have access to that, the plain text explain?

Greetings,

Andres Freund

#6 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#5)
Re: EXPLAIN BUFFERS and I/O timing accounting questions

Hi,

Migrating to -hackers, as this seems clearly suboptimal and confusing.

The original thread is at
/messages/by-id/20191025003834.2rswu7smheaddag3@alap3.anarazel.de

On 2019-10-24 17:38:34 -0700, Andres Freund wrote:

Perhaps you could send me the full plan and query privately? And, if you
have access to that, the plain text explain?

Maciek did, and I think I see what's going on...

(I asked whether it's ok to include the query)
On 2019-10-24 18:58:27 -0700, Maciek Sakrejda wrote:

Thanks for your help. Here is the query text:

WITH q AS (
  DELETE FROM queries WHERE last_occurred_at < now() - '60 days'::interval
  RETURNING queries.id
),
t1 AS (DELETE FROM query_fingerprints WHERE query_id IN (SELECT id FROM q)),
t2 as (DELETE FROM query_table_associations WHERE query_id IN
       (SELECT id FROM q))
SELECT COUNT(id) FROM q

Note that the t1 and t2 CTEs are not referenced in the query
itself. Normally that would mean they're simply not evaluated - but
for CTEs that include DML we force evaluation, as the result would
otherwise be inconsistent.

But that forcing happens not from within the normal query (the SELECT
COUNT(id) FROM q), but from the main executor. As the attribution of
child executor nodes to the layer above happens when the execution of a
node ends (see ExecProcNodeInstr()), the work of completing wCTEs is not
associated with any single node.

void
standard_ExecutorFinish(QueryDesc *queryDesc)
...
    /* Run ModifyTable nodes to completion */
    ExecPostprocessPlan(estate);

static void
ExecPostprocessPlan(EState *estate)
..
    /*
     * Run any secondary ModifyTable nodes to completion, in case the main
     * query did not fetch all rows from them. (We do this to ensure that
     * such nodes have predictable results.)
     */
    foreach(lc, estate->es_auxmodifytables)
    ...
        slot = ExecProcNode(ps);

In contrast to e.g. AFTER triggers, which also get executed at the end
of the query, we do not associate the cost of running this post-processing
work with any superior node.

That is quite confusing, because the DML nodes do look like
they're subsidiary to another node.
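
A minimal sketch of how to observe this (table names made up): the t1 DELETE below is never read by the outer query, so it is run to completion in ExecutorFinish(), and its buffer counts appear on the ModifyTable node but are not included in the root Aggregate's totals.

CREATE TABLE q_tbl (id int PRIMARY KEY);
CREATE TABLE f_tbl (query_id int);
INSERT INTO q_tbl SELECT g FROM generate_series(1, 1000) g;
INSERT INTO f_tbl SELECT g FROM generate_series(1, 1000) g;

EXPLAIN (ANALYZE, BUFFERS)
WITH q AS (DELETE FROM q_tbl RETURNING id),
     t1 AS (DELETE FROM f_tbl WHERE query_id IN (SELECT id FROM q))
SELECT count(id) FROM q;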

I think we ought to either attribute the cost of ExecPostprocessPlan()
to a separate instrumentation that we display at the end of the explain
plan when not zero, or at least associate the cost with the parent node.

Greetings,

Andres Freund