doc: explain pgstatindex fragmentation
Hi, I thought it would be nice to give the user a better idea of what
avg_leaf_density and leaf_fragmentation mean.
Patch attached. What do you think?
Attachments:
0001-doc-explain-pgstatindex-fragmentation.patchtext/x-patch; charset=UTF-8; name=0001-doc-explain-pgstatindex-fragmentation.patchDownload
From 5c82c207776fb8cde2357081b8579ba6db195c06 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Fr=C3=A9d=C3=A9ric=20Yhuel?= <frederic.yhuel@dalibo.com>
Date: Tue, 5 Nov 2024 17:59:44 +0100
Subject: [PATCH] doc: explain pgstatindex fragmentation
It was quite hard to guess what leaf_fragmentation meant without looking
at pgstattuple's code. This patch aims to give to the user a better
idea of what it means.
---
doc/src/sgml/pgstattuple.sgml | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/doc/src/sgml/pgstattuple.sgml b/doc/src/sgml/pgstattuple.sgml
index 4071da4ed9..8908b56663 100644
--- a/doc/src/sgml/pgstattuple.sgml
+++ b/doc/src/sgml/pgstattuple.sgml
@@ -277,6 +277,14 @@ leaf_fragmentation | 0
page-by-page, and should not be expected to represent an
instantaneous snapshot of the whole index.
</para>
+
+ <para>
+ <literal>avg_leaf_density</literal> can be seen as the inverse of bloat,
+ while <literal>leaf_fragmentation</literal> represents a measure of disorder.
+ The higher <literal>leaf_fragmentation</literal> is, the less the physical
+ order of the index leaf pages corresponds to the logical order we would have
+ just after a <command>REINDEX</command>.
+ </para>
</listitem>
</varlistentry>
--
2.39.5
Hi,
On Tue, Nov 05, 2024 at 06:36:47PM +0100, Frédéric Yhuel wrote:
Hi, I thought it would be nice to give the user a better idea of what
avg_leaf_density and leaf_fragmentation mean.Patch attached. What do you think?
Yeah, I think that can not hurt to give more details, thanks for the proposal!
A few comments:
=== 1
+ <literal>avg_leaf_density</literal> can be seen as the inverse of bloat,
I'm not sure it's good to describe something as the inverse of "something
else". See my proposal below.
=== 2
I’m not sure we need to add the extra details in a paragraph below the fields
description. What about changing the fields description?
Something concise enough like?
avg_leaf_density: shows how full leaf pages currently are (100 if full)
leaf_fragmentation: shows how much physical and logical ordering of leaf pages
differ (zero if they don't)
Also the comments made in [1]/messages/by-id/4c5dee3a-8381-4e0f-b882-d1bd950e8972@dalibo.com, [2]/messages/by-id/c70fcc72-eed6-475b-81c8-508422299351@dalibo.com and [3]/messages/by-id/e8a6db36-073e-4ca3-b38c-b42d7094cba8@dalibo.com are not linked to this main thread,
adding them for reference here (but better to keep the conversation going
by replying to this email).
[1]: /messages/by-id/4c5dee3a-8381-4e0f-b882-d1bd950e8972@dalibo.com
[2]: /messages/by-id/c70fcc72-eed6-475b-81c8-508422299351@dalibo.com
[3]: /messages/by-id/e8a6db36-073e-4ca3-b38c-b42d7094cba8@dalibo.com
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 1/22/25 12:34, Bertrand Drouvot wrote:
Hi,
On Tue, Nov 05, 2024 at 06:36:47PM +0100, Frédéric Yhuel wrote:
Hi, I thought it would be nice to give the user a better idea of what
avg_leaf_density and leaf_fragmentation mean.Patch attached. What do you think?
Yeah, I think that can not hurt to give more details, thanks for the proposal!
Hi Bertrand, thanks for your review!
A few comments:
=== 1
+ <literal>avg_leaf_density</literal> can be seen as the inverse of bloat,
I'm not sure it's good to describe something as the inverse of "something
else". See my proposal below.
Yeah... bloat is a more familiar concept, so I wanted to link these two
metrics... but "inverse" is confusing... or maybe something like that:
A small <literal>avg_leaf_density</literal> means that the index is bloated.
=== 2
I’m not sure we need to add the extra details in a paragraph below the fields
description. What about changing the fields description?Something concise enough like?
avg_leaf_density: shows how full leaf pages currently are (100 if full)
That should do :-)
leaf_fragmentation: shows how much physical and logical ordering of leaf pages
differ (zero if they don't)
It looks good to me.
I've noticed that maximum leaf_fragmentation can have a huge impact on a
range index-only scan, when reading all blocs from disks, even on my
laptop machine with SSD, but I don't know if this is the right place to
document this?
I used the following psql scripts to test the effect of
leaf_fragmentation (the first one calls the second one):
https://github.com/dalibo/misc/blob/main/fyhuel/leaf_fragmentation.sql
https://github.com/dalibo/misc/blob/main/fyhuel/evict_from_both_caches.sql
Also the comments made in [1], [2] and [3] are not linked to this main thread,
adding them for reference here (but better to keep the conversation going
by replying to this email).[1]: /messages/by-id/4c5dee3a-8381-4e0f-b882-d1bd950e8972@dalibo.com
[2]: /messages/by-id/c70fcc72-eed6-475b-81c8-508422299351@dalibo.com
[3]: /messages/by-id/e8a6db36-073e-4ca3-b38c-b42d7094cba8@dalibo.com
Indeed, I think Benoît mistakenly thought that thread aggregation was
based on thread titles alone. He appended the second conversation to the
commitfest entry.
On Tue, 2024-11-05 at 18:36 +0100, Frédéric Yhuel wrote:
Hi, I thought it would be nice to give the user a better idea of what
avg_leaf_density and leaf_fragmentation mean.Patch attached. What do you think?
I am all for explaining this better.
Here is my take. I tried to avoid "bloat", since it is jargon that
not everybody might be familiar with. I also didn't start a new
paragraph and kept it together with the explanation for index_size.
Yours,
Laurenz Albe
Attachments:
v2-0001-doc-explain-pgstatindex-fragmentation.patchtext/x-patch; charset=ISO-8859-1; name=v2-0001-doc-explain-pgstatindex-fragmentation.patchDownload
From 9b93682e5e7b882c78130abb2280e655e0ead360 Mon Sep 17 00:00:00 2001
From: Laurenz Albe <laurenz.albe@cybertec.at>
Date: Thu, 23 Jan 2025 14:44:29 +0100
Subject: [PATCH v2] doc: explain pgstatindex fragmentation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
It was quite hard to guess what leaf_fragmentation meant without looking
at pgstattuple's code. This patch aims to give to the user a better
idea of what it means.
Author: Frédéric Yhuel
Author: Laurenz Albe
Reviewed-by: Bertrand Drouvot
Discussion: https://postgr.es/m/bf110561-f774-4957-a890-bb6fab6804e0%40dalibo.com
Discussion: https://postgr.es/m/4c5dee3a-8381-4e0f-b882-d1bd950e8972@dalibo.com
---
doc/src/sgml/pgstattuple.sgml | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/doc/src/sgml/pgstattuple.sgml b/doc/src/sgml/pgstattuple.sgml
index 4071da4ed94..4b26b56930a 100644
--- a/doc/src/sgml/pgstattuple.sgml
+++ b/doc/src/sgml/pgstattuple.sgml
@@ -270,6 +270,14 @@ leaf_fragmentation | 0
page than is accounted for by <literal>internal_pages + leaf_pages +
empty_pages + deleted_pages</literal>, because it also includes the
index's metapage.
+ <literal>avg_leaf_density</literal> is the fraction of the index size that
+ is taken up by user data. Since indexes have a default fillfactor of 90,
+ this should be around 0.9 for newly built indexes, but usually deteriorates
+ over time.
+ <literal>leaf_fragmentation</literal> represents a measure of disorder.
+ The higher <literal>leaf_fragmentation</literal> is, the less the physical
+ order of the index leaf pages corresponds to the logical order it would
+ have just after creation.
</para>
<para>
--
2.48.1
Hi Frédéric,
On Thu, Jan 23, 2025 at 10:00:27AM +0100, Frédéric Yhuel wrote:
On 1/22/25 12:34, Bertrand Drouvot wrote:
I'm not sure it's good to describe something as the inverse of "something
else". See my proposal below.Yeah... bloat is a more familiar concept, so I wanted to link these two
metrics
Yeah but in the (rare?) case "bloat" is not known then one would have to make
sense of it first.
I’m not sure we need to add the extra details in a paragraph below the fields
description. What about changing the fields description?Something concise enough like?
avg_leaf_density: shows how full leaf pages currently are (100 if full)
That should do :-)
Thanks!
leaf_fragmentation: shows how much physical and logical ordering of leaf pages
differ (zero if they don't)It looks good to me.
Thanks!
I've noticed that maximum leaf_fragmentation can have a huge impact on a
range index-only scan, when reading all blocs from disks, even on my laptop
machine with SSD, but I don't know if this is the right place to document
this?
Yeah, that might be worth to mention. Maybe below the descriptions then? (keeping
the changes above in the description).
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 1/24/25 11:47, Bertrand Drouvot wrote:
Hi Frédéric,
On Thu, Jan 23, 2025 at 10:00:27AM +0100, Frédéric Yhuel wrote:
On 1/22/25 12:34, Bertrand Drouvot wrote:
I'm not sure it's good to describe something as the inverse of "something
else". See my proposal below.Yeah... bloat is a more familiar concept, so I wanted to link these two
metricsYeah but in the (rare?) case "bloat" is not known then one would have to make
sense of it first.
OK let's not talk about bloat then :-)
I’m not sure we need to add the extra details in a paragraph below the fields
description. What about changing the fields description?Something concise enough like?
avg_leaf_density: shows how full leaf pages currently are (100 if full)
That should do :-)
Thanks!
I don't know if you noticed Laurenz's suggestion, because he forgot to
CC you, but I like it very much. I think we should mention the default
fillfactor (90 for indexes).
leaf_fragmentation: shows how much physical and logical ordering of leaf pages
differ (zero if they don't)It looks good to me.
Thanks!
I've noticed that maximum leaf_fragmentation can have a huge impact on a
range index-only scan, when reading all blocs from disks, even on my laptop
machine with SSD, but I don't know if this is the right place to document
this?Yeah, that might be worth to mention. Maybe below the descriptions then? (keeping
the changes above in the description).
OK, thanks. I've tried to put it all together, based on v2 patch from
Laurenz. Here is a v3 patch.
(I'm unsure who should be author or reviewer, but I guess the committer
will fix that anyway, if the patch were to be merged).
Attachments:
v3-0001-doc-explain-pgstatindex-fragmentation.patchtext/x-patch; charset=UTF-8; name=v3-0001-doc-explain-pgstatindex-fragmentation.patchDownload
From ddd811b84116c0113b5bd4b0a6b8be2f3de25fa9 Mon Sep 17 00:00:00 2001
From: Laurenz Albe <laurenz.albe@cybertec.at>
Date: Thu, 23 Jan 2025 14:44:29 +0100
Subject: [PATCH v3] doc: explain pgstatindex fragmentation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
It was quite hard to guess what leaf_fragmentation meant without looking
at pgstattuple's code. This patch aims to give to the user a better
idea of what it means.
Author: Frédéric Yhuel
Author: Laurenz Albe
Reviewed-by: Bertrand Drouvot, Benoît Lobréau
Discussion: https://postgr.es/m/bf110561-f774-4957-a890-bb6fab6804e0%40dalibo.com
Discussion: https://postgr.es/m/4c5dee3a-8381-4e0f-b882-d1bd950e8972@dalibo.com
---
doc/src/sgml/pgstattuple.sgml | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/doc/src/sgml/pgstattuple.sgml b/doc/src/sgml/pgstattuple.sgml
index 4071da4ed94..4209e7a7770 100644
--- a/doc/src/sgml/pgstattuple.sgml
+++ b/doc/src/sgml/pgstattuple.sgml
@@ -270,6 +270,15 @@ leaf_fragmentation | 0
page than is accounted for by <literal>internal_pages + leaf_pages +
empty_pages + deleted_pages</literal>, because it also includes the
index's metapage.
+ <literal>avg_leaf_density</literal> is the fraction of the index size that
+ is taken up by user data. Since indexes have a default fillfactor of 90,
+ this should be around 0.9 for newly built indexes, but usually deteriorates
+ over time.
+ <literal>leaf_fragmentation</literal> represents a measure of disorder.
+ A higher <literal>leaf_fragmentation</literal> indicates that the
+ physical order of the index leaf pages increasingly deviates from their
+ logical order. This can have a significant impact if a large part
+ of the index is read from disk.
</para>
<para>
--
2.45.2
Hi,
On Fri, Jan 24, 2025 at 12:34:08PM +0100, Fr�d�ric Yhuel wrote:
I don't know if you noticed Laurenz's suggestion, because he forgot to CC
you, but I like it very much. I think we should mention the default
fillfactor (90 for indexes).
Thanks for mentioning Laurenz's suggestion!
=== 1
+ Since indexes have a default fillfactor of 90, this should be around 0.9 for
+ newly built indexes
I think 0.9 should be replaced by 90 (that's the actual kind of output we'd get).
But having said that, I'm not sure we should mention those 90 stuff because it
depends of the amount of data indexed (I mean if the index has a very few
leaf pages, say < 5, then it's easy to be << 90 since it's an average). That's
probably not the majority of indexes though so maybe just nuance the sentence a
bit.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, 2025-01-24 at 13:34 +0000, Bertrand Drouvot wrote:
+ Since indexes have a default fillfactor of 90, this should be around 0.9 for + newly built indexesI think 0.9 should be replaced by 90 (that's the actual kind of output we'd get).
But having said that, I'm not sure we should mention those 90 stuff because it
depends of the amount of data indexed (I mean if the index has a very few
leaf pages, say < 5, then it's easy to be << 90 since it's an average). That's
probably not the majority of indexes though so maybe just nuance the sentence a
bit.
Sorry about the 0.9.
Perhaps the wording could be more careful: ... this should be around 90 for
most newly built indexes of non-neglectable size.
Yours,
Laurenz Albe
On 1/24/25 14:58, Laurenz Albe wrote:
On Fri, 2025-01-24 at 13:34 +0000, Bertrand Drouvot wrote:
+ Since indexes have a default fillfactor of 90, this should be around 0.9 for + newly built indexesI think 0.9 should be replaced by 90 (that's the actual kind of output we'd get).
Damn! I missed that one too...
But having said that, I'm not sure we should mention those 90 stuff because it
depends of the amount of data indexed (I mean if the index has a very few
leaf pages, say < 5, then it's easy to be << 90 since it's an average). That's
probably not the majority of indexes though so maybe just nuance the sentence a
bit.Sorry about the 0.9.
Perhaps the wording could be more careful: ... this should be around 90 for
most newly built indexes of non-neglectable size.
It looks good to me (apart from the typo). v4 attached
Thanks! :-)
Attachments:
v4-0001-doc-explain-pgstatindex-fragmentation.patchtext/x-patch; charset=UTF-8; name=v4-0001-doc-explain-pgstatindex-fragmentation.patchDownload
From d78c787de917068c07dea96d247795fbe256df7d Mon Sep 17 00:00:00 2001
From: Laurenz Albe <laurenz.albe@cybertec.at>
Date: Thu, 23 Jan 2025 14:44:29 +0100
Subject: [PATCH v4] doc: explain pgstatindex fragmentation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
It was quite hard to guess what leaf_fragmentation meant without looking
at pgstattuple's code. This patch aims to give to the user a better
idea of what it means.
Author: Frédéric Yhuel
Author: Laurenz Albe
Reviewed-by: Bertrand Drouvot, Benoît Lobréau
Discussion: https://postgr.es/m/bf110561-f774-4957-a890-bb6fab6804e0%40dalibo.com
Discussion: https://postgr.es/m/4c5dee3a-8381-4e0f-b882-d1bd950e8972@dalibo.com
---
doc/src/sgml/pgstattuple.sgml | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/doc/src/sgml/pgstattuple.sgml b/doc/src/sgml/pgstattuple.sgml
index 4071da4ed94..c747a5818ab 100644
--- a/doc/src/sgml/pgstattuple.sgml
+++ b/doc/src/sgml/pgstattuple.sgml
@@ -270,6 +270,15 @@ leaf_fragmentation | 0
page than is accounted for by <literal>internal_pages + leaf_pages +
empty_pages + deleted_pages</literal>, because it also includes the
index's metapage.
+ <literal>avg_leaf_density</literal> is the fraction of the index size that
+ is taken up by user data. Since indexes have a default fillfactor of 90,
+ this should be around 90 for newly built indexes of non-negligible size,
+ but usually deteriorates over time.
+ <literal>leaf_fragmentation</literal> represents a measure of disorder.
+ A higher <literal>leaf_fragmentation</literal> indicates that the
+ physical order of the index leaf pages increasingly deviates from their
+ logical order. This can have a significant impact if a large part
+ of the index is read from disk.
</para>
<para>
--
2.45.2
On Fri, 2025-01-24 at 15:41 +0100, Frédéric Yhuel wrote:
v4 attached
Looks good to me. I have one question left: the explanation for the performance
penalty of a high leaf fragmentation sounds like it would only be relevant for
disks where sequential reads are faster. If that is correct, perhaps it would be
worth mentioning.
Yours,
Laurenz Albe
On 1/25/25 7:07 PM, Laurenz Albe wrote:
Looks good to me. I have one question left: the explanation for the performance
penalty of a high leaf fragmentation sounds like it would only be relevant for
disks where sequential reads are faster. If that is correct, perhaps it would be
worth mentioning.
Hi Laurenz,
Frédéric is in holiday this week. So he might not be able to answer,
I'll try to do it in his stead.
Frederic noticed a performance hit even for on his laptop with a SSD.
On Fri, 2025-01-24 at 15:41 +0100, Frédéric Yhuel wrote:
I've noticed that maximum leaf_fragmentation can have a huge impact on
a range index-only scan, when reading all blocs from disks, even on my
laptop machine with SSD, but I don't know if this is the right place
to document this?
He reported to our team, that he did a test with two indexes on the same
data. They had the same density but one had no fragmentation while the
other had 100%. He got an execution time of ~90ms (0 frag) vs ~340ms
100% frag).
I get similar result with my laptor (except my disk is significantly
worse: ~152ms vs ~833ms).
Here are the scripts.
--
Benoit Lobréau
Consultant
http://dalibo.com
On Mon, 2025-01-27 at 10:13 +0100, Benoit Lobréau wrote:
On 1/25/25 7:07 PM, Laurenz Albe wrote:
Looks good to me. I have one question left: the explanation for the performance
penalty of a high leaf fragmentation sounds like it would only be relevant for
disks where sequential reads are faster. If that is correct, perhaps it would be
worth mentioning.Frederic noticed a performance hit even for on his laptop with a SSD.
On Fri, 2025-01-24 at 15:41 +0100, Frédéric Yhuel wrote:
> I've noticed that maximum leaf_fragmentation can have a huge impact on
> a range index-only scan, when reading all blocs from disks, even on my
> laptop machine with SSD, but I don't know if this is the right place
> to document this?He reported to our team, that he did a test with two indexes on the same
data. They had the same density but one had no fragmentation while the
other had 100%. He got an execution time of ~90ms (0 frag) vs ~340ms
100% frag).I get similar result with my laptor (except my disk is significantly
worse: ~152ms vs ~833ms).
Thanks for checking.
I'll set the patch "ready for committer".
I personally would still like to know how fragmentation slows down performance.
Yours,
Laurenz Albe
On Mon, 27 Jan 2025 at 11:21, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
He reported to our team, that he did a test with two indexes on the same
data. They had the same density but one had no fragmentation while the
other had 100%. He got an execution time of ~90ms (0 frag) vs ~340ms
100% frag).I get similar result with my laptor (except my disk is significantly
worse: ~152ms vs ~833ms).Thanks for checking.
I'll set the patch "ready for committer".
I personally would still like to know how fragmentation slows down performance.
Probable reason is that scanning an unfragmented index results in
sequential I/O patterns that the kernel read-ahead mechanism detects
and does prefetching for us. Fragmented index looks like random I/O
and gets to wait for the full disk latency for every index page.
This is one of the tricky parts to fix for AIO, as directIO will also
bypass this mechanism. PostgreSQL would need to start issuing those
prefetches itself to not have a regression there.
In a theoretical world, where we would be able to drive prefetches
from an inner B-tree page, the difference between fragmented and
unfragmented indexes would be much less.
--
Ants Aasma
On 1/27/25 20:56, Ants Aasma wrote:
I'll set the patch "ready for committer".
Thanks!
I personally would still like to know how fragmentation slows down performance.
Probable reason is that scanning an unfragmented index results in
sequential I/O patterns that the kernel read-ahead mechanism detects
and does prefetching for us. Fragmented index looks like random I/O
and gets to wait for the full disk latency for every index page.This is one of the tricky parts to fix for AIO, as directIO will also
bypass this mechanism. PostgreSQL would need to start issuing those
prefetches itself to not have a regression there.In a theoretical world, where we would be able to drive prefetches
from an inner B-tree page, the difference between fragmented and
unfragmented indexes would be much less.
Thank you Ants, I think you are right.
I run my test again with kernel readahead disabled (with the blockdev
--setra 0 /dev/sdX command), and I obtained the following numbers with
the unfragmented index:
Buffers: shared hit=1 read=4953
I/O Timings: shared read=252.167
and these with the fragmented one:
Buffers: shared hit=1 read=4904
I/O Timings: shared read=569.341
With kernel readahead enabled, it was:
Buffers: shared hit=1 read=4953
I/O Timings: shared read=18.087
and:
Buffers: shared hit=1 read=4904
I/O Timings: shared read=336.984
I run the tests many times and there is very little variation.
We see that random access benefits a little from kernel readahead, but I
suspect that's because the blocks of the index aren't completely
scattered across the disk.
More interestingly, when kernel readahead is disabled, we see that
scanning the fragmented index still takes twice as long as scanning of
unfragmented one. AFAIK, this is normal for a SSD. Isn't it? (I always
thought that random reads and sequential reads would be almost equally
fast on an SSD, but this does not seem to be the case).
On 24.01.25 15:41, Frédéric Yhuel wrote:
On 1/24/25 14:58, Laurenz Albe wrote:
On Fri, 2025-01-24 at 13:34 +0000, Bertrand Drouvot wrote:
+ Since indexes have a default fillfactor of 90, this should be around 0.9 for + newly built indexesI think 0.9 should be replaced by 90 (that's the actual kind of
output we'd get).Damn! I missed that one too...
But having said that, I'm not sure we should mention those 90 stuff
because it
depends of the amount of data indexed (I mean if the index has a very
few
leaf pages, say < 5, then it's easy to be << 90 since it's an
average). That's
probably not the majority of indexes though so maybe just nuance the
sentence a
bit.Sorry about the 0.9.
Perhaps the wording could be more careful: ... this should be around
90 for
most newly built indexes of non-neglectable size.It looks good to me (apart from the typo). v4 attached
committed