[patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.
Please add the attached patch and this discussion to the open commit fest. The
original bugs thread is here: 20180111111254.1408.8342@wrigleys.postgresql.org.
Bug reference: 15005
Logged by: David Gould
Email address: daveg@sonic.net
PostgreSQL version: 10.1 and earlier
Operating system: Linux
Description:
ANALYZE can make pg_class.reltuples wildly inaccurate compared to the actual
row counts for tables that are larger than the default_statistics_target.
Example from one of a client's production instances:
# analyze verbose pg_attribute;
INFO: analyzing "pg_catalog.pg_attribute"
INFO: "pg_attribute": scanned 30000 of 24519424 pages, containing 6475 live rows and 83 dead rows; 6475 rows in sample, 800983035 estimated total rows.
This is a large, complex database - pg_attribute actually has about five
million rows and needs about one hundred thousand pages. However, it has
become extremely bloated and is taking 25 million pages (192GB!), about 250
times too much. This happened despite aggressive autovacuum settings and a
periodic bloat monitoring script. Since pg_class.reltuples was 800 million,
the bloat monitoring script did not detect that this table was bloated and
autovacuum did not think it needed vacuuming.
When reltuples is very large compared to the actual row count, it causes a
number of problems:
- Bad input to the query planner.
- Prevents autovacuum from processing large bloated tables, because
autovacuum_vacuum_scale_factor * reltuples is large enough that the threshold
is rarely reached (see the arithmetic below).
- Deceives bloat checking tools that rely on the relationship of relpages
to reltuples*average_row_size.
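To put rough numbers on the autovacuum point, assuming the default settings
(autovacuum_vacuum_threshold = 50, autovacuum_vacuum_scale_factor = 0.2; the
report above says autovacuum was tuned more aggressively, but the shape of the
problem is the same): autovacuum only vacuums a table once dead tuples exceed
roughly
    autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples
    = 50 + 0.2 * 800,000,000 = ~160 million
which is far more than the ~5 million live rows pg_attribute actually holds,
so the table effectively never qualifies.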
I've tracked down how this happens and created a reproduction script and a
patch. Attached:
- analyze_reltuples_bug-v1.patch Patch against master
- README.txt Instructions for testing
- analyze_reltuples_bug.sql Reproduction script
- analyze_counts.awk Helper for viewing results of test
- test_standard.txt Test output for unpatched postgresql 10.1
- test_patched.txt Test output with patch
The patch applies cleanly, with some offsets, to 9.4.15, 9.5.10, 9.6.6 and 10.1.
Note that this is not the same as the reltuples calculation bug discussed in the
thread at 16db4468-edfa-830a-f921-39a50498e77e@2ndquadrant.com. That one is
mainly concerned with vacuum, this with analyze. The two bugs do amplify each
other though.
Analysis:
---------
Analyze and vacuum calculate the new value for pg_class.reltuples in
vacuum.c:vac_estimate_reltuples():
old_density = old_rel_tuples / old_rel_pages;
new_density = scanned_tuples / scanned_pages;
multiplier = (double) scanned_pages / (double) total_pages;
updated_density = old_density + (new_density - old_density) * multiplier;
return floor(updated_density * total_pages + 0.5);
The comments talk about the difference between VACUUM and ANALYZE and explain
that VACUUM probably only scanned changed pages, so the density of the scanned
pages is not representative of the rest of the unchanged table. Hence the new
overall density of the table should be adjusted proportionally to the scanned
pages vs. total pages, which makes sense. However, despite the comment noting
that ANALYZE and VACUUM are different, the code actually does the same
calculation for both.
The problem is that this dilutes the impact of ANALYZE on reltuples for large
tables (see the sketch below):
- For a table of 3000000 pages an analyze can only change the reltuples
value by about 1%.
- When combined with changes in relpages due to bloat, the new computed
reltuples can end up far from reality.
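A minimal standalone C sketch of that first point (assuming a 3,000,000-page
table, a 30,000-page ANALYZE sample, and a stale estimate of 100 tuples/page
when the real density has dropped to 10; an illustration, not PostgreSQL
code):

    #include <stdio.h>
    #include <math.h>

    int
    main(void)
    {
        double total_pages   = 3000000.0;   /* table size in pages */
        double scanned_pages = 30000.0;     /* ANALYZE sample: 300 * target 100 */
        double reltuples     = 300000000.0; /* stale estimate: 100 tuples/page */
        double true_density  = 10.0;        /* the table is now much sparser */
        int    i;

        for (i = 1; i <= 10; i++)
        {
            /* same moving-average update as vac_estimate_reltuples() */
            double old_density     = reltuples / total_pages;
            double new_density     = true_density;  /* assume a perfect sample */
            double multiplier      = scanned_pages / total_pages;  /* 1% */
            double updated_density = old_density +
                                     (new_density - old_density) * multiplier;

            reltuples = floor(updated_density * total_pages + 0.5);
            printf("after ANALYZE %2d: reltuples = %.0f (actual %.0f)\n",
                   i, reltuples, true_density * total_pages);
        }
        return 0;
    }

Even though every sample is exactly right, reltuples creeps from 300 million
toward the true 30 million by only about 1% of the remaining gap per run,
reaching roughly 274 million after ten runs.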
Reproducing the reltuples analyze estimate bug.
-----------------------------------------------
The script "reltuples_analyze_bug.sql" creates a table that is large
compared to the analyze sample size and then repeatedly updates about
10% of it followed by an analyze each iteration. The bug is that the
calculation analyze uses to update pg_class.reltuples will tend to
increase each time even though the actual rowcount does not change.
To run:
Given a postgresql 10.x server with >= 1GB of shared buffers:
createdb test
psql --no-psqlrc -f analyze_reltuples_bug.sql test > test_standard.out 2>&1
awk -f analyze_counts.awk test_standard.out
To verify the fix, restart postgres with a patched binary and repeat
the above.
Here are the results with an unpatched server:
After 10 iterations of:
    update 10% of rows;
    analyze;
reltuples has almost doubled.
                   / estimated rows /    /    pages    /   / sampled rows /
relname              current   proposed     total  scanned     live     dead
reltuples_test      10000001   10000055    153847     3000   195000        0
reltuples_test      10981367    9951346    169231     3000   176410    18590
reltuples_test      11948112   10039979    184615     3000   163150    31850
reltuples_test      12900718   10070666    200000     3000   151060    43940
reltuples_test      13835185    9739305    215384     3000   135655    59345
reltuples_test      14758916    9864947    230768     3000   128245    66755
reltuples_test      15674572   10138631    246153     3000   123565    71435
reltuples_test      16576847    9910944    261537     3000   113685    81315
reltuples_test      17470388   10019961    276922     3000   108550    86450
reltuples_test      18356707   10234607    292306     3000   105040    89960
reltuples_test      19228409    9639927    307690     3000    93990   101010
-dg
--
David Gould daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
Attachments:
analyze_reltuples_bug-v1.patch (text/x-patch)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index cbd6e9b161..ebf03de45f 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -766,10 +766,12 @@ vacuum_set_xid_limits(Relation rel,
*
* If we scanned the whole relation then we should just use the count of
* live tuples seen; but if we did not, we should not trust the count
- * unreservedly, especially not in VACUUM, which may have scanned a quite
- * nonrandom subset of the table. When we have only partial information,
- * we take the old value of pg_class.reltuples as a measurement of the
- * tuple density in the unscanned pages.
+ * unreservedly, since we have only partial information. VACUUM in particular
+ * may have scanned a quite nonrandom subset of the table, so we take
+ * the old value of pg_class.reltuples as a measurement of the tuple
+ * density in the unscanned pages. However, ANALYZE promises that we
+ * scanned a representative random sample of the table so we should use
+ * the new density directly.
*
* This routine is shared by VACUUM and ANALYZE.
*/
@@ -791,45 +793,39 @@ vac_estimate_reltuples(Relation relation, bool is_analyze,
return scanned_tuples;
/*
- * If scanned_pages is zero but total_pages isn't, keep the existing value
- * of reltuples. (Note: callers should avoid updating the pg_class
- * statistics in this situation, since no new information has been
- * provided.)
+ * If scanned_pages is zero, keep the existing value of reltuples.
+ * (Note: callers should avoid updating the pg_class statistics in
+ * this situation, since no new information has been provided.)
*/
if (scanned_pages == 0)
return old_rel_tuples;
/*
+ * For ANALYZE, the newly observed density in the pages scanned is
+ * based on a representative sample of the whole table and can be
+ * used as-is.
+ */
+ new_density = scanned_tuples / scanned_pages;
+ if (is_analyze)
+ return floor(new_density * total_pages + 0.5);
+
+ /*
* If old value of relpages is zero, old density is indeterminate; we
- * can't do much except scale up scanned_tuples to match total_pages.
+ * can't do much except use the new_density to scale up scanned_tuples
+ * to match total_pages.
*/
if (old_rel_pages == 0)
- return floor((scanned_tuples / scanned_pages) * total_pages + 0.5);
+ return floor(new_density * total_pages + 0.5);
/*
- * Okay, we've covered the corner cases. The normal calculation is to
- * convert the old measurement to a density (tuples per page), then update
- * the density using an exponential-moving-average approach, and finally
- * compute reltuples as updated_density * total_pages.
- *
- * For ANALYZE, the moving average multiplier is just the fraction of the
- * table's pages we scanned. This is equivalent to assuming that the
- * tuple density in the unscanned pages didn't change. Of course, it
- * probably did, if the new density measurement is different. But over
- * repeated cycles, the value of reltuples will converge towards the
- * correct value, if repeated measurements show the same new density.
- *
- * For VACUUM, the situation is a bit different: we have looked at a
- * nonrandom sample of pages, but we know for certain that the pages we
- * didn't look at are precisely the ones that haven't changed lately.
- * Thus, there is a reasonable argument for doing exactly the same thing
- * as for the ANALYZE case, that is use the old density measurement as the
- * value for the unscanned pages.
- *
- * This logic could probably use further refinement.
+ * For VACUUM, the situation is different: we have looked at a nonrandom
+ * sample of pages, and we know that the pages we didn't look at are
+ * the ones that haven't changed lately. Thus, we use the old density
+ * measurement for the unscanned pages and combine it with the observed
+ * new density scaled by the ratio of scanned to unscanned pages.
*/
+
old_density = old_rel_tuples / old_rel_pages;
- new_density = scanned_tuples / scanned_pages;
multiplier = (double) scanned_pages / (double) total_pages;
updated_density = old_density + (new_density - old_density) * multiplier;
return floor(updated_density * total_pages + 0.5);
Hi David,
I was able to reproduce the problem using your script.
analyze_counts.awk is missing, though.
The idea of using the result of ANALYZE as-is, without additional
averaging, was discussed when vac_estimate_reltuples() was introduced
originally. Ultimately, it was decided not to do so. You can find the
discussion in this thread:
/messages/by-id/BANLkTinL6QuAm_Xf8teRZboG2Mdy3dR_vw@mail.gmail.com
The core problem here seems to be that this calculation of moving
average does not converge in your scenario. It can be shown that when
the number of live tuples is constant and the number of pages grows, the
estimated number of tuples will increase at each step. Do you think we
can use some other formula that would converge in this scenario, but
still filter the noise in ANALYZE results? I couldn't think of one yet.
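A minimal standalone simulation of this scenario (assuming a constant 10
million live tuples, a table that starts at 153847 pages and bloats by 15384
pages per step, and a 3000-page ANALYZE sample that measures the true density
exactly, to mirror the test setup; an illustration, not PostgreSQL code):

    #include <stdio.h>
    #include <math.h>

    int
    main(void)
    {
        double live_tuples  = 10000000.0;  /* stays constant */
        double pages        = 153847.0;    /* ~65 tuples per page initially */
        double reltuples    = live_tuples; /* start with a correct estimate */
        double sample_pages = 3000.0;
        int    i;

        for (i = 1; i <= 10; i++)
        {
            /* mirror vac_estimate_reltuples(): the old density is based on
             * the page count recorded at the previous step, but the result
             * is scaled by the new, larger page count */
            double old_density     = reltuples / pages;
            double new_pages       = pages + 15384.0;          /* bloat */
            double new_density     = live_tuples / new_pages;  /* perfect sample */
            double multiplier      = sample_pages / new_pages;
            double updated_density = old_density +
                                     (new_density - old_density) * multiplier;

            reltuples = floor(updated_density * new_pages + 0.5);
            pages = new_pages;
            printf("step %2d: pages = %.0f, reltuples = %.0f (actual %.0f)\n",
                   i, pages, reltuples, live_tuples);
        }
        return 0;
    }

Even with a perfect sample at every step, the estimate climbs each time,
because the barely changed old density gets multiplied by the new, larger page
count; the printed values land close to the "current" column of the unpatched
test output quoted earlier in the thread.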
--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, 28 Feb 2018 15:55:19 +0300
Alexander Kuzmenkov <a.kuzmenkov@postgrespro.ru> wrote:
Hi David,
I was able to reproduce the problem using your script.
analyze_counts.awk is missing, though.
Attached now I hope. I think I also added it to the commitfest page.
The idea of using the result of ANALYZE as-is, without additional
averaging, was discussed when vac_estimate_reltuples() was introduced
originally. Ultimately, it was decided not to do so. You can find the
discussion in this thread:
/messages/by-id/BANLkTinL6QuAm_Xf8teRZboG2Mdy3dR_vw@mail.gmail.com
Well, that was a long discussion. I'm not sure I would agree that there was a
firm conclusion on what to do about ANALYZE results. There was some
recognition that the case of ANALYZE is different from VACUUM, and that is
reflected in the original code comments too. However, the actual code ended up
being the same for both ANALYZE and VACUUM. This patch is about exactly that.
See messages:
/messages/by-id/BANLkTimVhdO_bKQagRsH0OLp7MxgJZDryg@mail.gmail.com
/messages/by-id/BANLkTimaDj950K-298JW09RrmG0eJ_C=qQ@mail.gmail.com
/messages/by-id/28116.1306609295@sss.pgh.pa.us
The core problem here seems to be that this calculation of moving
average does not converge in your scenario. It can be shown that when
the number of live tuples is constant and the number of pages grows, the
estimated number of tuples will increase at each step. Do you think we
can use some other formula that would converge in this scenario, but
still filter the noise in ANALYZE results? I couldn't think of one yet.
Besides the test data generated with the script, I have parsed the analyze
verbose output for several large production systems running complex
applications and have found that for tables larger than the statistics
sample size (300*default_statistics_target), the row count you can calculate
from (pages/sample_pages) * live_rows is pretty accurate, within a few
percent of the value from count(*).
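For example, plugging in the pg_attribute numbers from the original report:
6475 live rows in 30000 of 24519424 pages extrapolates to
6475 * (24519424 / 30000) = ~5.3 million rows, in line with the roughly five
million rows the table actually holds, rather than the 800 million that the
moving-average calculation produced.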
In theory the sample pages analyze uses should represent the whole table
fairly well. We rely on this to generate pg_statistic, and it is a key
input to the planner. Why should we trust it any less when it comes to
reltuples? If the analyze sampling does not work, the fix would be to improve
that, not to disregard it piecemeal.
My motivation is that I have seen large systems fighting mysterious run-away
bloat for years no matter how aggressively autovacuum is tuned. The fact that
an inflated reltuples can cause autovacuum to simply ignore tables forever
seems worth fixing.
-dg
--
David Gould daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
Attachments:
On 01.03.2018 06:23, David Gould wrote:
In theory the sample pages analyze uses should represent the whole table
fairly well. We rely on this to generate pg_statistic, and it is a key
input to the planner. Why should we trust it any less when it comes to
reltuples? If the analyze sampling does not work, the fix would be to improve
that, not to disregard it piecemeal.
Well, that sounds reasonable. But the problem with the moving average
calculation remains. Suppose you run vacuum and not analyze. If the
updates are random enough, vacuum won't be able to reclaim all the
pages, so the number of pages will grow. Again, we'll have the same
thing where the number of pages grows, the real number of live tuples
stays constant, and the estimated reltuples grows after each vacuum run.
I did some more calculations on paper to try to understand this. If we
average reltuples directly, instead of averaging tuple density, it
converges like it should. The error with this density calculation seems
to be that we're effectively multiplying the old density by the new
number of pages. I'm not sure why we even work with tuple density. We
could just estimate the number of tuples based on analyze/vacuum, and
then apply moving average to it. The calculations would be shorter, too.
What do you think?
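To spell out the two alternatives, using the same multiplier
(scanned_pages / total_pages) as the current code:
    current:  reltuples = (old_density + (new_density - old_density) * multiplier)
                          * total_pages,  where old_density = old_rel_tuples / old_rel_pages
    averaging tuple counts instead:
              reltuples = old_rel_tuples * (1 - multiplier)
                          + (new_density * total_pages) * multiplier
When total_pages has grown beyond old_rel_pages, the first form scales the old
contribution up by total_pages / old_rel_pages, while the second carries
old_rel_tuples through unchanged, which is what lets it converge in this
scenario.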
--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Alexander Kuzmenkov <a.kuzmenkov@postgrespro.ru> writes:
On 01.03.2018 06:23, David Gould wrote:
In theory the sample pages analyze uses should represent the whole table
fairly well. We rely on this to generate pg_statistic, and it is a key
input to the planner. Why should we trust it any less when it comes to
reltuples? If the analyze sampling does not work, the fix would be to improve
that, not to disregard it piecemeal.
Well, that sounds reasonable. But the problem with the moving average
calculation remains. Suppose you run vacuum and not analyze. If the
updates are random enough, vacuum won't be able to reclaim all the
pages, so the number of pages will grow. Again, we'll have the same
thing where the number of pages grows, the real number of live tuples
stays constant, and the estimated reltuples grows after each vacuum run.
You claimed that before, with no more evidence than this time, and I still
don't follow your argument. The number of pages may indeed bloat but the
number of live tuples per page will fall. Ideally, at least, the estimate
would remain on-target. If it doesn't, there's some other effect that
you haven't explained. It doesn't seem to me that the use of a moving
average would prevent that from happening. What it *would* do is smooth
out errors from the inevitable sampling bias in any one vacuum or analyze
run, and that seems like a good thing.
I did some more calculations on paper to try to understand this. If we
average reltuples directly, instead of averaging tuple density, it
converges like it should. The error with this density calculation seems
to be that we're effectively multiplying the old density by the new
number of pages. I'm not sure why we even work with tuple density. We
could just estimate the number of tuples based on analyze/vacuum, and
then apply moving average to it. The calculations would be shorter, too.
What do you think?
I think you're reinventing the way we used to do it. Perhaps consulting
the git history in the vicinity of this code would be enlightening.
regards, tom lane
On 01.03.2018 18:09, Tom Lane wrote:
Ideally, at least, the estimate would remain on-target.
The test shows that under this particular scenario the estimated number
of tuples grows after each ANALYZE. I tried to explain how this happens
in the attached pdf. The direct averaging of the number of tuples, not
using the density, doesn't have this problem, so I suppose it could help.
I think you're reinventing the way we used to do it. Perhaps consulting
the git history in the vicinity of this code would be enlightening.
I see that before vac_estimate_reltuples was introduced, the results of
analyze and vacuum were used directly, without averaging. What I am
suggesting is to use a different way of averaging, not to remove it.
--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
On Thu, 1 Mar 2018 17:25:09 +0300
Alexander Kuzmenkov <a.kuzmenkov@postgrespro.ru> wrote:
Well, that sounds reasonable. But the problem with the moving average
calculation remains. Suppose you run vacuum and not analyze. If the
updates are random enough, vacuum won't be able to reclaim all the
pages, so the number of pages will grow. Again, we'll have the same
thing where the number of pages grows, the real number of live tuples
stays constant, and the estimated reltuples grows after each vacuum run.
I agree VACUUM's moving average may be imperfect, but the rationale makes
sense and I don't have a plan to improve it now. This patch only intends to
improve the behavior of ANALYZE, by using the estimated row density times
relpages to get reltuples. It does not change VACUUM.
The problem with the moving average for ANALYZE is that it prevents ANALYZE
from changing the reltuples estimate enough for large tables.
Consider this based on the test setup from the patch:
create table big as select id*p, ten, hun, thou, tenk, lahk, meg, padding
from reltuples_test,
generate_series(0,9) g(p);
-- SELECT 100000000
alter table big set (autovacuum_enabled=false);
select count(*) from big;
-- count
-- 100000000
select reltuples::int, relpages from pg_class where relname = 'big';
-- reltuples | relpages
-- 0 | 0
analyze verbose big;
-- INFO: analyzing "public.big"
-- INFO: "big": scanned 30000 of 1538462 pages, containing 1950000 live rows and 0 dead rows;
-- 30000 rows in sample, 100000030 estimated total rows
select reltuples::int, relpages from pg_class where relname = 'big';
-- reltuples | relpages
-- 100000032 | 1538462
delete from big where ten > 1;
-- DELETE 80000000
select count(*) from big;
-- count
-- 20000000
select reltuples::int, relpages from pg_class where relname = 'big';
-- reltuples | relpages
-- 100000032 | 1538462
analyze verbose big;
-- INFO: analyzing "public.big"
-- INFO: "big": scanned 30000 of 1538462 pages, containing 388775 live rows and 1561225 dead rows;
-- 30000 rows in sample, 98438807 estimated total rows
select reltuples::int, relpages from pg_class where relname = 'big';
-- reltuples | relpages
-- 98438808 | 1538462
select count(*) from big;
-- count
-- 20000000
analyze verbose big;
-- INFO: analyzing "public.big"
-- INFO: "big": scanned 30000 of 1538462 pages, containing 390885 live rows and 1559115 dead rows;
-- 30000 rows in sample, 96910137 estimated total rows
select reltuples::int, relpages from pg_class where relname = 'big';
-- reltuples | relpages
-- 96910136 | 1538462
Table big has 1.5 million pages. ANALYZE samples 30 thousand. No matter how
many rows we change in the table, ANALYZE can only move the reltuples estimate
to old_estimate + (new_estimate - old_estimate) * (30000/1538462), i.e. about
1.9 percent of the way toward what the sample actually measured.
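Working that through for the second ANALYZE above: the sample itself implies
390885 / 30000 * 1538462 = ~20.0 million rows, but the estimate only moves
30000 / 1538462 = ~1.95% of the way there:
    98438807 + (20045391 - 98438807) * 0.0195 = ~96.9 million
which is, up to rounding, the 96910137 that ANALYZE reported.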
With the patch on this same table we get:
select count(*) from big;
-- count
-- 20000000
select reltuples::int, relpages from pg_class where relname = 'big';
-- reltuples | relpages
-- 96910136 | 1538462
analyze verbose big;
-- INFO: analyzing "public.big"
-- INFO: "big": scanned 30000 of 1538462 pages, containing 390745 live rows and 1559255 dead rows;
-- 30000 rows in sample, 20038211 estimated total rows
select reltuples::int, relpages from pg_class where relname = 'big';
-- reltuples | relpages
-- 20038212 | 1538462
-dg
--
David Gould daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
Alexander Kuzmenkov <a.kuzmenkov@postgrespro.ru> writes:
On 01.03.2018 18:09, Tom Lane wrote:
Ideally, at least, the estimate would remain on-target.
The test shows that under this particular scenario the estimated number
of tuples grows after each ANALYZE. I tried to explain how this happens
in the attached pdf.
I looked at this and don't think it really answers the question. What
happens is that, precisely because we only slowly adapt our estimate of
density towards the new measurement, we will have an overestimate of
density if the true density is decreasing (even if the new measurement is
spot-on), and that corresponds exactly to an overestimate of reltuples.
No surprise there. The question is why it fails to converge to reality
over time.
I think part of David's point is that because we only allow ANALYZE to
scan a limited number of pages even in a very large table, that creates
an artificial limit on the slew rate of the density estimate; perhaps
what's happening in his tables is that the true density is dropping
faster than that limit allows us to adapt. Still, if there's that
much going on in his tables, you'd think VACUUM would be touching
enough of the table that it would keep the estimate pretty sane.
So I don't think we yet have a convincing explanation of why the
estimates drift worse over time.
Anyway, I find myself semi-persuaded by his argument that we are
already assuming that ANALYZE has taken a random sample of the table,
so why should we not believe its estimate of density too? Aside from
simplicity, that would have the advantage of providing a way out of the
situation when the existing reltuples estimate has gotten drastically off.
The sticking point in my mind right now is, if we do that, what to do with
VACUUM's estimates. If you believe the argument in the PDF that we'll
necessarily overshoot reltuples in the face of declining true density,
then it seems like that argument applies to VACUUM as well. However,
VACUUM has the issue that we should *not* believe that it looked at a
random sample of pages. Maybe the fact that it looks exactly at the
changed pages causes it to see something less than the overall density,
cancelling out the problem, but that seems kinda optimistic.
Anyway, as I mentioned in the 2011 thread, the existing computation is
isomorphic to the rule "use the old density estimate for the pages we did
not look at, and the new density estimate --- ie, exactly scanned_tuples
--- for the pages we did look at". That still has a lot of intuitive
appeal, especially for VACUUM where there's reason to believe those page
populations aren't alike. We could recast the code to look like it's
doing that rather than doing a moving-average, although the outcome
should be the same up to roundoff error.
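Spelling that rule out in the code's notation makes the equivalence explicit:
tuples in the unscanned pages at the old density, plus scanned_tuples (that is,
new_density * scanned_pages) for the pages we did look at, is
    old_density * (total_pages - scanned_pages) + new_density * scanned_pages
      = (old_density + (new_density - old_density) * scanned_pages / total_pages)
        * total_pages
      = updated_density * total_pages
which is exactly what the moving-average code computes.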
regards, tom lane
On 02.03.2018 02:49, Tom Lane wrote:
I looked at this and don't think it really answers the question. What
happens is that, precisely because we only slowly adapt our estimate of
density towards the new measurement, we will have an overestimate of
density if the true density is decreasing (even if the new measurement is
spot-on), and that corresponds exactly to an overestimate of reltuples.
No surprise there. The question is why it fails to converge to reality
over time.
The calculation I made for the first step applies to the next steps too,
with minor differences. So, the estimate increases at each step. Just
out of interest, I plotted the reltuples for 60 steps, and it doesn't
look like it's going to converge anytime soon (see attached).
Looking at the formula, this overshoot term is created when we multiply
the old density by the new number of pages. I'm not sure how to fix
this. I think we could average the number of tuples, not the densities.
The attached patch demonstrates what I mean.
--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
reltuples-avg.patch (text/x-patch)
*** /tmp/DqhRGF_vacuum.c 2018-03-02 18:43:54.448046402 +0300
--- src/backend/commands/vacuum.c 2018-03-02 18:22:04.223070206 +0300
***************
*** 780,791 ****
BlockNumber scanned_pages,
double scanned_tuples)
{
- BlockNumber old_rel_pages = relation->rd_rel->relpages;
double old_rel_tuples = relation->rd_rel->reltuples;
! double old_density;
! double new_density;
double multiplier;
- double updated_density;
/* If we did scan the whole table, just use the count as-is */
if (scanned_pages >= total_pages)
--- 780,788 ----
BlockNumber scanned_pages,
double scanned_tuples)
{
double old_rel_tuples = relation->rd_rel->reltuples;
! double new_rel_tuples;
double multiplier;
/* If we did scan the whole table, just use the count as-is */
if (scanned_pages >= total_pages)
***************
*** 801,839 ****
return old_rel_tuples;
/*
! * If old value of relpages is zero, old density is indeterminate; we
! * can't do much except scale up scanned_tuples to match total_pages.
*/
! if (old_rel_pages == 0)
! return floor((scanned_tuples / scanned_pages) * total_pages + 0.5);
/*
! * Okay, we've covered the corner cases. The normal calculation is to
! * convert the old measurement to a density (tuples per page), then update
! * the density using an exponential-moving-average approach, and finally
! * compute reltuples as updated_density * total_pages.
! *
! * For ANALYZE, the moving average multiplier is just the fraction of the
! * table's pages we scanned. This is equivalent to assuming that the
! * tuple density in the unscanned pages didn't change. Of course, it
! * probably did, if the new density measurement is different. But over
! * repeated cycles, the value of reltuples will converge towards the
! * correct value, if repeated measurements show the same new density.
! *
! * For VACUUM, the situation is a bit different: we have looked at a
! * nonrandom sample of pages, but we know for certain that the pages we
! * didn't look at are precisely the ones that haven't changed lately.
! * Thus, there is a reasonable argument for doing exactly the same thing
! * as for the ANALYZE case, that is use the old density measurement as the
! * value for the unscanned pages.
! *
! * This logic could probably use further refinement.
*/
- old_density = old_rel_tuples / old_rel_pages;
- new_density = scanned_tuples / scanned_pages;
multiplier = (double) scanned_pages / (double) total_pages;
! updated_density = old_density + (new_density - old_density) * multiplier;
! return floor(updated_density * total_pages + 0.5);
}
--- 798,825 ----
return old_rel_tuples;
/*
! * Estimate the total number of tuples based on the density of scanned
! * tuples.
*/
! new_rel_tuples = floor((scanned_tuples / scanned_pages) * total_pages + 0.5);
/*
! * ANALYZE scans a representative subset of pages, so we trust its density
! * estimate.
! */
! if (is_analyze)
! return new_rel_tuples;
!
! /*
! * VACUUM scanned a nonrandom sample of pages, so we can't just scale up its
! * result. For the portion of table it didn't scan, use the old number of tuples,
! * and for the portion it did scan, use the number it reported. This is
! * effectively an exponential moving average with adaptive factor.
*/
multiplier = (double) scanned_pages / (double) total_pages;
! return floor(old_rel_tuples * (1. - multiplier)
! + new_rel_tuples * multiplier
! + 0.5);
}
analyze.png (image/png): plot of the estimated reltuples over 60 ANALYZE
iterations [binary image data not included]