Possible fails in pg_stat_statements test

Started by Anton A. Melnikovover 4 years ago9 messageshackers
Jump to latest
#1Anton A. Melnikov
aamelnikov@inbox.ru

Hello,

There are some places in the pg_state_statement's regress test where the
bool result of comparison between the number of rows obtained and
wal_records generated by query should be displayed.
Now counting the number of wal_records for some query in
pg_state_statement is done by the global pgWalUsage.wal_records counter
difference calculation.
During query execution the extra wal_records may appear that are not
relate to the query.
There are two reasons why this might happen:
1) Owing to the hit into pruning of some page in optional pruning
function (heap_page_prune_opt()).
2) When a new page is required for a new xid in clog and
WriteZeroPageXlogRec() was called.
In both cases an extra wal record with zero xl_xid is generated, so
wal_records counter gives an incremented value for this query and
pg_stat_statement test will fail.

This patch introduces an additional counter of wal records not related
to the query being executed.
Due to this counter pg_stat_statement finds out the number of wal
records that are not relevant to the query and does not include them in
the per query statistic.
This removes the possibility of the error described above.

There is a way to reproduce this error when patch is not applied:
1) start server with "shared_preload_libraries = 'pg_stat_statements'"
string in the postgresql.conf;
2) replace makefile in contrib/pg_stat_statements with attached one;
3) replace test file
contrib/pg_stat_statements/sql/pg_stat_statements.sql and expected
results contrib/pg_stat_statements/expected/pg_stat_statements.out
with shorter versions from attached files;
4) copy test.sh to contrib/pg_stat_statements and make sure that PGHOME
point to your server;
5) cd to contrib/pg_stat_statements and execute:
export ITER=1 && while ./start.sh || break; export ITER=$(($ITER+1)); do
:; done

Usually 100-200 iterations will be enough.
To catch the error more faster one can add wal_records column to SELECT
in line 26 of contrib/pg_stat_statements/sql/pg_stat_statements.sql as
followes:
SELECT query, calls, rows, wal_records,
and replace the contrib/pg_stat_statements/expected/pg_stat_statements.out
with attached pg_stat_statements-fast.out

With best regards,
--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

v1-0001-Fix-possible-fails-in-pg_stat_statements-test.patchtext/x-patch; charset=UTF-8; name=v1-0001-Fix-possible-fails-in-pg_stat_statements-test.patchDownload+21-1
Makefiletext/plain; charset=UTF-8; name=MakefileDownload
pg_stat_statements.sqlapplication/sql; name=pg_stat_statements.sqlDownload
pg_stat_statements.outtext/plain; charset=UTF-8; name=pg_stat_statements.outDownload
start.shapplication/x-shellscript; name=start.shDownload
pg_stat_statements-fast.outtext/plain; charset=UTF-8; name=pg_stat_statements-fast.outDownload
#2Anton A. Melnikov
aamelnikov@inbox.ru
In reply to: Anton A. Melnikov (#1)
Re: Possible fails in pg_stat_statements test

Hello!

Here is the second version of the patch rebased onto the current master.
No logical changes.
All other attached files from previous letter are actual.

With best regards,

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

v2-0001-Fix-possible-fails-in-pg_stat_statements-test.patchtext/x-patch; charset=UTF-8; name=v2-0001-Fix-possible-fails-in-pg_stat_statements-test.patchDownload+21-1
#3Andres Freund
andres@anarazel.de
In reply to: Anton A. Melnikov (#1)
Re: Possible fails in pg_stat_statements test

Hi,

On 2022-01-14 11:11:07 +0300, Anton A. Melnikov wrote:

This patch introduces an additional counter of wal records not related to
the query being executed.

They're not unrelated though.

Due to this counter pg_stat_statement finds out the number of wal records
that are not relevant to the query and does not include them in the per
query statistic.

-many. For read-only queries the generated WAL due to on-access pruning can be
a significant factor in performance. Removing that information makes
pg_stat_statments *less* useful.

This removes the possibility of the error described above.

There is a way to reproduce this error when patch is not applied:
1) start server with "shared_preload_libraries = 'pg_stat_statements'"
string in the postgresql.conf;
2) replace makefile in contrib/pg_stat_statements with attached one;
3) replace test file contrib/pg_stat_statements/sql/pg_stat_statements.sql
and expected results
contrib/pg_stat_statements/expected/pg_stat_statements.out
with shorter versions from attached files;
4) copy test.sh to contrib/pg_stat_statements and make sure that PGHOME
point to your server;
5) cd to contrib/pg_stat_statements and execute:
export ITER=1 && while ./start.sh || break; export ITER=$(($ITER+1)); do :;
done

Usually 100-200 iterations will be enough.
To catch the error more faster one can add wal_records column to SELECT
in line 26 of contrib/pg_stat_statements/sql/pg_stat_statements.sql as
followes:
SELECT query, calls, rows, wal_records,
and replace the contrib/pg_stat_statements/expected/pg_stat_statements.out
with attached pg_stat_statements-fast.out

Can the test failures be encountered without such an elaborate setup? If not,
then I don't really see why we need to do anything here?

Greetings,

Andres Freund

#4Anton A. Melnikov
aamelnikov@inbox.ru
In reply to: Andres Freund (#3)
Re: Possible fails in pg_stat_statements test

Hello,

thank you much for your attention and for your thought.

On 20.03.2022 20:36, Andres Freund wrote:

This patch introduces an additional counter of wal records not related to
the query being executed.

They're not unrelated though.

Yes, i've missformulated here.
Indeed there is a relation but it seems of the some other kind.
It would be nice to clarify the terminology.
Maybe divide WAL records into two kinds:
1) WAL records, the number of which depends on the given query itself.
(say strong relation)
2) WAL records, the number of which depends on the given query and on
the previous query history. (say weak relation)

So modified in the patch wal_records counter will belongs to the first
kind while the number of wal records due to on-access pruning and new
clog page generation to the second.

-many. For read-only queries the generated WAL due to on-access pruning can be
a significant factor in performance. Removing that information makes
pg_stat_statments *less* useful.

A separate counter for the second type of records, say,
extra_wal_records, will not only remove this disadvantage, but on the
contrary will provide additional information.

The next version of the patch with additional counter is attached.

Really, now it is clearly seen that sometimes

WAL due to on-access pruning can be a significant factor !

After pgbench -c10 -t300:
postgres=# SELECT substring(query for 30), wal_records,
extra_wal_records FROM pg_stat_statements WHERE extra_wal_records != 0;

substring | wal_records | extra_wal_records
--------------------------------+-------------+-------------------
UPDATE pgbench_tellers SET tba | 4557 | 15
create table pgbench_history(t | 48 | 1
create table pgbench_branches( | 40 | 1
UPDATE pgbench_accounts SET ab | 5868 | 1567
drop table if exists pgbench_a | 94 | 1
UPDATE pgbench_branches SET bb | 5993 | 14
SELECT abalance FROM pgbench_a | 0 | 7
(7 rows)

Can the test failures be encountered without such an elaborate setup? If not,
then I don't really see why we need to do anything here?

There was a real bug report from our test department. They do long time
repetitive tests and sometimes met this failure.
So i suppose there is a non-zero probability that such error can occur
in the one-shot test as well.
The sequence given in the first letter helps to catch this failure quickly.

With best regards,

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

v3-0001-Fix-possible-fails-in-pg_stat_statements-test.patchtext/x-patch; charset=UTF-8; name=v3-0001-Fix-possible-fails-in-pg_stat_statements-test.patchDownload+145-27
#5Robert Haas
robertmhaas@gmail.com
In reply to: Anton A. Melnikov (#4)
Re: Possible fails in pg_stat_statements test

On Wed, Mar 30, 2022 at 2:20 AM Anton A. Melnikov <aamelnikov@inbox.ru> wrote:

Can the test failures be encountered without such an elaborate setup? If not,
then I don't really see why we need to do anything here?

There was a real bug report from our test department. They do long time
repetitive tests and sometimes met this failure.
So i suppose there is a non-zero probability that such error can occur
in the one-shot test as well.
The sequence given in the first letter helps to catch this failure quickly.

I don't think that the idea of "extra" WAL records is very principled.
It's pretty vague what "extra" means, and your definition seems to be
basically "whatever would be needed to make this test case pass." I
think the problem is basically with the test cases's idea that # of
WAL records and # of table rows ought to be equal. I think that's just
false. In general, we'd also have to worry about index insertions,
which would provoke variable numbers of WAL records depending on
whether they cause a page split. And we'd have to worry about TOAST
table insertions, which could produce different numbers of records
depending on the size of the data, the configured block size and TOAST
threshold, and whether the TOAST table index incurs a page split. So
even if we added a mechanism like what you propose here, we would only
be fixing this particular test case, not creating infrastructure of
any general utility.

If it's true that this test case sometimes randomly fails, then we
ought to fix that somehow, maybe by just removing this particular
check from the test case, or changing it to >=, or something like
that. But I don't think adding a new counter is the right idea.

--
Robert Haas
EDB: http://www.enterprisedb.com

#6Anton A. Melnikov
aamelnikov@inbox.ru
In reply to: Robert Haas (#5)
Re: Possible fails in pg_stat_statements test

Hello!

On 30.03.2022 22:36, Robert Haas wrote:

I don't think that the idea of "extra" WAL records is very principled.
It's pretty vague what "extra" means, and your definition seems to be
basically "whatever would be needed to make this test case pass." I
think the problem is basically with the test cases's idea that # of
WAL records and # of table rows ought to be equal. I think that's just
false. In general, we'd also have to worry about index insertions,
which would provoke variable numbers of WAL records depending on
whether they cause a page split. And we'd have to worry about TOAST
table insertions, which could produce different numbers of records
depending on the size of the data, the configured block size and TOAST
threshold, and whether the TOAST table index incurs a page split.

Thank you very much for this information. I really didn't take it into
account.

If it's true that this test case sometimes randomly fails, then we
ought to fix that somehow, maybe by just removing this particular
check from the test case, or changing it to >=, or something like
that. But I don't think adding a new counter is the right idea.

Indeed. Then there is a very simple solution for this particular case as
wal_records counter may only sometime becomes greater but never less.
The corresponding patch is attached.

With best regards,
--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

v4-0001-Fix-possible-fails-in-pg_stat_statements-test.patchtext/x-patch; charset=UTF-8; name=v4-0001-Fix-possible-fails-in-pg_stat_statements-test.patchDownload+5-5
#7Julien Rouhaud
rjuju123@gmail.com
In reply to: Anton A. Melnikov (#6)
Re: Possible fails in pg_stat_statements test

Hi,

On Thu, Mar 31, 2022 at 06:08:01PM +0300, Anton A. Melnikov wrote:

Hello!

On 30.03.2022 22:36, Robert Haas wrote:

I don't think that the idea of "extra" WAL records is very principled.
It's pretty vague what "extra" means, and your definition seems to be
basically "whatever would be needed to make this test case pass."

I agree, and even it there was a better definition there probably isn't much to
learn from it.

I

think the problem is basically with the test cases's idea that # of
WAL records and # of table rows ought to be equal. I think that's just
false. In general, we'd also have to worry about index insertions,
which would provoke variable numbers of WAL records depending on
whether they cause a page split. And we'd have to worry about TOAST
table insertions, which could produce different numbers of records
depending on the size of the data, the configured block size and TOAST
threshold, and whether the TOAST table index incurs a page split.

Indeed, we added this test as it was hitting only a few queries with small
rows, which we thought would be stable, but that's apparently not the case. I
think the reason we never had any problem is that the buildfarm currently
doesn't run pg_stat_statement regression test, as it's marked as
NO_INSTALLCHECK. Other CI systems like at pgpro evidently have a different
approach.

If it's true that this test case sometimes randomly fails, then we
ought to fix that somehow, maybe by just removing this particular
check from the test case, or changing it to >=, or something like
that. But I don't think adding a new counter is the right idea.

Indeed. Then there is a very simple solution for this particular case as
wal_records counter may only sometime becomes greater but never less.
The corresponding patch is attached.

+1 for this approach, and the patch looks good to me.

#8Robert Haas
robertmhaas@gmail.com
In reply to: Julien Rouhaud (#7)
Re: Possible fails in pg_stat_statements test

On Thu, Mar 31, 2022 at 12:00 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

Indeed. Then there is a very simple solution for this particular case as
wal_records counter may only sometime becomes greater but never less.
The corresponding patch is attached.

+1 for this approach, and the patch looks good to me.

Committed.

--
Robert Haas
EDB: http://www.enterprisedb.com

#9Anton A. Melnikov
aamelnikov@inbox.ru
In reply to: Robert Haas (#8)
Re: Possible fails in pg_stat_statements test

On 06.07.2022 20:11, Robert Haas wrote:

On Thu, Mar 31, 2022 at 12:00 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

Indeed. Then there is a very simple solution for this particular case as
wal_records counter may only sometime becomes greater but never less.
The corresponding patch is attached.

+1 for this approach, and the patch looks good to me.

Committed.

Thanks a lot!

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company