BUG #18046: stats collection behaviour change is affecting the usability of information.

Started by PG Bug reporting formover 2 years ago10 messagesbugs
Jump to latest
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 18046
Logged by: Jobin Augustine
Email address: jobinau@gmail.com
PostgreSQL version: 15.3
Operating system: CentOS/RHEL
Description:

After stats collection changes in PG 15, The behaviour stats_reset
information is changed. which is adversely affecting the usability of data
presented in the view.
```
select stats_reset from pg_stat_database;
```
Before PG15, stats_reset was always populated so that the user knows the
cumulative values presented are of how many days.
But unfortunately, PG 15 keeps it null unless there is an explicit reset
(pg_stat_reset). The value goes again null if there is a reset due to a
crash.
So on a regular system, the User can't understand the cumulative values
presented in the view.

#2Hamid Akhtar
hamid.akhtar@percona.com
In reply to: PG Bug reporting form (#1)
Re: BUG #18046: stats collection behaviour change is affecting the usability of information.

On Tue, 1 Aug 2023 at 22:01, PG Bug reporting form <noreply@postgresql.org>
wrote:

The following bug has been logged on the website:

Bug reference: 18046
Logged by: Jobin Augustine
Email address: jobinau@gmail.com
PostgreSQL version: 15.3
Operating system: CentOS/RHEL
Description:

After stats collection changes in PG 15, The behaviour stats_reset
information is changed. which is adversely affecting the usability of data
presented in the view.
```
select stats_reset from pg_stat_database;
```
Before PG15, stats_reset was always populated so that the user knows the
cumulative values presented are of how many days.
But unfortunately, PG 15 keeps it null unless there is an explicit reset
(pg_stat_reset). The value goes again null if there is a reset due to a
crash.
So on a regular system, the User can't understand the cumulative values
presented in the view.

Thank you for the bug report Jobin.

IMHO, this is a valid concern. As per the documentation, the "stats_reset"
column tracks the last time the stats were reset. There is no mention of
this being timestamp for manual reset only.

Internally, the server calls pgstat_reset_after_failure whenever it cannot
find the stats file or if the server is recovering from a crash. In case of
a crash, the stats may no longer be valid. Therefore, internally the stats
are dropped, and a new file is written.

Attach is a fix for PG16 and PG15 that resolves this issue. It ensures that
when the database stats are being written to disk and the stats_reset is
not set, it adds the current timestamp to it. Since a new file is written
at initdb and when the server is recovering from a crash, this works as
expected.

Attachments:

pg16_stat_reset_fix.patchapplication/octet-stream; name=pg16_stat_reset_fix.patchDownload+4-0
#3Jobin Augustine
jobinau@gmail.com
In reply to: Hamid Akhtar (#2)
Re: BUG #18046: stats collection behaviour change is affecting the usability of information.

Thank you Hamid for working on this and coming with a fix.

On Fri, Aug 4, 2023 at 4:53 PM Hamid Akhtar <hamid.akhtar@percona.com>
wrote:

Thank you for the bug report Jobin.

IMHO, this is a valid concern. As per the documentation, the "stats_reset"
column tracks the last time the stats were reset. There is no mention of
this being timestamp for manual reset only.

Without this base info, users don't have the option to understand the

cumulative statistics in the stats view

Attach is a fix for PG16 and PG15 that resolves this issue. It ensures
that when the database stats are being written to disk and the stats_reset
is not set, it adds the current timestamp to it. Since a new file is
written at initdb and when the server is recovering from a crash, this
works as expected.

I can confirm that this patch fixes the problem.
I could find simple steps to reproduce the original problem independently.

Step 1 : Create a new database
CREATE DATABASE db1;

Step 2. Create a table in the database
\c db1
CREATE TABLE t1 (id INT);

Step 3. Check the timestamp of the start of database-level statistics
db1=# SELECT datname,stats_reset FROM pg_stat_database;

Expected behaviour(works in all versions upto and including PostgreSQL 14)

datname | stats_reset
-----------+-------------------------------
| 2023-08-02 06:41:15.777135+00
postgres | 2023-08-02 06:41:15.777108+00
template1 |
template0 |
db1 | 2023-08-02 11:02:54.954363+00
(5 rows)

The problem in PostgreSQL 15 and above

datname | stats_reset
-----------+------------------------------
|
postgres |
db1 |
template1 |
template0 |
(5 rows)

Once again, Thank you for the fix.

Jobin Augustine.

#4Bruce Momjian
bruce@momjian.us
In reply to: Jobin Augustine (#3)
Re: BUG #18046: stats collection behaviour change is affecting the usability of information.

Andres, can you comment on this thread? I see you had a commit to PG
15 in this area:

commit 5cd1c40b3c
Author: Andres Freund <andres@anarazel.de>
Date: Thu Apr 14 17:40:25 2022 -0700

pgstat: set timestamps of fixed-numbered stats after a crash.

When not loading stats at startup (i.e. pgstat_discard_stats() getting
called), reset timestamps of fixed numbered stats would be left at
0. Oversight in 5891c7a8ed8.

Instead use pgstat_reset_after_failure() and add tests verifying that
fixed-numbered reset timestamps are set appropriately.

Reported-By: "David G. Johnston" <david.g.johnston@gmail.com>
Discussion: /messages/by-id/CAKFQuwamFuaQHKdhcMt4Gbw5+Hca2UE741B8gOOXoA=TtAd2Yw@mail.gmail.com

Thanks.

---------------------------------------------------------------------------

On Fri, Aug 4, 2023 at 10:40:46PM +0530, Jobin Augustine wrote:

Thank you Hamid for working on this and coming with a fix.

On Fri, Aug 4, 2023 at 4:53 PM Hamid Akhtar <hamid.akhtar@percona.com> wrote:

Thank you for the bug report Jobin.

IMHO, this is a valid concern. As per the documentation, the "stats_reset"
column tracks the last time the stats were reset. There is no mention of
this being timestamp for manual reset only.

Without this base info, users don't have the option to understand the
cumulative statistics in the stats view
 

Attach is a fix for PG16 and PG15 that resolves this issue. It ensures that
when the database stats are being written to disk and the stats_reset is
not set, it adds the current timestamp to it. Since a new file is written
at initdb and when the server is recovering from a crash, this works as
expected.
 

I can confirm that this patch fixes the problem.
I could find simple steps to reproduce the original problem independently.

Step 1 : Create a new database
CREATE DATABASE db1;

Step 2. Create a table in the database
\c db1
CREATE TABLE t1 (id INT);

Step 3. Check the timestamp of the start of database-level statistics
db1=# SELECT datname,stats_reset FROM pg_stat_database;

Expected behaviour(works in all versions upto and including PostgreSQL 14)

  datname  |          stats_reset          
-----------+-------------------------------
           | 2023-08-02 06:41:15.777135+00
 postgres  | 2023-08-02 06:41:15.777108+00
 template1 | 
 template0 | 
 db1       | 2023-08-02 11:02:54.954363+00
(5 rows)

The problem in PostgreSQL 15 and above

  datname  |         stats_reset          
-----------+------------------------------
           | 
 postgres  | 
 db1       | 
 template1 | 
 template0 | 
(5 rows)

Once again, Thank you for the fix.

Jobin Augustine.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Only you can decide what is important to you.

#5Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#4)
Re: BUG #18046: stats collection behaviour change is affecting the usability of information.

(This email has Andres properly in the TO field.)

Andres, can you comment on this thread? I see you had a commit to PG
15 in this area:

commit 5cd1c40b3c
Author: Andres Freund <andres@anarazel.de>
Date: Thu Apr 14 17:40:25 2022 -0700

pgstat: set timestamps of fixed-numbered stats after a crash.

When not loading stats at startup (i.e. pgstat_discard_stats() getting
called), reset timestamps of fixed numbered stats would be left at
0. Oversight in 5891c7a8ed8.

Instead use pgstat_reset_after_failure() and add tests verifying that
fixed-numbered reset timestamps are set appropriately.

Reported-By: "David G. Johnston" <david.g.johnston@gmail.com>
Discussion: /messages/by-id/CAKFQuwamFuaQHKdhcMt4Gbw5+Hca2UE741B8gOOXoA=TtAd2Yw@mail.gmail.com

Thanks.

---------------------------------------------------------------------------

On Fri, Aug 4, 2023 at 10:40:46PM +0530, Jobin Augustine wrote:

Thank you Hamid for working on this and coming with a fix.

On Fri, Aug 4, 2023 at 4:53 PM Hamid Akhtar <hamid.akhtar@percona.com> wrote:

Thank you for the bug report Jobin.

IMHO, this is a valid concern. As per the documentation, the "stats_reset"
column tracks the last time the stats were reset. There is no mention of
this being timestamp for manual reset only.

Without this base info, users don't have the option to understand the
cumulative statistics in the stats view
 

Attach is a fix for PG16 and PG15 that resolves this issue. It ensures that
when the database stats are being written to disk and the stats_reset is
not set, it adds the current timestamp to it. Since a new file is written
at initdb and when the server is recovering from a crash, this works as
expected.
 

I can confirm that this patch fixes the problem.
I could find simple steps to reproduce the original problem independently.

Step 1 : Create a new database
CREATE DATABASE db1;

Step 2. Create a table in the database
\c db1
CREATE TABLE t1 (id INT);

Step 3. Check the timestamp of the start of database-level statistics
db1=# SELECT datname,stats_reset FROM pg_stat_database;

Expected behaviour(works in all versions upto and including PostgreSQL 14)

  datname  |          stats_reset          
-----------+-------------------------------
           | 2023-08-02 06:41:15.777135+00
 postgres  | 2023-08-02 06:41:15.777108+00
 template1 | 
 template0 | 
 db1       | 2023-08-02 11:02:54.954363+00
(5 rows)

The problem in PostgreSQL 15 and above

  datname  |         stats_reset          
-----------+------------------------------
           | 
 postgres  | 
 db1       | 
 template1 | 
 template0 | 
(5 rows)

Once again, Thank you for the fix.

Jobin Augustine.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Only you can decide what is important to you.

#6Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Bruce Momjian (#4)
Re: BUG #18046: stats collection behaviour change is affecting the usability of information.

At Wed, 6 Sep 2023 12:57:01 -0400, Bruce Momjian <bruce@momjian.us> wrote in

Expected behaviour(works in all versions upto and including PostgreSQL 14)

  datname  |          stats_reset          
-----------+-------------------------------
           | 2023-08-02 06:41:15.777135+00
 postgres  | 2023-08-02 06:41:15.777108+00
 template1 | 
 template0 | 
 db1       | 2023-08-02 11:02:54.954363+00
(5 rows)

The problem in PostgreSQL 15 and above

  datname  |         stats_reset          
-----------+------------------------------
           | 
 postgres  | 
 db1       | 
 template1 | 
 template0 | 
(5 rows)

I agree it's not ideal to reset timestamps only for fixed-numbered
stats. In 13, the timestamp is set to the creation time. There's no
practical issue with doing it that way for now, I think it's
generallynot preferable.

We could use the transaction start time as a sufficient approximation
for the timestamp in fast paths. It might not align with with the
reset time set by pg_stat_reset(), but that shouldn't be a
problem. pgstat_restore_stats runs outside of a transaction, but
GetCurrentTimestamp() is already being used there. The reason for
defining a new function is to improve robustness, although it might
not be necessary.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

pgstat_set_reset_time_after_crash.patchtext/x-patch; charset=us-asciiDownload+28-3
#7Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#6)
Re: BUG #18046: stats collection behaviour change is affecting the usability of information.

At Thu, 07 Sep 2023 14:12:32 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

We could use the transaction start time as a sufficient approximation
for the timestamp in fast paths. It might not align with with the
reset time set by pg_stat_reset(), but that shouldn't be a
problem. pgstat_restore_stats runs outside of a transaction, but

To be accurate, it's not "outside of a transaction", but "before any
transaction starts".

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#8Andres Freund
andres@anarazel.de
In reply to: Bruce Momjian (#5)
Re: BUG #18046: stats collection behaviour change is affecting the usability of information.

Hi,

On 2023-09-06 13:06:35 -0400, Bruce Momjian wrote:

(This email has Andres properly in the TO field.)

Andres, can you comment on this thread? I see you had a commit to PG
15 in this area:

I'll try to look at it soon. I'm only now catching up on email after being on
vacation.

Regards,

Andres

#9Andres Freund
andres@anarazel.de
In reply to: Bruce Momjian (#5)
Re: BUG #18046: stats collection behaviour change is affecting the usability of information.

Hi,

On 2023-09-06 13:06:35 -0400, Bruce Momjian wrote:

(This email has Andres properly in the TO field.)

Andres, can you comment on this thread? I see you had a commit to PG
15 in this area:

commit 5cd1c40b3c
Author: Andres Freund <andres@anarazel.de>
Date: Thu Apr 14 17:40:25 2022 -0700

pgstat: set timestamps of fixed-numbered stats after a crash.

When not loading stats at startup (i.e. pgstat_discard_stats() getting
called), reset timestamps of fixed numbered stats would be left at
0. Oversight in 5891c7a8ed8.

Instead use pgstat_reset_after_failure() and add tests verifying that
fixed-numbered reset timestamps are set appropriately.

Reported-By: "David G. Johnston" <david.g.johnston@gmail.com>
Discussion: /messages/by-id/CAKFQuwamFuaQHKdhcMt4Gbw5+Hca2UE741B8gOOXoA=TtAd2Yw@mail.gmail.com

Thanks.

I don't think that commit is relevant, but I still am to blame :)

On Fri, Aug 4, 2023 at 10:40:46PM +0530, Jobin Augustine wrote:

Thank you Hamid for working on this and coming with a fix.
Attach is a fix for PG16 and PG15 that resolves this issue. It ensures that
when the database stats are being written to disk and the stats_reset is
not set, it adds the current timestamp to it. Since a new file is written
at initdb and when the server is recovering from a crash, this works as
expected.

I don't think that's the right approach. This way we'll end up with a stats
reset time that can be *after* other stats have already accumulated. Which
doesn't seem great.

I think Horiguchi-san's approach is closer to what we would want.

Unfortunately, the reset time for other variable-numbered stats entries
historically has behaved differently than what you're
desiring. E.g. pg_stat_replication_slots.stats_reset isn't set unless you have
reset the stats.

I'm ok with changing the semantics to one behaviour, but I'm a bit hesitant
whacking this around further in a minor release...

Greetings,

Andres Freund

#10Jobin Augustine
jobinau@gmail.com
In reply to: Andres Freund (#9)
Re: BUG #18046: stats collection behaviour change is affecting the usability of information.

I think Horiguchi-san's approach is closer to what we would want.

Great.

Unfortunately, the reset time for other variable-numbered stats entries
historically has behaved differently than what you're
desiring. E.g. pg_stat_replication_slots.stats_reset isn't set unless you
have
reset the stats.

Conceptually, that's correct. But the Impact on users are different.
SQL statements around pg_stat_database are something every user (DBA) has
in their arsenal for essential checks and troubleshooting.
The impact is big when their queries stopped working/giving wrong results
from PostgeSQL 15 onwards.

I'm ok with changing the semantics to one behaviour,

Great. I believe everyone will appreciate a consistent behaviour

but I'm a bit hesitant
whacking this around further in a minor release...

Since this sudden undocumented behaviour change caught everyone by
surprise,
and case this is a too big change for a "bug-fix", is it possible to have
a temporary fix for a minor version, fixing pg_stat_database alone?

Greetings,
Andres Freund

Regards,
Jobin Augustine