Tracking last scan time

Started by Dave Page · over 3 years ago · 43 messages · pgsql-hackers
#1 Dave Page <dpage@pgadmin.org>

Often it is beneficial to review one's schema with a view to removing
indexes (and sometimes tables) that are no longer required. It's very
difficult to understand when that is the case by looking at the number of
scans of a relation as, for example, an index may be used infrequently but
may be critical in those times when it is used.

The attached patch against HEAD adds optional tracking of the last scan
time for relations. It updates pg_stat_*_tables with new last_seq_scan and
last_idx_scan columns, and pg_stat_*_indexes with a last_idx_scan column to
help with this.

Due to the use of gettimeofday(), those values are only maintained if a new
GUC, track_scans, is set to on. By default, it is off.
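The guarded update described above can be sketched like this. This is a simplified, self-contained toy, not the patch's actual code: the struct, the `track_scans` flag, and `flush_table_stats` are stand-ins for PostgreSQL's shared `PgStat_StatTabEntry`, the proposed GUC, and the stats flush path.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the patch's types: in PostgreSQL these would be the
 * shared PgStat_StatTabEntry and TimestampTz. */
typedef int64_t TimestampTz;

typedef struct TabEntry
{
    int64_t     numscans;
    TimestampTz lastscan;       /* the new "last scan" timestamp */
} TabEntry;

static bool track_scans = true; /* stand-in for the proposed GUC */

/* Merge one batch of per-backend scan counts into the shared entry,
 * touching the timestamp only when scans actually happened and the
 * GUC is on -- keeping the clock lookup off the default path. */
static void
flush_table_stats(TabEntry *tab, int64_t numscans, TimestampTz now)
{
    tab->numscans += numscans;
    if (track_scans && numscans > 0)
        tab->lastscan = now;
}
```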

I did run a 12 hour test to see what the performance impact is. pgbench was
run with scale factor 10000 and 75 users across 4 identical bare metal
machines running Rocky 8 in parallel which showed roughly a -2% average
performance penalty against HEAD with track_scans enabled. Machines were
PowerEdge R7525's with 128GB RAM, dual 16C/32T AMD 7302 CPUs, with the data
directory on 6 x 800GB 12Gb/s SSD SAS drives in RAID 0. Kernel time source
is tsc.

          HEAD          track_scans    Penalty (%)
box1      19582.49735   19341.8881     -1.22869541
box2      19936.55513   19928.07479    -0.04253664659
box3      19631.78895   18649.64379    -5.002830696
box4      19810.86767   19420.67192    -1.969604525
Average   19740.42728   19335.06965    -2.05343896

Doc and test updates included.

--
Dave Page
Blog: https://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: https://www.enterprisedb.com

Attachments:

last_scan_v1.diff (application/octet-stream, +115 -1)
#2 Greg Stark <stark@mit.edu>
In reply to: Dave Page (#1)
Re: Tracking last scan time

On Tue, 23 Aug 2022 at 11:00, Dave Page <dpage@pgadmin.org> wrote:

Often it is beneficial to review one's schema with a view to removing indexes (and sometimes tables) that are no longer required. It's very difficult to understand when that is the case by looking at the number of scans of a relation as, for example, an index may be used infrequently but may be critical in those times when it is used.

I think this is easy to answer in a prometheus/datadog/etc world, since
you can consult the history of the count to see when it was last
incremented. (Or do effectively that continuously.)

I guess that just reinforces the idea that it should be optional.
Perhaps there's room for some sort of general feature for controlling
various time series aggregates like max(), min(), sum() or, uhm,
timeoflastchange() on whatever stats you want. That would let us
remove a bunch of stuff from pg_stat_statements and let users turn on
just the ones they want. And also let users enable things like time of
last rollback or conflict etc. But that's just something to think
about down the road.

The attached patch against HEAD adds optional tracking of the last scan time for relations. It updates pg_stat_*_tables with new last_seq_scan and last_idx_scan columns, and pg_stat_*_indexes with a last_idx_scan column to help with this.

Due to the use of gettimeofday(), those values are only maintained if a new GUC, track_scans, is set to on. By default, it is off.

Bikeshedding warning -- "track_scans" could equally apply to almost
any stats about scans. I think the really relevant thing here is the
times, not the scans. I think the GUC should be "track_scan_times". Or
could that still be confused with scan durations? Maybe
"track_scan_timestamps"?

You could maybe make the gettimeofday cheaper by doing it less often.
Like, skipping the increment if the old timestamp is newer than 1s
before the transaction start time (I think that's available free if
some other guc is enabled but I don't recall). Or isn't this cb
normally happening after transaction end? So xactStopTimestamp might
be available already?

--
greg

#3 Dave Page <dpage@pgadmin.org>
In reply to: Greg Stark (#2)
Re: Tracking last scan time

Hi

On Tue, 23 Aug 2022 at 13:07, Greg Stark <stark@mit.edu> wrote:

On Tue, 23 Aug 2022 at 11:00, Dave Page <dpage@pgadmin.org> wrote:

Often it is beneficial to review one's schema with a view to removing indexes (and sometimes tables) that are no longer required. It's very difficult to understand when that is the case by looking at the number of scans of a relation as, for example, an index may be used infrequently but may be critical in those times when it is used.

I think this is easy to answer in a prometheus/datadog/etc world since
you can consult the history of the count to see when it was last
incremented. (Or do effectively that continuously.)

Yes. But not every PostgreSQL instance is monitored in that way.

I guess that just reinforces the idea that it should be optional.
Perhaps there's room for some sort of general feature for controlling
various time series aggregates like max(), min(), sum() or, uhm,
timeoflastchange() on whatever stats you want. That would let us
remove a bunch of stuff from pg_stat_statements and let users turn on
just the ones they want. And also let users enable things like time of
last rollback or conflict etc. But that's just something to think
about down the road.

It's certainly an interesting idea.

The attached patch against HEAD adds optional tracking of the last scan time for relations. It updates pg_stat_*_tables with new last_seq_scan and last_idx_scan columns, and pg_stat_*_indexes with a last_idx_scan column to help with this.

Due to the use of gettimeofday(), those values are only maintained if a new GUC, track_scans, is set to on. By default, it is off.

Bikeshedding warning -- "track_scans" could equally apply to almost
any stats about scans. I think the really relevant thing here is the
times, not the scans. I think the GUC should be "track_scan_times". Or
could that still be confused with scan durations? Maybe
"track_scan_timestamps"?

The latter seems reasonable.

You could maybe make the gettimeofday cheaper by doing it less often.
Like, skipping the increment if the old timestamp is newer than 1s
before the transaction start time (I think that's available free if
some other guc is enabled but I don't recall). Or isn't this cb
normally happening after transaction end? So xactStopTimestamp might
be available already?

Something like:

if (pgstat_track_scan_timestamps && lstats->t_counts.t_numscans &&
    tabentry->lastscan + USECS_PER_SEC < GetCurrentTransactionStopTimestamp())
    tabentry->lastscan = GetCurrentTimestamp();

?

--
Dave Page
Blog: https://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: https://www.enterprisedb.com

#4 Bruce Momjian <bruce@momjian.us>
In reply to: Dave Page (#1)
Re: Tracking last scan time

On Tue, Aug 23, 2022 at 10:55:09AM +0100, Dave Page wrote:

Often it is beneficial to review one's schema with a view to removing indexes
(and sometimes tables) that are no longer required. It's very difficult to
understand when that is the case by looking at the number of scans of a
relation as, for example, an index may be used infrequently but may be critical
in those times when it is used.

The attached patch against HEAD adds optional tracking of the last scan time
for relations. It updates pg_stat_*_tables with new last_seq_scan and
last_idx_scan columns, and pg_stat_*_indexes with a last_idx_scan column to
help with this.

Would it be simpler to allow the sequential and index scan columns to be
cleared so you can look later to see if it is non-zero? Should we allow
arbitrary clearing of stat columns?

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson

#5 Dave Page <dpage@pgadmin.org>
In reply to: Bruce Momjian (#4)
Re: Tracking last scan time

On Wed, 24 Aug 2022 at 15:18, Bruce Momjian <bruce@momjian.us> wrote:

On Tue, Aug 23, 2022 at 10:55:09AM +0100, Dave Page wrote:

Often it is beneficial to review one's schema with a view to removing indexes (and sometimes tables) that are no longer required. It's very difficult to understand when that is the case by looking at the number of scans of a relation as, for example, an index may be used infrequently but may be critical in those times when it is used.

The attached patch against HEAD adds optional tracking of the last scan time for relations. It updates pg_stat_*_tables with new last_seq_scan and last_idx_scan columns, and pg_stat_*_indexes with a last_idx_scan column to help with this.

Would it be simpler to allow the sequential and index scan columns to be
cleared so you can look later to see if it is non-zero? Should we allow
arbitrary clearing of stat columns?

I don't think so, because then stat values wouldn't necessarily correlate
with each other, and you wouldn't know when any of them were last reset
unless we started tracking each individual reset. At least now you can see
when they were all reset, and you know they were reset at the same time.

--
Dave Page
Blog: https://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: https://www.enterprisedb.com

#6 Bruce Momjian <bruce@momjian.us>
In reply to: Dave Page (#5)
Re: Tracking last scan time

On Wed, Aug 24, 2022 at 04:01:21PM +0100, Dave Page wrote:

On Wed, 24 Aug 2022 at 15:18, Bruce Momjian <bruce@momjian.us> wrote:

On Tue, Aug 23, 2022 at 10:55:09AM +0100, Dave Page wrote:

Often it is beneficial to review one's schema with a view to removing indexes (and sometimes tables) that are no longer required. It's very difficult to understand when that is the case by looking at the number of scans of a relation as, for example, an index may be used infrequently but may be critical in those times when it is used.

The attached patch against HEAD adds optional tracking of the last scan time for relations. It updates pg_stat_*_tables with new last_seq_scan and last_idx_scan columns, and pg_stat_*_indexes with a last_idx_scan column to help with this.

Would it be simpler to allow the sequential and index scan columns to be cleared so you can look later to see if it is non-zero? Should we allow arbitrary clearing of stat columns?

I don't think so, because then stat values wouldn't necessarily correlate with
each other, and you wouldn't know when any of them were last reset unless we
started tracking each individual reset. At least now you can see when they were
all reset, and you know they were reset at the same time.

Yeah, true. I was more asking if these two columns are in some way
special or if people would want a more general solution, and if so, is
that something we want in core Postgres.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson

#7 Dave Page <dpage@pgadmin.org>
In reply to: Bruce Momjian (#6)
Re: Tracking last scan time

On Wed, 24 Aug 2022 at 16:03, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Aug 24, 2022 at 04:01:21PM +0100, Dave Page wrote:

On Wed, 24 Aug 2022 at 15:18, Bruce Momjian <bruce@momjian.us> wrote:

On Tue, Aug 23, 2022 at 10:55:09AM +0100, Dave Page wrote:

Often it is beneficial to review one's schema with a view to removing indexes (and sometimes tables) that are no longer required. It's very difficult to understand when that is the case by looking at the number of scans of a relation as, for example, an index may be used infrequently but may be critical in those times when it is used.

The attached patch against HEAD adds optional tracking of the last scan time for relations. It updates pg_stat_*_tables with new last_seq_scan and last_idx_scan columns, and pg_stat_*_indexes with a last_idx_scan column to help with this.

Would it be simpler to allow the sequential and index scan columns to be cleared so you can look later to see if it is non-zero? Should we allow arbitrary clearing of stat columns?

I don't think so, because then stat values wouldn't necessarily correlate with each other, and you wouldn't know when any of them were last reset unless we started tracking each individual reset. At least now you can see when they were all reset, and you know they were reset at the same time.

Yeah, true. I was more asking if these two columns are in some way
special or if people would want a more general solution, and if so, is
that something we want in core Postgres.

They're special in the sense that they're the ones you're most likely to look at to see how much a relation is used, I think (at least, I'd look at them rather than the tuple counts).

There are certainly other things for which a last usage value may be useful. Functions/procedures for example, or views. The benefits of removing unused objects of those types are, of course, far, far lower than for indexes or tables.

There are other potential use cases for similar timestamps, such as object
creation times (and creating user), but they are more useful for auditing
than monitoring and optimisation.

--
Dave Page
Blog: https://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: https://www.enterprisedb.com

#8 David Rowley <dgrowleyml@gmail.com>
In reply to: Bruce Momjian (#6)
Re: Tracking last scan time

On Thu, 25 Aug 2022 at 03:03, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Aug 24, 2022 at 04:01:21PM +0100, Dave Page wrote:

On Wed, 24 Aug 2022 at 15:18, Bruce Momjian <bruce@momjian.us> wrote:
Would it be simpler to allow the sequential and index scan columns to be cleared so you can look later to see if it is non-zero? Should we allow arbitrary clearing of stat columns?

I don't think so, because then stat values wouldn't necessarily correlate with
each other, and you wouldn't know when any of them were last reset unless we
started tracking each individual reset. At least now you can see when they were
all reset, and you know they were reset at the same time.

Yeah, true. I was more asking if these two columns are in some way
special or if people would want a more general solution, and if so, is
that something we want in core Postgres.

Back when I used to do a bit of PostgreSQL DBA stuff, I had a nightly
job set up to record the state of pg_stat_all_tables and put that into
another table along with the current date. I then had a view that did
some calculations with col - LAG(col) OVER (PARTITION BY relid ORDER
BY date) to fetch the numerical values for each date. I didn't ever
want to reset the stats because it messes with autovacuum. If you zero
out n_ins_since_vacuum more often than autovacuum would trigger, then
bad things happen over time (we should really warn about that in the
docs).

I don't have a particular opinion about the patch, I'm just pointing
out that there are other ways. Even just writing down the numbers on a
post-it note and coming back in a month to see if they've changed is
enough to tell if the table or index has been used.

We do also need to consider, now that stats are stored in shared
memory, that any fields we add are in RAM.

David

#9 Dave Page <dpage@pgadmin.org>
In reply to: David Rowley (#8)
Re: Tracking last scan time

Hi

On Thu, 25 Aug 2022 at 01:44, David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 25 Aug 2022 at 03:03, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Aug 24, 2022 at 04:01:21PM +0100, Dave Page wrote:

On Wed, 24 Aug 2022 at 15:18, Bruce Momjian <bruce@momjian.us> wrote:
Would it be simpler to allow the sequential and index scan columns to be cleared so you can look later to see if it is non-zero? Should we allow arbitrary clearing of stat columns?

I don't think so, because then stat values wouldn't necessarily correlate with each other, and you wouldn't know when any of them were last reset unless we started tracking each individual reset. At least now you can see when they were all reset, and you know they were reset at the same time.

Yeah, true. I was more asking if these two columns are in some way
special or if people would want a more general solution, and if so, is
that something we want in core Postgres.

Back when I used to do a bit of PostgreSQL DBA stuff, I had a nightly
job setup to record the state of pg_stat_all_tables and put that into
another table along with the current date. I then had a view that did
some calculations with col - LAG(col) OVER (PARTITION BY relid ORDER
BY date) to fetch the numerical values for each date. I didn't ever
want to reset the stats because it messes with autovacuum. If you zero
out n_ins_since_vacuum more often than auto-vacuum would trigger, then
bad things happen over time (we should really warn about that in the
docs).

I don't have a particular opinion about the patch, I'm just pointing
out that there are other ways. Even just writing down the numbers on a
post-it note and coming back in a month to see if they've changed is
enough to tell if the table or index has been used.

There are usually other ways to perform monitoring tasks, but there is
something to be said for the convenience of having functionality built in
and not having to rely on tools, scripts, or post-it notes :-)

We do also need to consider, now that stats are stored in shared memory,
that any fields we add are in RAM.

That is a fair point. I believe this is both minimal and useful, though.

I've attached a v2 patch that incorporates Greg's suggestions.

--
Dave Page
Blog: https://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: https://www.enterprisedb.com

Attachments:

last_scan_v2.diff (application/octet-stream, +116 -1)
#10 Bruce Momjian <bruce@momjian.us>
In reply to: Dave Page (#9)
Re: Tracking last scan time

On Fri, Aug 26, 2022 at 02:05:36PM +0100, Dave Page wrote:

On Thu, 25 Aug 2022 at 01:44, David Rowley <dgrowleyml@gmail.com> wrote:
I don't have a particular opinion about the patch, I'm just pointing
out that there are other ways. Even just writing down the numbers on a
post-it note and coming back in a month to see if they've changed is
enough to tell if the table or index has been used.

There are usually other ways to perform monitoring tasks, but there is
something to be said for the convenience of having functionality built in and
not having to rely on tools, scripts, or post-it notes :-)

Should we consider using something cheaper like time() so we don't need
a GUC to enable this?

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson

#11 Dave Page <dpage@pgadmin.org>
In reply to: Bruce Momjian (#10)
Re: Tracking last scan time

On Tue, 30 Aug 2022 at 19:46, Bruce Momjian <bruce@momjian.us> wrote:

On Fri, Aug 26, 2022 at 02:05:36PM +0100, Dave Page wrote:

On Thu, 25 Aug 2022 at 01:44, David Rowley <dgrowleyml@gmail.com> wrote:
I don't have a particular opinion about the patch, I'm just pointing out that there are other ways. Even just writing down the numbers on a post-it note and coming back in a month to see if they've changed is enough to tell if the table or index has been used.

There are usually other ways to perform monitoring tasks, but there is something to be said for the convenience of having functionality built in and not having to rely on tools, scripts, or post-it notes :-)

Should we consider using something cheaper like time() so we don't need
a GUC to enable this?

Interesting idea, but on my mac at least, 100,000,000 gettimeofday() calls
takes about 2 seconds, whilst 100,000,000 time() calls takes 14(!) seconds.
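That comparison is easy to reproduce with a small loop along these lines. This is a rough sketch (the function name and the parameterised call count are mine, not from the thread); absolute numbers will vary by OS, hardware, and kernel time source.

```c
#include <sys/time.h>
#include <time.h>

/* Return wall-clock seconds elapsed while calling gettimeofday()
 * (mode 0) or time() (mode 1) n times in a tight loop. */
static double
bench_clock_calls(int mode, long n)
{
    struct timeval start, end, tv;
    time_t t = 0;

    gettimeofday(&start, NULL);
    for (long i = 0; i < n; i++)
    {
        if (mode == 0)
            gettimeofday(&tv, NULL);
        else
            t = time(NULL);
    }
    gettimeofday(&end, NULL);
    (void) tv;
    (void) t;
    return (end.tv_sec - start.tv_sec) +
           (end.tv_usec - start.tv_usec) / 1e6;
}
```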

--
Dave Page
Blog: https://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: https://www.enterprisedb.com

#12 Bruce Momjian <bruce@momjian.us>
In reply to: Dave Page (#11)
Re: Tracking last scan time

On Wed, Aug 31, 2022 at 05:02:33PM +0100, Dave Page wrote:

On Tue, 30 Aug 2022 at 19:46, Bruce Momjian <bruce@momjian.us> wrote:

On Fri, Aug 26, 2022 at 02:05:36PM +0100, Dave Page wrote:

On Thu, 25 Aug 2022 at 01:44, David Rowley <dgrowleyml@gmail.com> wrote:

I don't have a particular opinion about the patch, I'm just pointing out that there are other ways. Even just writing down the numbers on a post-it note and coming back in a month to see if they've changed is enough to tell if the table or index has been used.

There are usually other ways to perform monitoring tasks, but there is something to be said for the convenience of having functionality built in and not having to rely on tools, scripts, or post-it notes :-)

Should we consider using something cheaper like time() so we don't need
a GUC to enable this?

Interesting idea, but on my mac at least, 100,000,000 gettimeofday() calls
takes about 2 seconds, whilst 100,000,000 time() calls takes 14(!) seconds.

Wow. I was just thinking you need second-level accuracy, which must be
cheap somewhere.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson

#13 Andres Freund <andres@anarazel.de>
In reply to: Dave Page (#1)
Re: Tracking last scan time

Hi,

On 2022-08-23 10:55:09 +0100, Dave Page wrote:

Often it is beneficial to review one's schema with a view to removing
indexes (and sometimes tables) that are no longer required. It's very
difficult to understand when that is the case by looking at the number of
scans of a relation as, for example, an index may be used infrequently but
may be critical in those times when it is used.

The attached patch against HEAD adds optional tracking of the last scan
time for relations. It updates pg_stat_*_tables with new last_seq_scan and
last_idx_scan columns, and pg_stat_*_indexes with a last_idx_scan column to
help with this.

Due to the use of gettimeofday(), those values are only maintained if a new
GUC, track_scans, is set to on. By default, it is off.

I did run a 12 hour test to see what the performance impact is. pgbench was
run with scale factor 10000 and 75 users across 4 identical bare metal
machines running Rocky 8 in parallel which showed roughly a -2% average
performance penalty against HEAD with track_scans enabled. Machines were
PowerEdge R7525's with 128GB RAM, dual 16C/32T AMD 7302 CPUs, with the data
directory on 6 x 800GB 12Gb/s SSD SAS drives in RAID 0. Kernel time source
is tsc.

          HEAD          track_scans    Penalty (%)
box1      19582.49735   19341.8881     -1.22869541
box2      19936.55513   19928.07479    -0.04253664659
box3      19631.78895   18649.64379    -5.002830696
box4      19810.86767   19420.67192    -1.969604525
Average   19740.42728   19335.06965    -2.05343896

Based on the size of those numbers this was a r/w pgbench. If it has this
noticeable an impact for r/w, with a pretty low number of scans/sec, how's the
overhead for r/o (which can have 2 orders of magnitude more scans/sec)? It
must be quite bad.

I don't think we should accept this feature with this overhead - but I also
think we can do better, by accepting a bit less accuracy. For this to be
useful we don't need a perfectly accurate timestamp. The statement start time
is probably not accurate enough, but we could just have bgwriter or such
update one in shared memory every time we wake up? Or perhaps we could go to
an even lower granularity, by putting in the current LSN or such?
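One shape the shared-memory idea could take, sketched outside Postgres: a background process (bgwriter, say) refreshes a coarse shared timestamp each time it wakes, and the hot path reads it with a plain atomic load instead of calling gettimeofday() itself. The names here are illustrative, not anything from the patch, and in the real thing the value would live in Postgres shared memory.

```c
#include <stdatomic.h>
#include <time.h>

/* Coarse shared clock: written rarely, read cheaply. */
static _Atomic long coarse_now;     /* seconds since the epoch */

/* Called periodically by the background process (the "bgwriter"
 * in this sketch) -- the only place that pays for a clock lookup. */
static void
coarse_clock_tick(void)
{
    atomic_store(&coarse_now, (long) time(NULL));
}

/* Cheap read used on the hot path; accuracy is bounded by how
 * often the background process ticks. */
static long
coarse_timestamp(void)
{
    return atomic_load(&coarse_now);
}
```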

Greetings,

Andres Freund

#14 Matthias van de Meent <boekewurm+postgres@gmail.com>
In reply to: Andres Freund (#13)
Re: Tracking last scan time

On Wed, 31 Aug 2022 at 18:21, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-08-23 10:55:09 +0100, Dave Page wrote:

Often it is beneficial to review one's schema with a view to removing
indexes (and sometimes tables) that are no longer required. It's very
difficult to understand when that is the case by looking at the number of
scans of a relation as, for example, an index may be used infrequently but
may be critical in those times when it is used.

The attached patch against HEAD adds optional tracking of the last scan
time for relations. It updates pg_stat_*_tables with new last_seq_scan and
last_idx_scan columns, and pg_stat_*_indexes with a last_idx_scan column to
help with this.

Due to the use of gettimeofday(), those values are only maintained if a new
GUC, track_scans, is set to on. By default, it is off.

I did run a 12 hour test to see what the performance impact is. pgbench was
run with scale factor 10000 and 75 users across 4 identical bare metal
machines running Rocky 8 in parallel which showed roughly a -2% average
performance penalty against HEAD with track_scans enabled. Machines were
PowerEdge R7525's with 128GB RAM, dual 16C/32T AMD 7302 CPUs, with the data
directory on 6 x 800GB 12Gb/s SSD SAS drives in RAID 0. Kernel time source
is tsc.

          HEAD          track_scans    Penalty (%)
box1      19582.49735   19341.8881     -1.22869541
box2      19936.55513   19928.07479    -0.04253664659
box3      19631.78895   18649.64379    -5.002830696
box4      19810.86767   19420.67192    -1.969604525
Average   19740.42728   19335.06965    -2.05343896

Based on the size of those numbers this was a r/w pgbench. If it has this
noticeable an impact for r/w, with a pretty low number of scans/sec, how's the
overhead for r/o (which can have 2 orders of magnitude more scans/sec)? It
must be quite bad.

I don't think we should accept this feature with this overhead - but I also
think we can do better, by accepting a bit less accuracy. For this to be
useful we don't need a perfectly accurate timestamp. The statement start time
is probably not accurate enough, but we could just have bgwriter or such
update one in shared memory every time we wake up? Or perhaps we could go to
an even lower granularity, by putting in the current LSN or such?

I don't think that LSN is precise enough. For example, if you're in a
(mostly) read-only system, the system may go long times without any
meaningful records being written.

As for having a lower granularity and preventing the
one-syscall-per-Relation issue, can't we reuse the query_start or
state_change timestamps that appear in pg_stat_activity (potentially
updated immediately before this stat flush), or some other per-backend
timestamp that is already maintained and considered accurate enough
for this use?
Regardless, with this patch as it is we get a new timestamp for each
relation processed, which I think is a waste of time (heh) even in
VDSO-enabled systems.

Apart from the above, I don't have any other meaningful opinion on
this patch - it might be a good addition, but I don't consume stats
often enough to make a good cost / benefit comparison.

Kind regards,

Matthias van de Meent

#15 Bruce Momjian <bruce@momjian.us>
In reply to: Matthias van de Meent (#14)
Re: Tracking last scan time

On Wed, Aug 31, 2022 at 07:52:49PM +0200, Matthias van de Meent wrote:

As for having a lower granularity and preventing the
one-syscall-per-Relation issue, can't we reuse the query_start or
state_change timestamps that appear in pg_stat_activity (potentially

Yeah, query start should be fine, but not transaction start time.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson

#16 Andres Freund <andres@anarazel.de>
In reply to: Matthias van de Meent (#14)
Re: Tracking last scan time

Hi,

On 2022-08-31 19:52:49 +0200, Matthias van de Meent wrote:

As for having a lower granularity and preventing the
one-syscall-per-Relation issue, can't we reuse the query_start or
state_change timestamps that appear in pg_stat_activity (potentially
updated immediately before this stat flush), or some other per-backend
timestamp that is already maintained and considered accurate enough
for this use?

The problem is that it won't change at all for a query that runs for a week -
and we'll report the timestamp from a week ago when it finally ends.

But given this is done when stats are flushed, which only happens after the
transaction ended, we can just use GetCurrentTransactionStopTimestamp() - if
we got to flushing the transaction stats we'll already have computed that.

tabentry->numscans += lstats->t_counts.t_numscans;
+	if (pgstat_track_scans && lstats->t_counts.t_numscans)
+		tabentry->lastscan = GetCurrentTimestamp();

Besides replacing GetCurrentTimestamp() with
GetCurrentTransactionStopTimestamp(), this should then also check if
tabentry->lastscan is already newer than the new timestamp.

Greetings,

Andres Freund

#17 Bruce Momjian <bruce@momjian.us>
In reply to: Andres Freund (#16)
Re: Tracking last scan time

On Wed, Aug 31, 2022 at 11:56:29AM -0700, Andres Freund wrote:

Hi,

On 2022-08-31 19:52:49 +0200, Matthias van de Meent wrote:

As for having a lower granularity and preventing the
one-syscall-per-Relation issue, can't we reuse the query_start or
state_change timestamps that appear in pg_stat_activity (potentially
updated immediately before this stat flush), or some other per-backend
timestamp that is already maintained and considered accurate enough
for this use?

The problem is that it won't change at all for a query that runs for a week -
and we'll report the timestamp from a week ago when it finally ends.

But given this is done when stats are flushed, which only happens after the
transaction ended, we can just use GetCurrentTransactionStopTimestamp() - if
we got to flushing the transaction stats we'll already have computed that.

Oh, good point --- it is safer to show a more recent time than a too-old
time.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson

#18 Dave Page <dpage@pgadmin.org>
In reply to: Bruce Momjian (#12)
Re: Tracking last scan time

On Wed, 31 Aug 2022 at 17:13, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Aug 31, 2022 at 05:02:33PM +0100, Dave Page wrote:

On Tue, 30 Aug 2022 at 19:46, Bruce Momjian <bruce@momjian.us> wrote:

On Fri, Aug 26, 2022 at 02:05:36PM +0100, Dave Page wrote:

On Thu, 25 Aug 2022 at 01:44, David Rowley <dgrowleyml@gmail.com> wrote:

I don't have a particular opinion about the patch, I'm just pointing out that there are other ways. Even just writing down the numbers on a post-it note and coming back in a month to see if they've changed is enough to tell if the table or index has been used.

There are usually other ways to perform monitoring tasks, but there is something to be said for the convenience of having functionality built in and not having to rely on tools, scripts, or post-it notes :-)

Should we consider using something cheaper like time() so we don't need a GUC to enable this?

Interesting idea, but on my mac at least, 100,000,000 gettimeofday() calls takes about 2 seconds, whilst 100,000,000 time() calls takes 14(!) seconds.

Wow. I was just thinking you need second-level accuracy, which must be
cheap somewhere.

Second-level accuracy would indeed be fine for this. Frankly, for my use
case just the date would be enough, but I can imagine people wanting
greater accuracy than that.

And yes, I was very surprised by the timing results I got as well. I guess
it's a quirk of macOS - on a Linux box I get ~4s for gettimeofday() and ~1s
for time().

--
Dave Page
Blog: https://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: https://www.enterprisedb.com

#19 Bruce Momjian <bruce@momjian.us>
In reply to: Dave Page (#18)
Re: Tracking last scan time

On Thu, Sep 1, 2022 at 09:46:59AM +0100, Dave Page wrote:

On Wed, 31 Aug 2022 at 17:13, Bruce Momjian <bruce@momjian.us> wrote:
Wow.  I was just thinking you need second-level accuracy, which must be
cheap somewhere.

Second-level accuracy would indeed be fine for this. Frankly, for my use case just the date would be enough, but I can imagine people wanting greater accuracy than that.

And yes, I was very surprised by the timing results I got as well. I guess it's a quirk of macOS - on a Linux box I get ~4s for gettimeofday() and ~1s for time().

I think we lose 95% of our users if we require it to be enabled, so let's
work to find a way it can be always enabled.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson

#20 Dave Page <dpage@pgadmin.org>
In reply to: Bruce Momjian (#19)
Re: Tracking last scan time

On Thu, 1 Sept 2022 at 13:04, Bruce Momjian <bruce@momjian.us> wrote:

On Thu, Sep 1, 2022 at 09:46:59AM +0100, Dave Page wrote:

On Wed, 31 Aug 2022 at 17:13, Bruce Momjian <bruce@momjian.us> wrote:
Wow. I was just thinking you need second-level accuracy, which must be cheap somewhere.

Second-level accuracy would indeed be fine for this. Frankly, for my use case just the date would be enough, but I can imagine people wanting greater accuracy than that.

And yes, I was very surprised by the timing results I got as well. I guess it's a quirk of macOS - on a Linux box I get ~4s for gettimeofday() and ~1s for time().

I think we lose 95% of our users if we require it to be enabled, so let's
work to find a way it can be always enabled.

So based on Andres' suggestion, something like this seems like it might
work:

if (pgstat_track_scan_timestamps && lstats->t_counts.t_numscans)
{
    TimestampTz t = GetCurrentTransactionStopTimestamp();

    if (t > tabentry->lastscan)
        tabentry->lastscan = t;
}

If that seems like a good option, I can run some more benchmarks (and then
remove the GUC if it looks good).

--
Dave Page
Blog: https://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: https://www.enterprisedb.com

#21 Matthias van de Meent <boekewurm+postgres@gmail.com>, in reply to Andres Freund (#16)
#22 Andres Freund <andres@anarazel.de>, in reply to Matthias van de Meent (#21)
#23 Dave Page <dpage@pgadmin.org>, in reply to Andres Freund (#22)
#24 Andres Freund <andres@anarazel.de>, in reply to Dave Page (#23)
#25 Kyotaro Horiguchi <horikyota.ntt@gmail.com>, in reply to Andres Freund (#24)
#26 Dave Page <dpage@pgadmin.org>, in reply to Andres Freund (#24)
#27 Vik Fearing <vik@postgresfriends.org>, in reply to Dave Page (#26)
#28 Andres Freund <andres@anarazel.de>, in reply to Vik Fearing (#27)
#29 Dave Page <dpage@pgadmin.org>, in reply to Andres Freund (#28)
#30 Michael Paquier <michael@paquier.xyz>, in reply to Dave Page (#29)
#31 Dave Page <dpage@pgadmin.org>, in reply to Michael Paquier (#30)
#32 Michael Paquier <michael@paquier.xyz>, in reply to Dave Page (#31)
#33 Andres Freund <andres@anarazel.de>, in reply to Michael Paquier (#30)
#34 Andres Freund <andres@anarazel.de>, in reply to Andres Freund (#33)
#35 Dave Page <dpage@pgadmin.org>, in reply to Andres Freund (#34)
#36 Andres Freund <andres@anarazel.de>, in reply to Dave Page (#35)
#37 Dave Page <dpage@pgadmin.org>, in reply to Andres Freund (#36)
#38 Robert Treat <xzilla@users.sourceforge.net>, in reply to Dave Page (#37)
#39 Dave Page <dpage@pgadmin.org>, in reply to Robert Treat (#38)
#40 Dave Page <dpage@pgadmin.org>, in reply to Dave Page (#39)
#41 Michael Paquier <michael@paquier.xyz>, in reply to Dave Page (#40)
#42 Michael Paquier <michael@paquier.xyz>, in reply to Michael Paquier (#41)
#43 Dave Page <dpage@pgadmin.org>, in reply to Michael Paquier (#42)