pb with big volumes

Started by Marc Millas, over 2 years ago. 9 messages, in general.
#1Marc Millas
marc.millas@mokadb.com

Hi,

I have a 15 TB db on Postgres 14 (soon 15).
shared_buffers is 32 GB.

It's a db with max 15 users, often fewer, and currently 1 or 2.
The biggest table has 133 partitions of 150M to 200M+ rows each.
Lots of requests explicitly access one of those partitions.

When I, alone, run a query "reading" 15M buffers, it takes 20 minutes
(+/- 5 minutes). Inside the query there are 2 GROUP BYs on a 200M-row
partition, with all the rows included in each GROUP BY.
When a colleague runs the same kind of request (not the same request, but
something reading roughly the same volume), on a different set of data,
his request completes in less than half an hour.
If we run our requests simultaneously... my request takes hours, around 3
hours.

My supposition is that it's some kind of "pumping" effect in the
cache.

I cannot get access to the underlying OS. I can, for sure, do some COPY xx
FROM PROGRAM 'some command', but it's a container with very limited
possibilities, not even 'ps'.
So I would like to monitor from inside the db (so without iostat and the
like) the volume of reads that Postgres does against the OS.
I did activate track_io_timing, but the volumes I get in the EXPLAIN
(ANALYZE, BUFFERS) output are roughly the same whether I run alone or not
(the 15M buffers mentioned above).
To my understanding, the volumes shown in pg_stat_database are the useful
ones, i.e. counted once even if the db has to read the same data from disk
more than once. True or false?
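
For reference, the counters I can read from inside the db come from
pg_stat_database, with something like this (a sketch; blk_read_time is
only populated when track_io_timing is on):

```sql
-- Snapshot the cumulative I/O counters for the current database.
-- blks_read counts blocks requested from the OS (which may be served
-- by the kernel page cache, not necessarily the disk); blks_hit counts
-- blocks found in shared_buffers.
SELECT blks_read,
       blks_hit,
       blk_read_time,   -- ms spent reading; needs track_io_timing = on
       blk_write_time
FROM   pg_stat_database
WHERE  datname = current_database();
```

Sampling this before and after a run and diffing the two snapshots gives
the volume read from the OS over that interval.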

So... either my supposition is not correct, and I will read other ideas
with a lot of interest, or it is correct, and I would like to know how to
monitor this (in the current context, installing a dedicated extension is
not impossible, but it is a very tedious process).

Thanks for your help :-)

regards,

PS: I know that providing the complete data model and the exact requests
can be considered mandatory, but when I change the request I get the very
same behaviour...

Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com

#2Ron
ronljohnsonjr@gmail.com
In reply to: Marc Millas (#1)
Re: pb with big volumes

On 8/10/23 16:36, Marc Millas wrote:

Hi,

I have a 15 TB db on postgres 14 (soon 15).
shared buffers is 32 GB.

Does the system have 128GB RAM?

It's a db with max 15 users, often fewer, and currently 1 or 2.
The biggest table has 133 partitions of 150M to 200M+ rows each.
Lots of requests explicitly access one of those partitions.

When I, alone, run a query "reading" 15M buffers, it takes 20 minutes
(+/- 5 minutes). Inside the query there are 2 GROUP BYs on a 200M-row
partition, with all the rows included in each GROUP BY.
When a colleague runs the same kind of request (not the same request, but
something reading roughly the same volume), on a different set of data,
his request completes in less than half an hour.
If we run our requests simultaneously... my request takes hours, around 3
hours.

My supposition is that it's some kind of "pumping" effect in the
cache.

What is work_mem set to?

When were the tables last vacuumed and analyzed?

Good index support?  (But that might not matter if every row in the table is
in the GROUP BY.)

Clustering the tables and using BRIN indices might help.
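
Both of the above can be checked from psql, e.g. (a sketch;
pg_stat_user_tables only covers the current database, and last_vacuum
stays NULL if only autovacuum ever ran):

```sql
SHOW work_mem;

-- When were the partitions last vacuumed and analyzed?
SELECT relname,
       last_vacuum, last_autovacuum,
       last_analyze, last_autoanalyze
FROM   pg_stat_user_tables
ORDER  BY coalesce(last_autovacuum, last_vacuum) NULLS FIRST;
```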


--
Born in Arizona, moved to Babylonia.

#3Adam Scott
adam.c.scott@gmail.com
In reply to: Marc Millas (#1)
Re: pb with big volumes

I think your concern is that 20 min + 30 min does not equal 3 hours.

It might be natural to think the contention would, at max, be 50 min x 2 (1
hr 40 min).

So what's going on?

It seems disk I/O is the primary suspect, since you're hinting at an
iostat replacement inside of Postgres.

If it is due to disk I/O, the best fix will be faster storage, e.g.
striped (RAID 0) SSDs.

Consider looking at io stats on the container's persistent volumes.

What is the pipe connecting the database server to the disks? If it's NAS,
well that would explain it.

HTH,
Adam


#4Ron
ronljohnsonjr@gmail.com
In reply to: Adam Scott (#3)
Re: pb with big volumes

Wouldn't IO contention make for additive timings instead of exponential?

--
Born in Arizona, moved to Babylonia.

#5David Rowley
dgrowleyml@gmail.com
In reply to: Ron (#4)
Re: pb with big volumes

On Fri, 11 Aug 2023 at 13:54, Ron <ronljohnsonjr@gmail.com> wrote:

Wouldn't IO contention make for additive timings instead of exponential?

No, not necessarily. Imagine one query running that's doing a
parameterised nested loop join, resulting in the index on the inner
side being descended several, say, million times. Let's say there's
*just* enough RAM/shared buffers so that once the index is scanned the
first time, all the required pages are cached, which results in no I/O
on subsequent index scans. Now, imagine another similar query but with
another index, and let's say this index also *just* fits in cache. When
these two queries run concurrently, they each evict buffers the other
one uses. Of course, the shared buffers code is written in such a way
as to try to evict lesser-used buffers first, but if they're all used
about the same amount, then this sort of thing can occur. The slowdown
isn't linear.
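
If installing a contrib extension is feasible, this eviction pattern can
be observed directly with pg_buffercache, along these lines (a sketch):

```sql
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Which relations currently occupy shared_buffers, and how "hot"
-- are their buffers on average?
SELECT c.relname,
       count(*)                    AS buffers,
       round(avg(b.usagecount), 2) AS avg_usage
FROM   pg_buffercache b
JOIN   pg_class c
       ON  b.relfilenode = pg_relation_filenode(c.oid)
       AND b.reldatabase = (SELECT oid FROM pg_database
                            WHERE datname = current_database())
GROUP  BY c.relname
ORDER  BY buffers DESC
LIMIT  10;
```

Running this while both queries are active would show whether their
working sets are fighting over the same buffers.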

I've no idea if this is happening for the reported case. I'm just
saying that it can happen. The OP should really post the results of:
SET track_io_timing = ON; EXPLAIN (ANALYZE, BUFFERS) for both queries
running independently then again when they run concurrently.
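
Concretely, something like the following (the table and grouping column
are placeholders for the real query):

```sql
SET track_io_timing = ON;

EXPLAIN (ANALYZE, BUFFERS)
SELECT some_col, count(*)   -- placeholder for the real GROUP BY query
FROM   the_partition
GROUP  BY some_col;
```

With track_io_timing on, the plan nodes report "I/O Timings: read=..."
alongside the buffer counts, and that timing is the part worth comparing
between the solo and concurrent runs.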

David

#6Marc Millas
marc.millas@mokadb.com
In reply to: Ron (#4)
Re: pb with big volumes

No, not if there is a pumping effect, i.e. data loaded into the cache is
flushed by the other request, again and again.
I have had this on an Oracle db years ago: iostat at 100% for days.
We managed it by partitioning and pinning some data in the cache.


#7Ron
ronljohnsonjr@gmail.com
In reply to: David Rowley (#5)
Re: pb with big volumes


But that's cache thrashing (which was OP's concern), not IO contention.

--
Born in Arizona, moved to Babylonia.

#8Marc Millas
marc.millas@mokadb.com
In reply to: David Rowley (#5)
Re: pb with big volumes

Hi David,

That's exactly my question: does the ANALYZE BUFFERS data, generated when
track_io_timing is on, keep track of multiple reloads of the same data
while executing one operation?

I'll do the test ASAP and report the results.

Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com


#9David Rowley
dgrowleyml@gmail.com
In reply to: Marc Millas (#8)
Re: pb with big volumes

On Mon, 14 Aug 2023 at 11:14, Marc Millas <marc.millas@mokadb.com> wrote:

that's exactly my question.
does the analyze buffers data, generated when track_io_timing is on, keep track of multiple reloads of the same data while executing one operation ?

Yes, the timing for reads will include the time it took to fetch any
buffer that wasn't found in shared buffers. Some of those may come
quickly from the kernel's page cache, some might come from disk. If
some other running query has evicted a buffer that the query has
previously used, then that's going to cause another pread, which will
be timed by track_io_timing and added to the count of buffers read in
the "BUFFERS" EXPLAIN output.

So, the BUFFERS EXPLAIN option showing similar amounts of reads
between the query running without the concurrent query and with the
concurrent query does not necessarily mean more buffers had to be
loaded from disk, just that fewer were found in shared buffers. The
amount of time doing I/O as shown by track_io_timing is going to be
more interesting as that's really the only indication from within
PostgreSQL that you have to get an idea of if the buffers are coming
from the kernel's cache or from disk. You'll probably want to
calculate the average time it took to get 1 buffer for each query to
make sense of that.
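
One way to get that average from inside the database is to diff
pg_stat_database snapshots taken around each run (a sketch;
blk_read_time is in milliseconds, and nullif guards against division
by zero):

```sql
-- Run before and after the query, then subtract the two snapshots.
SELECT blk_read_time / nullif(blks_read, 0) AS avg_ms_per_block_read
FROM   pg_stat_database
WHERE  datname = current_database();
```

A solo-run average near page-cache cost versus a concurrent-run average
near disk latency would support the eviction theory.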

David