Keeping temporary tables in shared buffers
Hello
We are evaluating the use of shared buffers for temporary tables. The
advantage being queries involving temporary tables can make use of parallel
workers.
Challenges:
1. We lose the performance benefit of local buffers.
2. Additional complexity in shared buffer manager - BufferTag needs to
include backend ID to distinguish 'my' temp buffers from non-temp or
'others' temp buffers (a sketch of such a widened tag follows below).
3. Checkpointer needs to skip them for fsync but bgwriter needs to write
them out.
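For reference, the buffer tag as it exists today in
src/include/storage/buf_internals.h, plus the field this proposal would
add. The BackendId member below is the proposed change, not existing code:

    typedef struct buftag
    {
        RelFileNode rnode;      /* physical relation identifier */
        ForkNumber  forkNum;    /* fork within the relation */
        BlockNumber blockNum;   /* block number within the fork */
        BackendId   backend;    /* proposed addition: owning backend for
                                 * temp relations, InvalidBackendId for
                                 * everything else */
    } BufferTag;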
Has someone already thought about this? Any workloads that might benefit
from this idea?
Asim and David
Asim Praveen <apraveen@pivotal.io> writes:
We are evaluating the use of shared buffers for temporary tables. The
advantage being queries involving temporary tables can make use of parallel
workers.
Hm...
Challenges:
1. We lose the performance benefit of local buffers.
Yeah. This would be an absolute dead loss for any situation where you
couldn't get major parallelism wins, which I'm afraid would be often.
So then you have to think about how to transition smoothly between "rel
is in local buffers" and "rel is in shared buffers", bearing in mind that
ever having the same page in two different buffers would be disastrous.
2. Additional complexity in shared buffer manager - BufferTag needs to
include backend ID to distinguish 'my' temp buffers from non-temp or
'others' temp buffers.
I think that would be a deal breaker right there, because of the
distributed overhead of making the tags bigger. However, I don't
actually understand why you would need to do that. Temp tables
have unique OIDs/relfilenodes anyway, don't they? Or if I'm
misremembering and they don't, couldn't we make them so?
3. Checkpointer needs to skip them for fsync but bgwriter needs to write
them out.
Not really sure why that would be a "must", either. If the checkpointer
performs some useless writes, that's a bit of a performance drag, but
taking a performance hit to avoid it could be a net loss. The only reason
why you'd care about writing at all is to try to make the buffers clean
in case they have to be reclaimed for some other use --- and if the
checkpointer does such a write instead of the bgwriter, why's that bad?
regards, tom lane
On Thu, May 24, 2018 at 8:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
So then you have to think about how to transition smoothly between "rel
is in local buffers" and "rel is in shared buffers", bearing in mind that
ever having the same page in two different buffers would be disastrous.
Local buffers would not be used at all if temp tables start residing in
shared buffers. The transition mentioned above shouldn't be needed.
I think that would be a deal breaker right there, because of the
distributed overhead of making the tags bigger. However, I don't
actually understand why you would need to do that. Temp tables
have unique OIDs/relfilenodes anyway, don't they? Or if I'm
misremembering and they don't, couldn't we make them so?
My parochial vision of the overhead is restricted to 4 * NBuffers of
additional shared memory, as 4 bytes are being added to BufferTag. May I
please get some enlightenment?
Temp tables have a unique filename on disk: t_<backendID>_<relfilenode>. The
logic to assign OIDs and relfilenodes, however, doesn't differ. Given a
RelFileNode, it is not possible to tell if it's a temp table or not.
RelFileNodeBackend allows for that distinction but it's not used by buffer
manager.
taking a performance hit to avoid it could be a net loss. The only reason
why you'd care about writing at all is to try to make the buffers clean
in case they have to be reclaimed for some other use --- and if the
checkpointer does such a write instead of the bgwriter, why's that bad?
Yes, a temp table's buffer would need to be written out only if it needs to
be repurposed for a different page. It is not bad, our description wasn't
clear enough.
Asim
On 25/05/18 09:25, Asim Praveen wrote:
On Thu, May 24, 2018 at 8:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
So then you have to think about how to transition smoothly between "rel
is in local buffers" and "rel is in shared buffers", bearing in mind that
ever having the same page in two different buffers would be disastrous.
Local buffers would not be used at all if temp tables start residing in
shared buffers. The transition mentioned above shouldn't be needed.
What is the performance difference between the local buffer manager and
the shared buffer manager? The local buffer manager avoids all the
locking overhead, which has to amount to something, but how big a
difference is it?
I think that would be a deal breaker right there, because of the
distributed overhead of making the tags bigger. However, I don't
actually understand why you would need to do that. Temp tables
have unique OIDs/relfilenodes anyway, don't they? Or if I'm
misremembering and they don't, couldn't we make them so?
My parochial vision of the overhead is restricted to 4 * NBuffers of
additional shared memory, as 4 bytes are being added to BufferTag. May I
please get some enlightenment?
Any extra fields in BufferTag make computing the hash more expensive.
It's a very hot code path, so any cycles spent are significant.
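For context, the buffer mapping hash is computed over the raw bytes of the
whole tag (src/backend/storage/buffer/buf_table.c), so every byte added to
BufferTag gets hashed on every buffer lookup:

    uint32
    BufTableHashCode(BufferTag *tagPtr)
    {
        /* SharedBufHash was created with keysize = sizeof(BufferTag),
         * so the hash covers every byte of the tag */
        return get_hash_value(SharedBufHash, (void *) tagPtr);
    }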
In relation to Andres' patches to rewrite the buffer manager with a
radix tree, there was actually some discussion of trying to make
BufferTag *smaller*. For example, we could rearrange things so that
pg_class.relfilenode is 64 bits wide. Then you could assume that it
never wraps around, and is unique across all relations in the cluster.
Then you could replace the 12-byte relfilenode+dbid+spcid triplet with
just the 8-byte relfilenode. Doing something like that might be the
solution here, too.
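Under that scheme the tag might shrink to something like the following
sketch. This is hypothetical; it assumes the 64-bit relfilenode never
wraps and is unique across the whole cluster:

    typedef struct buftag
    {
        uint64      relNode;    /* hypothetical 64-bit cluster-wide-unique
                                 * relfilenode, replacing the spcid, dbid
                                 * and relfilenode OIDs */
        ForkNumber  forkNum;
        BlockNumber blockNum;
    } BufferTag;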
Temp tables have a unique filename on disk: t_<backendID>_<relfilenode>. The
logic to assign OIDs and relfilenodes, however, doesn't differ. Given a
RelFileNode, it is not possible to tell if it's a temp table or not.
RelFileNodeBackend allows for that distinction but it's not used by buffer
manager.
Could you store the backendid in BufferDesc, outside of BufferTag? Is it
possible for a normal table and a temporary table to have the same
relfilenode+dbid+spcid triplet?
- Heikki
On 2018-05-25 09:40:10 +0300, Heikki Linnakangas wrote:
On 25/05/18 09:25, Asim Praveen wrote:
My parochial vision of the overhead is restricted to 4 * NBuffers of
additional shared memory, as 4 bytes are being added to BufferTag. May I
please get some enlightenment?
Any extra fields in BufferTag make computing the hash more expensive. It's a
very hot code path, so any cycles spent are significant.
Indeed, very much so.
But I'm not sure we need anything in the tags themselves. We don't
denote buffers for unlogged tables in the tag itself either. As Tom
observed, the OIDs for temp tables are either unique or can be made
unique easily enough. And the temporariness can be declared in a bit in
the buffer header, rather than the tag itself. I don't see why a hash
lookup would need to know that.
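That would mirror how unlogged relations are handled already: BM_PERMANENT
is a flag bit in the buffer header's state word, not part of the tag. A
sketch of the analogous temp-table bit; BM_TEMP is hypothetical, and the
bit position is illustrative only, since the ten flag bits in the state
word are already assigned and a real patch would have to find room:

    /* existing, in src/include/storage/buf_internals.h */
    #define BM_PERMANENT        (1U << 31)  /* permanent relation (not
                                             * unlogged or init fork) */

    /* hypothetical analog marking buffers of temporary relations */
    #define BM_TEMP             (1U << 20)  /* illustrative bit only */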
In relation to Andres' patches to rewrite the buffer manager with a radix
tree, there was actually some discussion of trying to make BufferTag
*smaller*.
FWIW, in the latest version that doesn't matter that much
anymore. Instead of one big tree it's a hashtable of trees (although it
potentially should rather be a tree of trees). The hashtable maps to a
radix tree, and that radix tree is just indexed by the offset. The root
of the tree is then cached inside the smgr, avoiding the need to
repeatedly look it up.
For example, we could rearrange things so that
pg_class.relfilenode is 64 bits wide. Then you could assume that it never
wraps around, and is unique across all relations in the cluster. Then you
could replace the 12-byte relfilenode+dbid+spcid triplet, with just the
8-byte relfilenode. Doing something like that might be the solution here,
too.
OTOH it's quite useful to have the buffertag be something that can (or
rather could) be efficiently searched for in a hierarchical
fashion. While, by far, not as crucial performance-wise as dropping an
individual relation, it would be nice not to have to scan all of s_b to
drop a database.
Temp tables have a unique filename on disk: t_<backendID>_<relfilenode>. The
logic to assign OIDs and relfilenodes, however, doesn't differ. Given a
RelFileNode, it is not possible to tell if it's a temp table or not.
RelFileNodeBackend allows for that distinction but it's not used by buffer
manager.
Could you store the backendid in BufferDesc, outside of BufferTag? Is it
possible for a normal table and a temporary table to have the same
relfilenode+dbid+spcid triplet?
When starting to work on the radix tree stuff I had, to address the size
of buffer tag issue you mention above, a prototype patch that created a
shared 'relfilenode' table. That guaranteed that relfilenodes are
unique. That'd work here as well, and would allow us to get rid of a good
chunk of ugliness we have around allocating relfilenodes right now (like
not unlinking files etc).
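A rough sketch of how such a shared relfilenode table could enforce
uniqueness at allocation time. Everything here except GetNewObjectId(),
hash_search() and the LWLock calls is hypothetical:

    /* entry in a hypothetical shared hash table of in-use relfilenodes */
    typedef struct RelFileNodeEntry
    {
        Oid     relNode;    /* hash key */
        Oid     dbNode;     /* owning database, for cleanup on drop */
    } RelFileNodeEntry;

    static Oid
    ReserveRelFileNode(Oid dbNode)
    {
        for (;;)
        {
            Oid     candidate = GetNewObjectId();
            bool    found;
            RelFileNodeEntry *entry;

            LWLockAcquire(RelFileNodeTableLock, LW_EXCLUSIVE);
            entry = (RelFileNodeEntry *)
                hash_search(RelFileNodeTable, &candidate, HASH_ENTER, &found);
            if (!found)
                entry->dbNode = dbNode;     /* reserved for us */
            LWLockRelease(RelFileNodeTableLock);

            if (!found)
                return candidate;
            /* collided with a live relfilenode; try the next OID */
        }
    }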
But more generally, I don't see why it'd be that problematic to just get
rid of the backendid? I don't really see any technical necessity to have
it.
Greetings,
Andres Freund
On Thu, May 24, 2018 at 11:50 PM, Andres Freund <andres@anarazel.de> wrote:
On 2018-05-25 09:40:10 +0300, Heikki Linnakangas wrote:
On 25/05/18 09:25, Asim Praveen wrote:
My parochial vision of the overhead is restricted to 4 * NBuffers of
additional shared memory, as 4 bytes are being added to BufferTag. May I
please get some enlightenment?
Any extra fields in BufferTag make computing the hash more expensive.
It's a
very hot code path, so any cycles spent are significant.
Indeed, very much so.
But I'm not sure we need anything in the tags themselves. We don't
denote buffers for unlogged tables in the tag itself either. As Tom
observed, the OIDs for temp tables are either unique or can be made
unique easily enough. And the temporariness can be declared in a bit in
the buffer header, rather than the tag itself. I don't see why a hash
lookup would need to know that.
Currently, relfilenodes (specifically spcid,dbid,relfilenode) for temp and
regular tables can collide, as temp files have a "t_nnn" representation
on disk. Because of this, the relfilenode allocation logic can assign the
same relfilenode to a temp and a non-temp table. If relfilenode uniqueness
can be achieved, the need for adding anything to the buffer tag goes away.
When starting to work on the radix tree stuff I had, to address the size
of buffer tag issue you mention above, a prototype patch that created a
shared 'relfilenode' table. That guaranteed that relfilenodes are
unique. That'd work here as well, and would allow to get rid of a good
chunk of uglyness we have around allocating relfilenodes right now (like
not unlinking files etc).
That would be great!
But more generally, I don't see why it'd be that problematic to just get
rid of the backendid? I don't really see any technical necessity to have
it.
Backendid, it seems, was also added for the same reason: the lack of
unique relfilenodes for temp tables. So yes, with uniqueness guaranteed it
can go away as well.
On Fri, May 25, 2018 at 6:33 AM, Asim Praveen <apraveen@pivotal.io> wrote:
Hello
We are evaluating the use of shared buffers for temporary tables. The
advantage being queries involving temporary tables can make use of parallel
workers.
This is one way, but I think there are other choices as well. We can
identify and flush all the dirty (local) buffers for the relation
being accessed in parallel. Then, once the parallel operation has
started, we won't allow any write operations on them. That
could be expensive if we have a lot of dirty local buffers for a
particular relation. If we are worried about the cost of
writes, we can try a different way to parallelize a temporary
table scan. At the beginning of the scan, the leader backend
remembers the dirty blocks present in its local buffers; it can then
share the list with the parallel workers, which skip scanning those
blocks, and at the end the leader scans those blocks itself. This
shouldn't incur much additional cost, as the skipped blocks should
already be present in the leader's local buffers.
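For concreteness, a very rough sketch of that second scheme. Every
identifier in it is hypothetical, and real code would hang off the
parallel context and heap-scan machinery:

    /* Leader, before launching workers: publish a sorted list of the
     * blocks that are dirty in its local buffers. */
    nskip = CollectDirtyLocalBlocks(rel, skip_blocks);      /* hypothetical */
    ShareBlockListViaDSM(pcxt, skip_blocks, nskip);         /* hypothetical */

    /* Worker, for each block allotted to it by the parallel scan: */
    if (bsearch(&blkno, skip_blocks, nskip,
                sizeof(BlockNumber), blocknumber_cmp))
        continue;               /* leave this block for the leader */

    /* Leader, after the workers finish: scan the skipped blocks itself;
     * they are still in its local buffers, so this should be cheap. */
    for (i = 0; i < nskip; i++)
        ScanOneBlock(rel, skip_blocks[i]);                  /* hypothetical */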
I understand that none of these alternatives are straightforward, but
I think it is worth considering whether we have any better way to
allow parallel temporary table scans.
Challenges:
1. We lose the performance benefit of local buffers.
Yeah, I think cases where we need to drop temp relations will become
costlier, as they have to traverse all of shared buffers instead of
just the local buffers.
I think if we use shared buffers for temp relations, there will be
some overhead for other backends as well, especially for the cases
when backends need to evict buffers. It is quite possible that if the
relation is in local buffers, we might not write it at all, but moving
it to shared buffers will increase its probability of being written to
disk.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi Amit
On Mon, May 28, 2018 at 4:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
This is one way, but I think there are other choices as well. We can
identify and flush all the dirty (local) buffers for the relation
being accessed parallelly. Now, once the parallel operation is
started, we won't allow performing any write operation on them. It
We talked about this in person in Ottawa and it was great meeting you!
To summarize, the above proposal to continue using local buffers for
temp tables is a step forward; however, it enables only certain kinds
of queries to be parallelized for temp tables. E.g., queries changing
a temp table in any way cannot be parallelized due to the restriction
of no writes during parallel operation.
Yeah, I think cases, where we need to drop temp relations, will become
costlier as they have to traverse all the shared buffers instead of
just local buffers.
This is a valid concern. The idea of using a radix tree of block
numbers as proposed by Andres [1] is worth pursuing. The cost of
identifying and dropping shared buffers belonging to a relation using
a radix tree would be reduced to O(log n).
Asim
[1]: /messages/by-id/20150912201941.GA8311@alap3.anarazel.de
On Sat, Jun 2, 2018 at 05:18:17PM -0400, Asim Praveen wrote:
Hi Amit
On Mon, May 28, 2018 at 4:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
This is one way, but I think there are other choices as well. We can
identify and flush all the dirty (local) buffers for the relation
being accessed parallelly. Now, once the parallel operation is
started, we won't allow performing any write operation on them. It
We talked about this in person in Ottawa and it was great meeting you!
To summarize, the above proposal to continue using local buffers for
temp tables is a step forward, however, it enables only certain kinds
of queries to be parallelized for temp tables. E.g. queries changing
a temp table in any way cannot be parallelized due to the restriction
of no writes during parallel operation.
Should this be a TODO item?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
On Wed, Jun 20, 2018 at 8:47 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Sat, Jun 2, 2018 at 05:18:17PM -0400, Asim Praveen wrote:
Hi Amit
On Mon, May 28, 2018 at 4:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
This is one way, but I think there are other choices as well. We can
identify and flush all the dirty (local) buffers for the relation
being accessed parallelly. Now, once the parallel operation is
started, we won't allow performing any write operation on them. It
We talked about this in person in Ottawa and it was great meeting you!
To summarize, the above proposal to continue using local buffers for
temp tables is a step forward, however, it enables only certain kinds
of queries to be parallelized for temp tables. E.g. queries changing
a temp table in any way cannot be parallelized due to the restriction
of no writes during parallel operation.
Should this be a TODO item?
+1. I think we have not hammered out the design completely, but if
somebody is willing to put in the effort, it is not an unsolvable problem.
AFAIU, this thread is about parallelizing queries that refer to temp
tables; however, that is not clear from the title of this thread.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, May 28, 2018 at 4:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, May 25, 2018 at 6:33 AM, Asim Praveen <apraveen@pivotal.io> wrote:
We are evaluating the use of shared buffers for temporary tables. The
advantage being queries involving temporary tables can make use of parallel
workers.
This is one way, but I think there are other choices as well. We can
identify and flush all the dirty (local) buffers for the relation
being accessed parallelly. Now, once the parallel operation is
started, we won't allow performing any write operation on them. It
could be expensive if we have a lot of dirty local buffers for a
particular relation. I think if we are worried about the cost of
writes, then we can try some different way to parallelize temporary
table scan. At the beginning of the scan, leader backend will
remember the dirty blocks present in local buffers, it can then share
the list with parallel workers which will skip scanning those blocks
and in the end leader ensures that all those blocks will be scanned by
the leader. This shouldn't incur a much additional cost as the
skipped blocks should be present in local buffers of backend.
This sounds awkward and limiting. How about using DSA to allocate
space for the backend's temporary buffers and a dshash for lookups?
Then all backends can share, but we don't have to go through
shared_buffers.
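For illustration, a minimal sketch of that direction using the existing
dsa.c/dshash.c APIs. The entry layout, tranche id, and setup here are all
hypothetical:

    /* hypothetical directory entry: one temp-table page kept in DSA memory */
    typedef struct TempBufEntry
    {
        BufferTag   key;        /* page identity */
        dsa_pointer page;       /* BLCKSZ bytes of page data */
    } TempBufEntry;

    static const dshash_parameters tempbuf_params = {
        sizeof(BufferTag),      /* key_size */
        sizeof(TempBufEntry),   /* entry_size */
        dshash_memcmp,
        dshash_memhash,
        LWTRANCHE_TEMP_BUFFERS  /* hypothetical tranche id */
    };

    /* owning backend creates it; parallel workers dshash_attach() to it */
    dshash_table *tempbufs = dshash_create(area, &tempbuf_params, NULL);

    /* storing a page */
    bool        found;
    TempBufEntry *e = dshash_find_or_insert(tempbufs, &tag, &found);
    if (!found)
        e->page = dsa_allocate(area, BLCKSZ);
    memcpy(dsa_get_address(area, e->page), pagedata, BLCKSZ);
    dshash_release_lock(tempbufs, e);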
Honestly, I don't see how pushing this through the main shared_buffers
arena is ever going to work. Widening the buffer tags by another 4
bytes is a non-starter, I think. Maybe with some large rejiggering we
could get to a point where every relation, temporary or permanent, can
be identified by a single OID, regardless of tablespace, etc. But if
you wait for that, this might not happen for a loooong time.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jun 22, 2018 at 6:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, May 28, 2018 at 4:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, May 25, 2018 at 6:33 AM, Asim Praveen <apraveen@pivotal.io> wrote:
We are evaluating the use of shared buffers for temporary tables. The
advantage being queries involving temporary tables can make use of parallel
workers.
This is one way, but I think there are other choices as well. We can
identify and flush all the dirty (local) buffers for the relation
being accessed parallelly. Now, once the parallel operation is
started, we won't allow performing any write operation on them. It
could be expensive if we have a lot of dirty local buffers for a
particular relation. I think if we are worried about the cost of
writes, then we can try some different way to parallelize temporary
table scan. At the beginning of the scan, leader backend will
remember the dirty blocks present in local buffers, it can then share
the list with parallel workers which will skip scanning those blocks
and in the end leader ensures that all those blocks will be scanned by
the leader. This shouldn't incur a much additional cost as the
skipped blocks should be present in local buffers of backend.
This sounds awkward and limiting. How about using DSA to allocate
space for the backend's temporary buffers and a dshash for lookups?
That's a better idea.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 21, 2018 at 07:42:54AM +0530, Amit Kapila wrote:
On Wed, Jun 20, 2018 at 8:47 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Sat, Jun 2, 2018 at 05:18:17PM -0400, Asim Praveen wrote:
Hi Amit
On Mon, May 28, 2018 at 4:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
This is one way, but I think there are other choices as well. We can
identify and flush all the dirty (local) buffers for the relation
being accessed parallelly. Now, once the parallel operation is
started, we won't allow performing any write operation on them. It
We talked about this in person in Ottawa and it was great meeting you!
To summarize, the above proposal to continue using local buffers for
temp tables is a step forward, however, it enables only certain kinds
of queries to be parallelized for temp tables. E.g. queries changing
a temp table in any way cannot be parallelized due to the restriction
of no writes during parallel operation.
Should this be a TODO item?
+1. I think we have not hammered out the design completely, but if
somebody is willing to put effort, it is not an unsolvable problem.
AFAIU, this thread is about parallelizing queries that refer temp
tables, however, it is not clear from the title of this thread.
Seems it is already documented on the wiki:
https://wiki.postgresql.org/wiki/Parallel_Query#What_Parts_of_a_Query_Can_Run_In_Parallel.3F
o Scans of plain tables may not appear below Gather if (1) they are
temporary tables ...
----------------
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +