Postgres with pthread
Hi hackers,
As far as I remember, several years ago, when the implementation of
intra-query parallelism was just getting started, there was a discussion
about whether to use threads or keep the traditional Postgres process
architecture. The decision was made to keep processes. So now we have
bgworkers, the shared message queue, DSM, ...
The main argument for that decision was that switching to threads would
require rewriting most of the Postgres code.
That seemed a quite reasonable argument, and until now I agreed with it.
But recently I wanted to check it myself.
The first problem with porting Postgres to pthreads is the static
variables widely used in the Postgres code.
Most modern compilers support thread-local variables; for example, GCC
provides the __thread keyword.
Such variables are placed in a separate segment which is addressed through
a segment register (on Intel).
So access time to such variables is the same as to normal static variables.
Certainly, not all compilers may have built-in support for TLS, and it may
not be implemented as efficiently on all hardware platforms as it is on
Intel.
So this approach certainly decreases the portability of Postgres, but IMHO
that is not critical.
What I have done:
1. Added session_local (defined as __thread) to the definitions of most
static and global variables.
I left some variables that point to shared memory as plain static. I also
had to change the initialization of some static variables, because the
address of a TLS variable cannot be used in a static initializer.
2. Changed the implementation of GUCs to make them thread-specific.
3. Replaced fork() with pthread_create().
4. Rewrote the file descriptor cache to be global (shared by all threads).
I have not changed any of the Postgres synchronization primitives or
shared memory.
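A minimal sketch of the annotation in step 1, including the static-initializer limitation, might look like this. This is illustrative code, not the prototype's actual source; session_local is the name used in the prototype, but the variables and functions here are invented:

```c
#include <stddef.h>

/* Each formerly global variable becomes thread-local, so every backend
 * thread gets its own copy, just as each process did before. */
#define session_local __thread

session_local int MyProcPid = 0;            /* per-thread "global" */
session_local char *application_name = NULL;

/* The address of a TLS variable is not a compile-time constant, so a
 * static initializer like this would NOT compile:
 *
 *     static int *pid_ptr = &MyProcPid;    // error: not constant
 *
 * Instead the pointer has to be filled in at thread startup: */
static session_local int *pid_ptr;

void session_init(int pid)
{
    pid_ptr = &MyProcPid;   /* legal at run time */
    MyProcPid = pid;
}

int session_get_pid(void)
{
    return *pid_ptr;
}
```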
It took me about one week of work.
What is not done yet:
1. Handling of signals (I expect that the Win32 code can be somehow reused
here).
2. Deallocation of memory and closing of files on backend (thread)
termination.
3. Interaction of the postmaster and backends with the PostgreSQL auxiliary
processes (threads), such as autovacuum, bgwriter, checkpointer, stats
collector, ...
What are the advantages of using threads instead of processes?
1. No need to use shared memory, so there is no static limit on the amount
of memory that Postgres can use. No need for dynamic shared memory and the
other machinery designed to share memory between backends and
bgworkers.
2. Threads significantly simplify the implementation of parallel
algorithms: interaction and data transfer between threads can be done more
easily and efficiently.
3. It is possible to use more efficient/lightweight synchronization
primitives. Postgres now mostly relies on its own low-level sync
primitives, whose user-level implementation uses spinlocks and atomics
and then falls back to OS semaphores/poll.
I am not sure how much we can gain by replacing these primitives with
ones optimized for threads.
A colleague of mine from the Firebird community told me that just replacing
processes with threads gave a 20% increase in performance, and that was
only the first step: replacing the sync primitives can give a much greater
advantage. But maybe for Postgres, with its low-level primitives, this is
not true.
4. Threads are more lightweight entities than processes. A context switch
between threads takes less time than between processes, threads consume
less memory, and it is usually possible to spawn more threads than
processes.
5. More efficient use of virtual memory. Since all threads share the same
address space, the TLB is used much more efficiently.
6. Faster backend startup. Certainly, starting a backend for each user
request is a bad idea in any case, and some kind of connection pooling
should be used to provide acceptable performance. But starting a new
backend process in Postgres causes a lot of page faults, which have a
dramatic impact on performance, and there is no such problem with threads.
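As an illustration of the pattern in point 3 (spin on an atomic flag a bounded number of times, then back off to the OS), here is a hedged sketch. It is not Postgres's actual s_lock.c, and a real implementation would eventually sleep on a semaphore or futex rather than merely yielding:

```c
#include <stdatomic.h>
#include <sched.h>

/* Toy spinlock: busy-wait on an atomic flag, yield after a bounded
 * number of spins instead of burning the CPU forever. */
typedef struct { atomic_flag locked; } spinlock_t;

#define SPINS_BEFORE_YIELD 100

void spin_lock(spinlock_t *lk)
{
    int spins = 0;
    while (atomic_flag_test_and_set_explicit(&lk->locked,
                                             memory_order_acquire))
    {
        if (++spins >= SPINS_BEFORE_YIELD)
        {
            sched_yield();      /* fallback: give up the CPU */
            spins = 0;
        }
    }
}

void spin_unlock(spinlock_t *lk)
{
    atomic_flag_clear_explicit(&lk->locked, memory_order_release);
}
```

In a thread-only world, the fallback path could use a futex or a condition variable shared at known addresses, which is the kind of simplification being argued for.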
Certainly, processes also have some advantages compared with threads:
1. Better isolation and error protection
2. Easier error handling
3. Easier control of used resources
But that is theory. The main idea of this prototype was to prove or
disprove these expectations in practice.
I didn't expect large differences in performance, because the
synchronization primitives are not changed and I performed my experiments
on Linux, where threads and processes are implemented in a similar way.
Below are some results (1000xTPS) of a select-only (-S) pgbench with scale
100 on my desktop with a quad-core i7-4770 3.40GHz and 16GB of RAM:

Connections  Vanilla/default  Vanilla/prepared  pthreads/default  pthreads/prepared
10           100              191               106               207
100          67               131               105               168
1000         41               65                55                102
As you can see, for a small number of connections the results are almost
the same, but with a large number of connections pthreads show less
degradation.
You can look at my prototype here:
https://github.com/postgrespro/postgresql.pthreads.git
But please note that it is a very raw prototype: a lot of stuff is not
working yet, and supporting all of the existing Postgres functionality will
require much more effort (and even more effort will be needed to optimize
Postgres for this architecture).
I just want to get some feedback and find out whether the community is
interested in any further work in this direction.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
Below are some results (1000xTPS) of select-only (-S) pgbench with scale
100 at my desktop with quad-core i7-4770 3.40GHz and 16Gb of RAM:
Connections Vanilla/default Vanilla/prepared
pthreads/defaultpthreads/prepared
10 100 191
106 207
100 67 131
105 168
1000 41 65
55 102
This table is so mangled that I'm not very sure what it's saying.
Maybe you should have made it an attachment?
However, if I guess at which numbers are supposed to be what,
it looks like even the best case is barely a 50% speedup.
That would be worth pursuing if it were reasonably low-hanging
fruit, but converting PG to threads seems very far from being that.
I think you've done us a very substantial service by pursuing
this far enough to get some quantifiable performance results.
But now that we have some results in hand, I think we're best
off sticking with the architecture we've got.
regards, tom lane
Here it is formatted a little better.
So a little over 50% performance improvement for a couple of the test cases.
Hi!
On 2017-12-06 19:40:00 +0300, Konstantin Knizhnik wrote:
As far as I remember, several years ago when implementation of intra-query
parallelism was just started there was discussion whether to use threads or
leave traditional Postgres process architecture. The decision was made to
leave processes. So now we have bgworkers, shared message queue, DSM, ...
The main argument for such decision was that switching to threads will
require rewriting of most of Postgres code.
It seems to be quit reasonable argument and and until now I agreed with it.
But recently I wanted to check it myself.
I think that's something pretty important to play with. There've been
several discussions lately, both on and off list / in person, that we're
taking on more-and-more technical debt just because we're using
processes. Besides the above, we've grown:
- a shared memory allocator
- a shared memory hashtable
- weird looking thread aware pointers
- significant added complexity in various projects due to addresses not
being mapped to the same address etc.
The first problem with porting Postgres to pthreads is static variables
widely used in Postgres code.
Most of modern compilers support thread local variables, for example GCC
provides __thread keyword.
Such variables are placed in separate segment which is address through
segment register (at Intel).
So access time to such variables is the same as to normal static variables.
I experimented similarly. Although I'm not 100% sure that if we were to go
for it, we wouldn't instead want to abstract our session concept
further, or well, at all.
Certainly may be not all compilers have builtin support of TLS and may be
not at all hardware platforms them are implemented ias efficiently as at
Intel.
So certainly such approach decreases portability of Postgres. But IMHO it is
not so critical.
I'd agree there, but I don't think the project necessarily does.
What I have done:
1. Add session_local (defined as __thread) to definition of most of static
and global variables.
I leaved some variables pointed to shared memory as static. Also I have to
changed initialization of some static variables,
because address of TLS variable can not be used in static initializers.
2. Change implementation of GUCs to make them thread specific.
3. Replace fork() with pthread_create
4. Rewrite file descriptor cache to be global (shared by all threads).
That one I'm very unconvinced of, that's going to add a ton of new
contention.
What are the advantages of using threads instead of processes?
1. No need to use shared memory. So there is no static limit for amount of
memory which can be used by Postgres. No need in distributed shared memory
and other stuff designed to share memory between backends and
bgworkers.
This imo is the biggest part. We can stop duplicating OS and our own
implementations in a shmem aware way.
2. Threads significantly simplify implementation of parallel algorithms:
interaction and transferring data between threads can be done easily and
more efficiently.
That's imo the same as 1.
3. It is possible to use more efficient/lightweight synchronization
primitives. Postgres now mostly relies on its own low level sync.primitives
which user-level implementation
is using spinlocks and atomics and then fallback to OS semaphores/poll. I am
not sure how much gain can we get by replacing this primitives with one
optimized for threads.
My colleague from Firebird community told me that just replacing processes
with threads can obtain 20% increase of performance, but it is just first
step and replacing sync. primitive
can give much greater advantage. But may be for Postgres with its low level
primitives it is not true.
I don't believe that that's actually the case to any significant degree.
6. Faster backend startup. Certainly starting backend at each user's request
is bad thing in any case. Some kind of connection pooling should be used in
any case to provide acceptable performance. But in any case, start of new
backend process in postgres causes a lot of page faults which have
dramatical impact on performance. And there is no such problem with threads.
I don't buy this in itself. The connection establishment overhead isn't
largely the fork, it's all the work afterwards. I do think it makes
connection pooling etc easier.
I just want to receive some feedback and know if community is interested in
any further work in this direction.
I personally am. I think it's beyond high time that we move to take
advantage of threads.
That said, I don't think just replacing processes with threads is the
right thing. I'm
pretty sure we'd still want to have postmaster as a separate process,
for robustness. Possibly we even want to continue having various
processes around besides that, the most interesting cases involving
threads are around intra-query parallelism, and pooling, and for both a
hybrid model could be beneficial.
I think that we probably initially want some optional move to
threads. Most extensions won't initially be thread ready, and imo we
should continue to work with that for a while, just refusing to use
parallelism if any loaded shared library doesn't signal parallelism
support. We also don't necessarily want to require threads on all
platforms at the same time.
I think the biggest problem with doing this for real is that it's a huge
project, and that it'll take a long time.
Thanks for working on this!
Andres Freund
On Wed, Dec 6, 2017 at 11:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
barely a 50% speedup.
I think that's an awfully strange choice of adverb. This is, by its
author's own admission, a rough cut at this, probably with very little
of the optimization that could ultimately done, and it's already
buying 50% on some test cases? That sounds phenomenally good to me.
A 50% speedup is huge, and chances are that it can be made quite a bit
better with more work, or that it already is quite a bit better with
the right test case.
TBH, based on previous discussion, I expected this to initially be
*slower* but still worthwhile in the long run because of optimizations
that it would let us do eventually with parallel query and other
things. If it's this much faster out of the gate, that's really
exciting.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2017-12-06 11:53:21 -0500, Tom Lane wrote:
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
However, if I guess at which numbers are supposed to be what,
it looks like even the best case is barely a 50% speedup.
"barely a 50% speedup" - Hah. I don't believe the numbers, but that'd be
huge.
That would be worth pursuing if it were reasonably low-hanging
fruit, but converting PG to threads seems very far from being that.
I don't think immediate performance gains are the interesting part about
using threads. It's rather what their absence adds a lot in existing /
submitted code complexity, and makes some very commonly requested
features a lot harder to implement:
- we've a lot of duplicated infrastructure around dynamic shared
memory: dsm.c, dsa.c, dshash.c, etc. A lot of these, especially dsa.c,
are going to become a lot more complicated over time, just look at how
complicated good multi threaded allocators are.
- we're adding a lot of slowness to parallelism, just because we have
different memory layouts in different processes. Instead of just
passing pointers through queues, we put entire tuples in there. We
deal with dsm aware pointers.
- a lot of features have been a lot harder (parallelism!), and a lot of
frequently requested ones are so hard due to processes that they never
got off the ground (in-core pooling, process reuse, parallel worker reuse)
- due to the statically sized shared memory a lot of our configuration
is pretty fundamentally PGC_POSTMASTER, even though that presents a lot
of administrative problems.
...
I think you've done us a very substantial service by pursuing
this far enough to get some quantifiable performance results.
But now that we have some results in hand, I think we're best
off sticking with the architecture we've got.
I don't agree.
I'd personally expect that an immediate conversion would result in very
little speedup, a bunch of code deleted, a bunch of complexity
added. And it'd still be massively worthwhile, to keep medium to long
term complexity and feature viability in control.
Greetings,
Andres Freund
"barely a 50% speedup" - Hah. I don't believe the numbers, but that'd be
huge.
They are numbers derived from a benchmark that any sane person would
be using a connection pool for in a production environment, but
impressive if true nonetheless.
On 12/06/2017 06:08 PM, Andres Freund wrote:
I think the biggest problem with doing this for real is that it's a huge
project, and that it'll take a long time.
An additional issue is that this could break a lot of extensions, and in
a way that is not apparent at compile time. This means we may need to
break all extensions to force extension authors to check whether they are
thread-safe.
I do not like making life hard for our extension community, but if the
gains are big enough it might be worth it.
Thanks for working on this!
Seconded.
Andreas
On Wed, Dec 6, 2017 at 12:08 PM, Andres Freund <andres@anarazel.de> wrote:
4. Rewrite file descriptor cache to be global (shared by all threads).
That one I'm very unconvinced of, that's going to add a ton of new
contention.
It might be OK on systems where we can use pread()/pwrite().
Otherwise it's going to be terrible.
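To illustrate Robert's point: with per-process fds the kernel file position is private, but with one fd shared by many threads, lseek()+read() pairs from different threads can interleave. pread() takes the offset as an argument and never touches the shared file position. A small sketch (file path and block layout are arbitrary):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read a fixed-size block without moving the fd's shared file
 * position -- safe even when many threads share the same fd. */
ssize_t read_block(int fd, void *buf, size_t len, off_t blockno,
                   size_t blocksize)
{
    return pread(fd, buf, len, blockno * (off_t) blocksize);
}

int demo_pread(const char *path)
{
    char buf[4];
    int  fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    /* lay down two 4-byte "blocks" */
    if (pwrite(fd, "AAAA", 4, 0) != 4 || pwrite(fd, "BBBB", 4, 4) != 4)
        return -1;
    /* fetch block 1; no lseek, so no race on the shared offset */
    if (read_block(fd, buf, 4, 1, 4) != 4)
        return -1;
    close(fd);
    return memcmp(buf, "BBBB", 4);      /* 0 on success */
}
```

Without pread()/pwrite(), every access to a shared fd would need a lock around the lseek()+read() pair, which is the contention Andres and Robert are worried about.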
That said, I don't think just replacing threads is the right thing. I'm
pretty sure we'd still want to have postmaster as a separate process,
for robustness.
+1. The tendency of the postmaster to not die has been a huge boon to
the reliability of PostgreSQL - I would not like to give that up.
MySQL ends up needing safe_mysqld to cope with this issue; our idea of
having it built into the server is better.
Possibly we even want to continue having various
processes around besides that, the most interesting cases involving
threads are around intra-query parallelism, and pooling, and for both a
hybrid model could be beneficial.
I think if we only use threads for intra-query parallelism we're
leaving a lot of money on the table. For example, if all
shmem-connected backends are using the same process, then we can make
max_locks_per_transaction PGC_SIGHUP. That would be sweet, and there
are probably plenty of similar things. Moreover, if threads are this
thing that we only use now and then for parallel query, then our
support for them will probably have bugs. If we use them all the
time, we'll actually find the bugs and fix them. I hope.
I think the biggest problem with doing this for real is that it's a huge
project, and that it'll take a long time.
+1
Thanks for working on this!
+1
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2017-12-06 12:28:29 -0500, Robert Haas wrote:
Possibly we even want to continue having various
processes around besides that, the most interesting cases involving
threads are around intra-query parallelism, and pooling, and for both a
hybrid model could be beneficial.
I think if we only use threads for intra-query parallelism we're
leaving a lot of money on the table. For example, if all
shmem-connected backends are using the same process, then we can make
max_locks_per_transaction PGC_SIGHUP. That would be sweet, and there
are probably plenty of similar things. Moreover, if threads are this
thing that we only use now and then for parallel query, then our
support for them will probably have bugs. If we use them all the
time, we'll actually find the bugs and fix them. I hope.
I think it'd make a lot of sense to go there gradually. I agree that we
probably want to move to more and more use of threads, but we also want
our users not to kill us ;). Initially we'd surely continue to use
partitioned dynahash for locks, which'd make resizing infeasible
anyway. Similar for shared buffers (which I find a hell of a lot more
interesting to change at runtime than max_locks_per_transaction), etc...
- Andres
On Thu, Dec 7, 2017 at 6:08 AM, Andres Freund <andres@anarazel.de> wrote:
On 2017-12-06 19:40:00 +0300, Konstantin Knizhnik wrote:
As far as I remember, several years ago when implementation of intra-query
parallelism was just started there was discussion whether to use threads or
leave traditional Postgres process architecture. The decision was made to
leave processes. So now we have bgworkers, shared message queue, DSM, ...
The main argument for such decision was that switching to threads will
require rewriting of most of Postgres code.
It seems to be quit reasonable argument and and until now I agreed with it.
But recently I wanted to check it myself.
I think that's something pretty important to play with. There've been
several discussions lately, both on and off list / in person, that we're
taking on more-and-more technical debt just because we're using
processes. Besides the above, we've grown:
- a shared memory allocator
- a shared memory hashtable
- weird looking thread aware pointers
- significant added complexity in various projects due to addresses not
being mapped to the same address etc.
Yes, those are all workarounds for an ancient temporary design choice.
To quote from a 1989 paper [1]: "Currently, POSTGRES runs as one process
for each active user. This was done as an expedient to get a system
operational as quickly as possible. We plan on converting POSTGRES to
use lightweight processes [...]". +1 for sticking to the plan.
While personally contributing to the technical debt items listed
above, I always imagined that all that machinery could become
compile-time options controlled with --with-threads and
dsa_get_address() would melt away leaving only raw pointers, and
dsa_area would forward to the MemoryContext + ResourceOwner APIs, or
something like that. It's unfortunate that we lose type safety along
the way though. (If only there were some way we could write
dsa_pointer<my_type>. In fact it was also a goal of the original
project to adopt C++, based on a comment in 4.2's nodes.h: "Eventually
this code should be transmogrified into C++ classes, and this is more
or less compatible with those things.")
If there were a good way to reserve (but not map) a large address
range before forking, there could also be an intermediate build mode
that keeps the multi-process model but where DSA behaves as above,
which might be an interesting way to decouple the
DSA-go-faster-and-reduce-tech-debt project from the threading project.
We could manage the reserved address space ourselves and map DSM
segments with MAP_FIXED, so dsa_get_address() address decoding could
be compiled away. One way would be to mmap a huge range backed with
/dev/zero, and then map-with-MAP_FIXED segments over the top of it and
then remap /dev/zero back into place when finished, but that sucks
because it gives you that whole mapping in your core files and relies
on overcommit which we don't like, hence my interest in a way to
reserve but not map.
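For what it's worth, one approximation of "reserve but not map" on POSIX systems is a PROT_NONE anonymous mapping: it claims address space without committing memory (and, being inaccessible, shouldn't inflate core files the way a writable /dev/zero mapping does), and segments can later be placed inside it with MAP_FIXED. A Linux-flavored sketch (MAP_ANONYMOUS, MAP_NORESERVE):

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

#define RESERVED_SIZE (1UL << 30)   /* 1 GiB of address space */

/* Reserve address space only: PROT_NONE pages are neither readable
 * nor writable and don't count against overcommit accounting. */
void *reserve_region(void)
{
    return mmap(NULL, RESERVED_SIZE, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}

/* Later, place a real, usable segment at a fixed offset inside the
 * reservation; MAP_FIXED replaces the PROT_NONE pages there. */
void *place_segment(void *base, size_t offset, size_t size)
{
    return mmap((char *) base + offset, size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
}

int demo_reserve(void)
{
    void *base = reserve_region();

    if (base == MAP_FAILED)
        return -1;

    char *seg = place_segment(base, 4096, 4096);

    if (seg == MAP_FAILED)
        return -1;
    strcpy(seg, "hello");                   /* segment is usable ... */
    int ok = (seg == (char *) base + 4096); /* ... at a known address */
    munmap(base, RESERVED_SIZE);
    return ok ? 0 : -1;
}
```

With every process reserving the same range at startup, DSM segments mapped with MAP_FIXED would land at identical addresses everywhere, letting dsa_get_address() compile down to nothing.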
The first problem with porting Postgres to pthreads is static variables
widely used in Postgres code.
Most of modern compilers support thread local variables, for example GCC
provides __thread keyword.
Such variables are placed in separate segment which is address through
segment register (at Intel).
So access time to such variables is the same as to normal static variables.
I experimented similarly. Although I'm not 100% sure that if we were to go
for it, we wouldn't instead want to abstract our session concept
further, or well, at all.
Using a ton of thread local variables may be a useful stepping stone,
but if we want to be able to separate threads/processes from sessions
eventually then I guess we'll want to model sessions as first class
objects and pass them around explicitly or using a single TLS variable
current_session.
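A tiny sketch of that current_session idea (all names here are invented for illustration, nothing from Postgres): session state moves into a struct, and each thread holds a single TLS pointer to whichever session it is currently serving:

```c
#include <pthread.h>
#include <stddef.h>

/* All formerly-global session state lives in one struct. */
typedef struct Session
{
    int         session_id;
    const char *database_name;
    /* ... relation cache, GUC values, prepared statements ... */
} Session;

/* The single TLS variable: which session this thread serves right now. */
static __thread Session *current_session;

void attach_session(Session *s) { current_session = s; }

int my_session_id(void) { return current_session->session_id; }

/* A worker thread can be handed a different session than its creator,
 * which is what makes pooling executors across sessions possible. */
static void *worker(void *arg)
{
    attach_session((Session *) arg);
    return (void *) (long) my_session_id();
}

int demo_sessions(void)
{
    Session   s1 = {1, "db1"}, s2 = {2, "db2"};
    pthread_t t;
    void     *result;

    attach_session(&s1);                    /* main thread: session 1 */
    pthread_create(&t, NULL, worker, &s2);  /* worker: session 2 */
    pthread_join(t, &result);
    /* each thread saw only its own current_session */
    return (my_session_id() == 1 && (long) result == 2) ? 0 : -1;
}
```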
I think the biggest problem with doing this for real is that it's a huge
project, and that it'll take a long time.
Thanks for working on this!
+1
[1]: http://db.cs.berkeley.edu/papers/ERL-M90-34.pdf
--
Thomas Munro
http://www.enterprisedb.com
On 7 December 2017 at 01:17, Andres Freund <andres@anarazel.de> wrote:
I think you've done us a very substantial service by pursuing
this far enough to get some quantifiable performance results.
But now that we have some results in hand, I think we're best
off sticking with the architecture we've got.
I don't agree.
I'd personally expect that an immediate conversion would result in very
little speedup, a bunch of code deleted, a bunch of complexity
added. And it'd still be massively worthwhile, to keep medium to long
term complexity and feature viability in control.
Personally I think it's a pity we didn't land up here before the
foundations for parallel query went in - DSM, shm_mq, DSA, etc. I know the
EDB folks at least looked into it though, and presumably there were good
reasons to go in this direction. Maybe that was just "community will never
accept threaded conversion" at the time, though.
Now we have quite a lot of homebrew infrastructure to consider if we do a
conversion.
That said, it might in some ways make it easier. shm_mq, for example, would
likely convert to a threaded backend with minimal changes to callers, and
probably only limited changes to shm_mq itself. So maybe these
abstractions will prove to have been a win in some ways. Except DSA, and
even then it could serve as a transitional API...
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 7 December 2017 at 05:58, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
Using a ton of thread local variables may be a useful stepping stone,
but if we want to be able to separate threads/processes from sessions
eventually then I guess we'll want to model sessions as first class
objects and pass them around explicitly or using a single TLS variable
current_session.
Yep.
This is the real reason I'm excited by the idea of a threading conversion.
PostgreSQL's architecture conflates "connection", "session" and "executor"
into one somewhat muddled mess. I'd love to be able to untangle that to the
point where we can pool executors amongst active queries, while retaining
idle sessions' state properly even while they're in a transaction.
Yeah, that's a long way off, but it'd be a whole lot more practical if we
didn't have to serialize and deserialize the entire session state to do it.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: Craig Ringer [mailto:craig@2ndquadrant.com]
I'd personally expect that an immediate conversion would result in very
little speedup, a bunch of code deleted, a bunch of complexity
added. And it'd still be massively worthwhile, to keep medium to long
term complexity and feature viability in control.
+1
I hope for things like:
* More performance statistics like system-wide LWLock waits, without the concern about fixed shared memory size
* Dynamic memory sizing, such as shared_buffers, work_mem, maintenance_work_mem
* Running multi-threaded components in postgres extension (is it really safe to run JVM for PL/Java in a single-threaded postgres?)
Regards
Takayuki Tsunakawa
On 7 December 2017 at 11:44, Tsunakawa, Takayuki <
tsunakawa.takay@jp.fujitsu.com> wrote:
From: Craig Ringer [mailto:craig@2ndquadrant.com]
I'd personally expect that an immediate conversion would result in very
little speedup, a bunch of code deleted, a bunch of complexity
added. And it'd still be massively worthwhile, to keep medium to long
term complexity and feature viability in control.
+1
I hope for things like:
* More performance statistics like system-wide LWLock waits, without the
concern about fixed shared memory size
* Dynamic memory sizing, such as shared_buffers, work_mem,
maintenance_work_mem
I'm not sure how threaded operations would help us much there. If we could
split shared_buffers into extents we could do this with something like dsm
already. Without the ability to split it into extents, we can't do it with
locally malloc'd memory in a threaded system either.
Re performance diagnostics though, you can already get a lot of useful data
from PostgreSQL's SDT tracepoints, which are usable with perf and DTrace
amongst other tools. Dynamic userspace 'perf' probes can tell you a lot too.
I'm confident you could collect some seriously useful data with perf
tracepoints and 'perf script' these days. (BTW, I extended the
https://wiki.postgresql.org/wiki/Profiling_with_perf article a bit
yesterday with some tips on this).
Of course better built-in diagnostics would be nice. But I really don't see
how it'd have much to do with threaded vs forked model of execution; we can
allocate chunks of memory with dsm now, after all.
* Running multi-threaded components in postgres extension (is it really
safe to run JVM for PL/Java in a single-threaded postgres?)
PL/Java is a giant mess for so many more reasons than that. The JVM is a
heavyweight startup, lightweight thread model system. It doesn't play at
all well with postgres's lightweight process fork()-based CoW model. You
can't fork() the JVM because fork() doesn't play nice with threads, at all.
So you have to start it in each backend individually, which is just awful.
One of the nice things if Pg got a threaded model would be that you could
embed a JVM, Mono/.NET runtime, etc and have your sessions work together in
ways you cannot currently sensibly do. Folks using MS SQL, Oracle, etc are
pretty used to being able to do this, and while it should be done with
caution it can offer huge benefits for some complex workloads.
Right now if a PostgreSQL user wants to do anything involving IPC, shared
data, etc, we pretty much have to write quite complex C extensions to do it.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
But it is a theory. The main idea of this prototype was to prove or disprove
this expectation at practice.
But please notice that it is very raw prototype. A lot of stuff is not
working yet.
And supporting all of existing Postgres functionality requires
much more efforts (and even more efforts are needed for optimizing Postgres
for this architecture).
I just want to receive some feedback and know if community is interested in
any further work in this direction.
Looks good. You are right, it is a theory. If your prototype does
actually show what we think it does then it is a good and interesting
result.
I think we need careful analysis to show where these exact gains come
from. The actual benefit is likely not evenly distributed across the
list of possible benefits. Did they arise because you produced a
stripped down version of Postgres? Or did they arise from using
threads?
It would not be the first time a result shown in a prototype did not show
real gains on a completed project.
I might also read your results to show that connection concentrators
would be a better area of work, since 100 connections perform better
than 1000 in both cases, so why bother optimising for 1000 connections
at all? In which case we should read the benefit at the 100
connections line, where it shows the lower 28% gain, closer to the
gain your colleague reported.
So I think we don't yet have enough to make a decision.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I want to thank everybody for the feedback and the many useful comments.
I am very pleased with the community's interest in this topic and will
continue research in this direction.
Some more comments from my side:
My original intention was to implement some kind of built-in connection
pooling for Postgres: the ability to execute several transactions in one
backend.
It requires some kind of lightweight multitasking (coroutines). The
obvious candidate for it is libcore.
In this case we also need to solve the problem with static variables,
and __thread will not help here. We would have to collect all static
variables into some structure (a context)
and replace every reference to such a variable with an indirection through
a pointer. That is much harder to implement than annotating variable
definitions with __thread:
it requires changing all accesses to the variables, so almost all
Postgres code would have to be refactored.
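The difference from the __thread approach can be shown in a few lines (illustrative names, not Postgres code): every formerly-static variable moves into a context struct, and every access site must be rewritten to go through it, which is why this touches nearly all the code:

```c
/* State that used to live in file-scope statics now lives in a struct
 * that is passed around (or reached through one pointer) explicitly. */
typedef struct SessionContext
{
    int   command_counter;      /* was: static int command_counter; */
    char *current_query;        /* was: static char *current_query; */
} SessionContext;

/* Every access goes through the context pointer -- this is the part
 * that forces touching nearly all call sites, unlike __thread. */
int next_command_id(SessionContext *ctx)
{
    return ++ctx->command_counter;
}

int demo_contexts(void)
{
    SessionContext a = {0}, b = {0};

    /* one OS thread can interleave two sessions without interference */
    next_command_id(&a);
    next_command_id(&a);
    next_command_id(&b);
    return (a.command_counter == 2 && b.command_counter == 1) ? 0 : -1;
}
```

The payoff is that a session no longer needs its own thread at all, which is what coroutine-based pooling requires.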
Another problem with this approach is that we need asynchronous disk IO
for it. Unfortunately there is no good file AIO implementation for Linux.
Certainly we can spawn a dedicated IO thread (or threads) and queue IO
requests to it. But such an architecture seems to become quite complex.
Also, cooperative multitasking by itself is not able to load all CPU
cores, so we need several physical processes/threads to execute the
coroutines.
In theory such an architecture should provide the best performance and
scalability (handling hundreds of thousands of client connections), but
in practice there are a lot of pitfalls:
1. Right now each backend has its own local relation, catalog and
prepared statement caches. For a large database these caches can be
quite large: several megabytes.
So such coroutines become not so "lightweight" after all. The obvious
solution is to have global caches, or to combine global and local
caches. But that once again requires significant
changes in Postgres.
2. A large number of sessions makes the current procarray approach
almost unusable: we need some alternative implementation of
snapshots, for example CSN-based.
3. All locking mechanisms have to be rewritten.
So this approach almost excludes the possibility of evolving the
existing Postgres code base and requires a "revolution": rewriting most
Postgres components from scratch and refactoring almost all of the
remaining code.
This is why I had to abandon moving in this direction.
Replacing processes with threads can be considered just a first step,
and it requires changes in many Postgres components if we really want
to get significant advantages from it.
But at least such work can be split into several phases, and it is
possible to support both the multithreaded and multiprocess models in
the same codebase for some time.
Below I want to summarize the most important (from my point of view)
pro/contra arguments for multithreading that I got from your feedback:
Pros:
1. Simplified memory model: no need for DSM, shm_mq, DSA, etc.
2. Efficient integration of PLs that support multithreaded execution,
first of all Java
3. Smaller memory footprint, faster context switching, more efficient
use of the TLB
Cons:
1. Breaks compatibility with existing extensions and adds more
requirements for authors of new extensions
2. Problems with integrating single-threaded PLs: Python, Lua, ...
3. Weaker protection from programming errors, including errors in
extensions
4. Lack of explicit separation of shared and private memory leads to
more synchronization errors.
Right now in Postgres there is a strict distinction between shared
memory and private memory, so it is clear to the programmer
whether (s)he is working with shared data and therefore needs some kind
of synchronization to avoid race conditions.
With pthreads all memory is shared, and more care is needed when working
with it.
So pthreads can help to increase scalability, but still do not help much
with implementing built-in connection pooling, autonomous
transactions, ...
The current 50% improvement in select speed for a large number of
connections certainly cannot be considered sufficient motivation for
such radical changes to the Postgres architecture.
But it is just a first step, and much more benefit can be obtained by
adapting Postgres to this model.
It is hard for me to estimate now the full complexity of switching to
the thread model and all the advantages we could get from it.
First of all I am going to repeat my benchmarks on SMP machines with a
large number of cores (so that 100 or more active backends can really
be useful even with connection pooling).
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi
On 06.12.2017 20:08, Andres Freund wrote:
4. Rewrite file descriptor cache to be global (shared by all threads).
That one I'm very unconvinced of, that's going to add a ton of new
contention.
Do you mean lock contention on the mutex I used to synchronize access
to the shared file descriptor cache,
or contention for file descriptors?
Right now each thread has its own virtual file descriptors, so they are
not shared between threads.
But there is a common LRU restricting the total number of opened
descriptors in the process.
Actually I have no other choice if I want to support thousands of
connections.
If each thread had its own private descriptor cache (as it is now for
processes), with its size estimated based on the open file quota,
there would be millions of opened file descriptors.
Concerning contention on the mutex, I do not think it is a problem.
At least I can say that performance (with 100 connections) is
significantly improved and shows almost the same speed as for 10
connections
after I rewrote the file descriptor cache and made it global
(my original implementation just made all fd.c static variables thread
local, so each thread had its own separate pool).
It is possible to go further and share file descriptors between
threads, using pwrite/pread instead of seek+read/write.
But we still need a mutex to implement the LRU list and the free
handler list.
On 07.12.2017 00:58, Thomas Munro wrote:
Using a ton of thread local variables may be a useful stepping stone,
but if we want to be able to separate threads/processes from sessions
eventually then I guess we'll want to model sessions as first class
objects and pass them around explicitly or using a single TLS variable
current_session.
That was my primary intention.
Unfortunately, separating all static variables into some kind of
session context requires much more effort:
we have to change all accesses to such variables.
But please notice that, from a performance point of view, access to
__thread variables is no more expensive than access to a static
variable or to fields of a session context structure through
current_session.
And there is no extra space overhead for them.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 7 December 2017 at 19:55, Konstantin Knizhnik <k.knizhnik@postgrespro.ru>
wrote:
Pros:
1. Simplified memory model: no need for DSM, shm_mq, DSA, etc.
shm_mq would remain useful, and the others could only be dropped if you
also dropped process-model support entirely.
1. Breaks compatibility with existing extensions and adds more
requirements for authors of new extensions
Depends on how much frightening preprocessor magic you're willing to
use, doesn't it? ;)
Wouldn't be surprised if simple extensions (C functions etc.) stayed
fairly happy, but it'd be hazardous enough in terms of library use etc.
that deliberate breakage may be better.
2. Problems with integration of single-threaded PLs: Python, Lua,...
Yeah, that's going to hurt. Especially since most non-plpgsql code out
there will be plperl and plpython. Breaking that's not going to be an
option, but nobody's going to be happy if all postgres backends must
contend for the same Python GIL. Plus it'd be deadlock-city.
That's nearly a showstopper right there. Especially since with a quick
look around it looks like the cPython GIL is per-DLL (at least on
Windows), not per-interpreter-state, so spawning separate interpreter
states per-thread may not be sufficient. That makes sense given that
cPython itself is thread-aware; otherwise it'd have a really hard time
figuring out which GIL and interpreter state to look at when in a
cPython-spawned thread.
3. Weaker protection from programming errors, including errors in
extensions.
Mainly contaminating memory of unrelated processes, or the postmaster.
I'm not worried about outright crashes. On any modern system it's not
significantly worse to take down the postmaster than it is to have it do
its own recovery. A modern init will restart it promptly. (If you're not
running postgres under an init daemon for production then... well, you
should be.)
4. Lack of explicit separation of shared and private memory leads to
more synchronization errors.
Accidentally clobbering postmaster memory/state would be my main worry
there.
Right now we gain a lot of protection from our copy-on-write
shared-nothing-by-default model, and we rely on it in quite a lot of places
where backends merrily stomp on inherited postmaster state.
The more I think about it, the less enthusiastic I am, really.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services