bg worker: patch 1 of 6 - permanent process

Started by Markus Wanner, almost 16 years ago · 60 messages · pgsql-hackers
#1 Markus Wanner
markus@bluegap.ch

This patch turns the existing autovacuum launcher into an always-running
process, partly called the coordinator. If autovacuum is disabled, the
coordinator process still gets started and sticks around, but it doesn't
dispatch vacuum jobs. The coordinator process now uses imessages to
communicate with background (autovacuum) workers and to trigger a vacuum
job. So please apply the imessages patches [1] before any of the bg
worker ones.

It also adds two new controlling GUCs: min_spare_background_workers and
max_spare_background_workers. The autovacuum_max_workers setting still
serves as a limit on the total number of background/autovacuum workers.
(It is going to be renamed in step 4.)

Interaction with the postmaster has changed a bit. If autovacuum is
disabled, the coordinator isn't started with
PMSIGNAL_START_AUTOVAC_LAUNCHER anymore, instead there is an
IMSGT_FORCE_VACUUM that any backend might want to send to the
coordinator to prevent data loss due to XID wrap around (see changes in
access/transam/varsup.c). The SIGUSR2 from postmaster to the coordinator
doesn't need to be multiplexed anymore, but is only sent to inform about
fork failures.

A note on the dependency on imessages: for just autovacuum, this message
passing infrastructure isn't absolutely necessary and could be removed.
However, for Postgres-R it turned out to be really helpful and I think
chances are good another user of this background worker infrastructure
would also want to transfer data of varying size to and from these workers.

Just as in the current version of Postgres, the background worker
terminates immediately after having performed a vacuum job.

Open issue: if the postmaster fails to fork a new background worker, the
coordinator still waits a whole second after receiving the SIGUSR2
notification signal from the postmaster. That might have been fine with
just autovacuum, but for other jobs, namely changeset application in
Postgres-R, that's not feasible.

[1]: dynshmem and imessages patch:
http://archives.postgresql.org/message-id/ab0cd52a64e788f4ecb4515d1e6e4691@localhost

Attachments:

step1-permanent_process.diff (text/x-diff; charset=iso-8859-1; +537 −379)
#2 Itagaki Takahiro
itagaki.takahiro@gmail.com
In reply to: Markus Wanner (#1)
Re: bg worker: patch 1 of 6 - permanent process

On Tue, Jul 13, 2010 at 11:31 PM, Markus Wanner <markus@bluegap.ch> wrote:

This patch turns the existing autovacuum launcher into an always-running
process, partly called the coordinator. If autovacuum is disabled, the
coordinator process still gets started and sticks around, but it doesn't
dispatch vacuum jobs.

I think this part is a reasonable proposal, but...

The coordinator process now uses imessages to communicate with background
(autovacuum) workers and to trigger a vacuum job.
It also adds two new controlling GUCs: min_spare_background_workers and
max_spare_background_workers.

Other changes in the patch don't all seem to be needed for the purpose;
in other words, the patch is not minimal.
The original purpose could be done without IMessage.
Also, min/max_spare_background_workers are not used in the patch at all.
(BTW, min/max_spare_background_*helpers* in postgresql.conf.sample is
probably a typo.)

The most questionable point for me is why you didn't add any hook functions
in the coordinator process. With the patch, you can extend the coordinator
protocols with IMessage, but it requires patches to core at handle_imessage().
If you want fully-extensible workers, we should provide a method to develop
worker codes in an external plugin.

Is it possible to develop your own codes in the plugin? If possible, you can
use IMessage as a private protocol freely in the plugin. Am I missing something?

--
Itagaki Takahiro

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Itagaki Takahiro (#2)
Re: bg worker: patch 1 of 6 - permanent process

On Wed, Aug 25, 2010 at 9:39 PM, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:

On Tue, Jul 13, 2010 at 11:31 PM, Markus Wanner <markus@bluegap.ch> wrote:

This patch turns the existing autovacuum launcher into an always-running
process, partly called the coordinator. If autovacuum is disabled, the
coordinator process still gets started and sticks around, but it doesn't
dispatch vacuum jobs.

I think this part is a reasonable proposal, but...

It's not clear to me whether it's better to have a single coordinator
process that handles both autovacuum and other things, or whether it's
better to have two separate processes.

The coordinator process now uses imessages to communicate with background
(autovacuum) workers and to trigger a vacuum job.
It also adds two new controlling GUCs: min_spare_background_workers and
max_spare_background_workers.

Other changes in the patch don't all seem to be needed for the purpose;
in other words, the patch is not minimal.
The original purpose could be done without IMessage.
Also, min/max_spare_background_workers are not used in the patch at all.
(BTW, min/max_spare_background_*helpers* in postgresql.conf.sample is
probably a typo.)

The most questionable point for me is why you didn't add any hook functions
in the coordinator process. With the patch, you can extend the coordinator
protocols with IMessage, but it requires patches to core at handle_imessage().
If you want fully-extensible workers, we should provide a method to develop
worker codes in an external plugin.

I agree with this criticism, but the other thing that strikes me as a
nonstarter is having the postmaster participate in the imessages
framework. Our general rule is that the postmaster must avoid
touching shared memory; else a backend that scribbles on shared memory
might take out the postmaster, leading to a failure of the
crash-and-restart logic.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#4 Itagaki Takahiro
itagaki.takahiro@gmail.com
In reply to: Robert Haas (#3)
Re: bg worker: patch 1 of 6 - permanent process

On Thu, Aug 26, 2010 at 11:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jul 13, 2010 at 11:31 PM, Markus Wanner <markus@bluegap.ch> wrote:

This patch turns the existing autovacuum launcher into an always-running
process, partly called the coordinator.

It's not clear to me whether it's better to have a single coordinator
process that handles both autovacuum and other things, or whether it's
better to have two separate processes.

Ah, we can separate the proposal into two topics:
A. Support to run non-vacuum jobs from autovacuum launcher
B. Support "user defined background processes"

A was proposed in the original "1 of 6" patch, but B might be more general.
If we have a separated coordinator, B will be required.

Markus, do you need B? Or A + standard backend processes are enough?
If you need B eventually, starting with B might be better.

--
Itagaki Takahiro

#5 Markus Wanner
markus@bluegap.ch
In reply to: Robert Haas (#3)
Re: bg worker: patch 1 of 6 - permanent process

Hi,

thanks for your feedback on this, it sort of got lost below the
discussion about the dynamic shared memory stuff, IMO.

On 08/26/2010 04:39 AM, Robert Haas wrote:

It's not clear to me whether it's better to have a single coordinator
process that handles both autovacuum and other things, or whether it's
better to have two separate processes.

It has been proposed by Alvaro and/or Tom (IIRC) to reduce code
duplication. Compared to the former approach, it certainly seems cleaner
that way and it has helped reduce duplicate code a lot.

I'm envisioning such a coordinator process to handle coordination of
other background processes as well, for example for distributed and/or
parallel querying.

Having just one process reduces the amount of interaction required
between coordinators (e.g. autovacuum shouldn't ever start on databases
for which replication hasn't started yet, as the autovacuum worker would
be unable to connect to the database at that stage). It also reduces the
number of extra processes required, and thus, I think, overall
complexity.

What'd be the benefits of having separate coordinator processes? They'd
be doing pretty much the same: coordinate background processes. (And
yes, I clearly consider autovacuum to be just one kind of background
process).

I agree with this criticism, but the other thing that strikes me as a
nonstarter is having the postmaster participate in the imessages
framework.

This is simply not the case (anymore). (And one of the reasons a
separate coordinator process is required, instead of letting the
postmaster do this kind of coordination).

Our general rule is that the postmaster must avoid
touching shared memory; else a backend that scribbles on shared memory
might take out the postmaster, leading to a failure of the
crash-and-restart logic.

That rule is well understood and followed by the bg worker
infrastructure patches. If you find code for which that isn't true,
please point at it. The crash-and-restart logic should work just as it
did with the autovacuum launcher.

Regards

Markus

#6 Markus Wanner
markus@bluegap.ch
In reply to: Itagaki Takahiro (#2)
Re: bg worker: patch 1 of 6 - permanent process

Itagaki-san,

thanks for reviewing this.

On 08/26/2010 03:39 AM, Itagaki Takahiro wrote:

Other changes in the patch don't all seem to be needed for the purpose;
in other words, the patch is not minimal.

Hm.. yeah, maybe the separation between step1 and step2 is a bit
arbitrary. I'll look into it.

The original purpose could be done without IMessage.

Agreed, that's the one exception. I've mentioned why that is and I don't
currently feel like coding an unneeded variant which doesn't use imessages.

Also, min/max_spare_background_workers are not used in the patch at all.

You are right, it only starts to get used in step2, so the addition
should probably move there, right.

(BTW, min/max_spare_background_*helpers* in postgresql.conf.sample is
maybe typo.)

Uh, correct, thank you for pointing this out. (I had originally named
these processes "helpers". After merging with autovacuum, it made more
sense to call them background *workers*.)

The most questionable point for me is why you didn't add any hook functions
in the coordinator process.

Because I'm a hook-hater. ;-)

No, seriously: I don't see what problem hooks could have solved. I'm
coding in C and extending the Postgres code. Deciding for hooks and an
API to use them requires good knowledge of where exactly you want to
hook and what API you want to provide. Then that API needs to remain
stable for an extended time. I don't think any part of the bg worker
infrastructure currently is anywhere close to that.

With the patch, you can extend the coordinator
protocols with IMessage, but it requires patches to core at handle_imessage().
If you want fully-extensible workers, we should provide a method to develop
worker codes in an external plugin.

It's originally intended as internal infrastructure. Offering its
capabilities to the outside would require stabilization, security
control and working out an API. All of which is certainly not something
I intend to do.

Is it possible to develop your own codes in the plugin? If possible, you can
use IMessage as a private protocol freely in the plugin. Am I missing something?

Well, what problem(s) are you trying to solve with such a thing? I've no
idea what direction you are aiming at, sorry. However, it's certainly
different from, or an extension of, bg worker, so it would need to be a
separate patch, IMO.

Regards

Markus Wanner

#7 Markus Wanner
markus@bluegap.ch
In reply to: Itagaki Takahiro (#4)
Re: bg worker: patch 1 of 6 - permanent process

On 08/26/2010 05:01 AM, Itagaki Takahiro wrote:

Markus, do you need B? Or A + standard backend processes are enough?
If you need B eventually, starting with B might be better.

No, I certainly don't need B.

Why not just use an ordinary backend to do "user defined background
processing"? It covers all of the API stability and the security issues
I've raised.

Regards

Markus Wanner

#8 Itagaki Takahiro
itagaki.takahiro@gmail.com
In reply to: Markus Wanner (#7)
Re: bg worker: patch 1 of 6 - permanent process

On Thu, Aug 26, 2010 at 7:42 PM, Markus Wanner <markus@bluegap.ch> wrote:

Markus, do you need B? Or A + standard backend processes are enough?

No, I certainly don't need B.

OK, now I see why you proposed a coordinator hook (yeah, I call it a
hook :) rather than adding user-defined processes.

Why not just use an ordinary backend to do "user defined background
processing"? It covers all of the API stability and the security issues I've
raised.

However, we have autovacuum worker processes in addition to normal
backend processes. Doesn't that show that there are some jobs we cannot
run in normal backends?

For example, normal backends cannot do anything in idle time, so a
time-based polling job is difficult to implement in a backend. It might
be OK to fork a process for each interval when the polling interval is
long, but that is not efficient for short intervals. I'd like to use
such a process as an additional stats collector.

--
Itagaki Takahiro

#9 Markus Wanner
markus@bluegap.ch
In reply to: Itagaki Takahiro (#8)
Re: bg worker: patch 1 of 6 - permanent process

Itagaki-san,

On 08/26/2010 01:02 PM, Itagaki Takahiro wrote:

OK, I see why you proposed coordinator hook (yeah, I call it hook :)
rather than adding user-defined processes.

I see. If you call that a hook, I'm definitely not a hook-hater ;-) at
least not according to your definition.

However, we have autovacuum worker processes in addition to normal
backend processes. Doesn't that show that there are some jobs we cannot
run in normal backends?

Hm.. understood. You can use VACUUM from a cron job. And that's the
problem autovacuum solves. So in a way, that's just a convenience
feature. You want the same for general purpose user defined background
processing, right?

For example, normal backends cannot do anything in idle time, so a
time-based polling job is difficult to implement in a backend. It might
be OK to fork a process for each interval when the polling interval is
long, but that is not efficient for short intervals. I'd like to use
such a process as an additional stats collector.

Did you follow the discussion I had with Dimitri, who was trying
something similar, IIRC? See the "bg worker - overview" thread. There
might be some interesting bits pointing in that direction.

Regards

Markus

#10 Robert Haas
robertmhaas@gmail.com
In reply to: Markus Wanner (#5)
Re: bg worker: patch 1 of 6 - permanent process

On Thu, Aug 26, 2010 at 6:07 AM, Markus Wanner <markus@bluegap.ch> wrote:

What'd be the benefits of having separate coordinator processes? They'd be
doing pretty much the same: coordinate background processes. (And yes, I
clearly consider autovacuum to be just one kind of background process).

I dunno. It was just a thought. I haven't actually looked at the
code to see how much synergy there is. (Sorry, been really busy...)

I agree with this criticism, but the other thing that strikes me as a
nonstarter is having the postmaster participate in the imessages
framework.

This is simply not the case (anymore). (And one of the reasons a separate
coordinator process is required, instead of letting the postmaster do this
kind of coordination).

Oh, OK. I see now that I misinterpreted what you wrote.

On the more general topic of imessages, I had one other thought that
might be worth considering. Instead of using shared memory, what
about using a file that is shared between the sender and receiver? So
for example, perhaps each receiver will read messages from a file
called pg_messages/%d, where %d is the backend ID. And writers will
write into that file. Perhaps both readers and writers mmap() the
file, or perhaps there's a way to make it work with just read() and
write(). If you actually mmap() the file, you could probably manage
it in a fashion pretty similar to what you had in mind for wamalloc,
or some other setup that minimizes locking. In particular, ISTM that
if we want this to be usable for parallel query, we'll want to be able
to have one process streaming data in while another process streams
data out, with minimal interference between these two activities. On
the other hand, for processes that only send and receive messages
occasionally, this might just be overkill (and overhead). You'd be
just as well off wrapping the access to the file in an LWLock: the
reader takes the lock, reads the data, marks it read, and releases the
lock. The writer takes the lock, writes data, and releases the lock.

It almost seems to me that there are two different kinds of messages
here: control messages and data messages. Control messages are things
like "vacuum this database!" or "flush your cache!" or "execute this
query and send the results to backend %d!" or "cancel the currently
executing query!". They are relatively small (in some cases,
fixed-size), relatively low-volume, don't need complex locking, and
can generally be processed serially but with high priority. Data
messages are streams of tuples, either from a remote database from
which we are replicating, or between backends that are executing a
parallel query. These messages may be very large and extremely
high-volume, are very sensitive to concurrency problems, but are not
high-priority. We want to process them as quickly as possible, of
course, but the work may get interrupted by control messages. Another
point is that it's reasonable, at least in the case of parallel query,
for the action of sending a data message to *block*. If one part of
the query is too far ahead of the rest of the query, we don't want to
queue up results forever, perhaps using CPU or I/O resources that some
other backend needs to catch up, exhausting available disk space, etc.
Instead, at some point, we just block and wait for the queue to
drain. I suppose there's no avoiding the possibility that sending a
control message might also block, but certainly we wouldn't like a
control message to block because the relevant queue is full of data
messages.

So I kind of wonder whether we ought to have two separate systems, one
for data and one for control, with somewhat different characteristics.
I notice that one of your bg worker patches is for OOO-messages. I
apologize again for not having read through it, but how much does that
resemble separating the control and data channels?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#11 Markus Wanner
markus@bluegap.ch
In reply to: Robert Haas (#10)
Re: bg worker: patch 1 of 6 - permanent process

Robert,

On 08/26/2010 02:44 PM, Robert Haas wrote:

I dunno. It was just a thought. I haven't actually looked at the
code to see how much synergy there is. (Sorry, been really busy...)

No problem, was just wondering if there's any benefit you had in mind.

On the more general topic of imessages, I had one other thought that
might be worth considering. Instead of using shared memory, what
about using a file that is shared between the sender and receiver?

What would that buy us? (At the price of more system calls and disk
I/O)? Remember that the current approach (IIRC) uses exactly one syscall
to send a message: kill() to send the (multiplexed) signal. (Except on
strange platforms or setups that don't have a user-space spinlock
implementation and need to use system mutexes).

So
for example, perhaps each receiver will read messages from a file
called pg_messages/%d, where %d is the backend ID. And writers will
write into that file. Perhaps both readers and writers mmap() the
file, or perhaps there's a way to make it work with just read() and
write(). If you actually mmap() the file, you could probably manage
it in a fashion pretty similar to what you had in mind for wamalloc,
or some other setup that minimizes locking.

That would still require proper locking, then. So I'm not seeing the
benefit.

In particular, ISTM that
if we want this to be usable for parallel query, we'll want to be able
to have one process streaming data in while another process streams
data out, with minimal interference between these two activities.

That's well possible with the current approach. About the only
limitation is that a receiver can only consume the messages in the order
they got into the queue. But pretty much any backend can send messages
to any other backend concurrently.

(Well, except that I think there currently are bugs in wamalloc).

On
the other hand, for processes that only send and receive messages
occasionally, this might just be overkill (and overhead). You'd be
just as well off wrapping the access to the file in an LWLock: the
reader takes the lock, reads the data, marks it read, and releases the
lock. The writer takes the lock, writes data, and releases the lock.

The current approach uses plain spinlocks, which are more efficient.
Note that both, appending as well as removing from the queue are writing
operations, from the point of view of the queue. So I don't think
LWLocks buy you anything here, either.

It almost seems to me that there are two different kinds of messages
here: control messages and data messages. Control messages are things
like "vacuum this database!" or "flush your cache!" or "execute this
query and send the results to backend %d!" or "cancel the currently
executing query!". They are relatively small (in some cases,
fixed-size), relatively low-volume, don't need complex locking, and
can generally be processed serially but with high priority. Data
messages are streams of tuples, either from a remote database from
which we are replicating, or between backends that are executing a
parallel query. These messages may be very large and extremely
high-volume, are very sensitive to concurrency problems, but are not
high-priority. We want to process them as quickly as possible, of
course, but the work may get interrupted by control messages. Another
point is that it's reasonable, at least in the case of parallel query,
for the action of sending a data message to *block*. If one part of
the query is too far ahead of the rest of the query, we don't want to
queue up results forever, perhaps using CPU or I/O resources that some
other backend needs to catch up, exhausting available disk space, etc.

I agree that such a thing isn't currently covered. And it might be
useful. However, adding two separate queues with different priority
would be very simple to do. (Note, however, that there already are the
standard unix signals for very simple kinds of control signals. I.e. for
aborting a parallel query, you could simply send SIGINT to all
background workers involved).

I understand the need to limit the amount of data in flight, but I don't
think that sending any type of message should ever block. Messages are
atomic in that regard. Either they are ready to be delivered (in
entirety) or not. Thus the sender needs to hold back the message, if the
recipient is overloaded. (Also note that currently imessages are bound
to a maximum size of around 8 KB).

It might be interesting to note that I've just implemented some kind of
streaming mechanism *atop* of imessages for Postgres-R. A data stream
gets fragmented into single messages. As you pointed out, there should
be some kind of congestion control. However, in my case, that needs to
cover the inter-node connection as well, not just imessages. So I think
the solution to that problem needs to be found on a higher level. I.e.
in the Postgres-R case, I want to limit the *overall* amount of recovery
data that's pending for a certain node, not just the amount pending on a
certain stream within the imessages system.

Think of imessages as the IP between processes, while streaming of data
needs something akin to TCP on top of it. (OTOH, this comparison is
lacking, because imessages guarantee reliable and ordered delivery of
messages).

BTW: why do you think the data-heavy messages are sensitive to
concurrency problems? I found the control messages to be rather more
sensitive, as state changes and timing for those control messages are
trickier to deal with.

So I kind of wonder whether we ought to have two separate systems, one
for data and one for control, with somewhat different characteristics.
I notice that one of your bg worker patches is for OOO-messages. I
apologize again for not having read through it, but how much does that
resemble separating the control and data channels?

It's something that resides within the coordinator process exclusively
and doesn't have much to do with imessages. Postgres-R doesn't require
the GCS to deliver (certain kind of) messages in any order, it only
requires the GCS to guarantee reliability of message delivery (or
notification in the form of excluding the failing node from the group in
case delivery failed).

Thus, the coordinator needs to be able to re-order the messages, because
bg workers need to receive the change sets in the correct order. And
imessages guarantees to maintain the ordering.

The reason for doing this within the coordinator is to a) lower
requirements for the GCS and b) gain more control of the data flow. I.e.
congestion control gets much easier, if the coordinator knows the amount
of data that's queued. (As opposed to having lots of TCP connections,
each of which queues an unknown amount of data).

As is evident, all of these decisions are rather Postgres-R centric.
However, I still think the simplicity and the level of generalization of
imessages, dynamic shared memory, and to some extent even the background
worker infrastructure makes these components potentially re-usable.

Regards

Markus Wanner

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Markus Wanner (#11)
Re: bg worker: patch 1 of 6 - permanent process

Markus Wanner <markus@bluegap.ch> writes:

On 08/26/2010 02:44 PM, Robert Haas wrote:

On the more general topic of imessages, I had one other thought that
might be worth considering. Instead of using shared memory, what
about using a file that is shared between the sender and receiver?

What would that buy us?

Not having to have a hard limit on the space for unconsumed messages?

The current approach uses plain spinlocks, which are more efficient.

Please note the coding rule that says that the code should not execute
more than a few straight-line instructions while holding a spinlock.
If you're copying long messages while holding the lock, I don't think
spinlocks are acceptable.

regards, tom lane

#13 Markus Wanner
markus@bluegap.ch
In reply to: Tom Lane (#12)
Re: bg worker: patch 1 of 6 - permanent process

On 08/26/2010 09:22 PM, Tom Lane wrote:

Not having to have a hard limit on the space for unconsumed messages?

Ah, I see. However, spilling to disk is unwanted for the current use
cases of imessages. Instead the sender needs to be able to deal with
out-of-(that-specific-part-of-shared)-memory conditions.

Please note the coding rule that says that the code should not execute
more than a few straight-line instructions while holding a spinlock.
If you're copying long messages while holding the lock, I don't think
spinlocks are acceptable.

Writing the payload data for imessages to shared memory doesn't need any
kind of lock. (Because the relevant chunk of shared memory got allocated
via wamalloc, which grants the allocator exclusive control over the
returned chunk). Only appending and removing (the pointer to the data)
to and from the queue requires taking a spinlock. And I think that still
qualifies.

However, your concern is valid for wamalloc, which is more critical in
that regard.

Regards

Markus

#14 Robert Haas
robertmhaas@gmail.com
In reply to: Markus Wanner (#11)
Re: bg worker: patch 1 of 6 - permanent process

On Thu, Aug 26, 2010 at 3:03 PM, Markus Wanner <markus@bluegap.ch> wrote:

On the more general topic of imessages, I had one other thought that
might be worth considering.  Instead of using shared memory, what
about using a file that is shared between the sender and receiver?

What would that buy us? (At the price of more system calls and disk I/O)?
Remember that the current approach (IIRC) uses exactly one syscall to send a
message: kill() to send the (multiplexed) signal. (Except on strange
platforms or setups that don't have a user-space spinlock implementation and
need to use system mutexes).

It wouldn't require you to preallocate a big chunk of shared memory
without knowing how much of it you'll actually need. For example,
suppose we implement parallel query. If the message queues can be
allocated on the fly, then you can just say
maximum_message_queue_size_per_backend = 16MB and that'll probably be
good enough for most installations. On systems where parallel query
is not used (e.g. because they only have 1 or 2 processors) then it
costs nothing. On systems where parallel query is used extensively
(e.g. because they have 32 processors), you'll allocate enough space
for the number of backends that actually need message buffers, and not
more than that. Furthermore, if parallel query is used at some times
(say, for nightly reporting) but not others (say, for daily OLTP
queries), the buffers can be deallocated when the helper backends exit
(or paged out if they are idle), and that memory can be reclaimed for
other use.

In addition, it means that maximum_message_queue_size_per_backend (or
whatever it's called) can be changed on-the-fly; that is, it can be
PGC_SIGHUP rather than PGC_POSTMASTER. Being able to change GUCs
without shutting down the postmaster is a *big deal* for people
running in 24x7 operations. Even things like wal_level that aren't
apt to be changed more than once in a blue moon are a problem (once
you go from "not having a standby" to "having a standby", you're
unlikely to want to go backwards), and this would likely need more
tweaking. You might find that you need more memory for better
throughput, or that you need to reclaim memory for other purposes.
Especially if it's a hard allocation for any number of backends,
rather than something that backends can allocate only as and when they
need it.

As to efficiency, the process is not much different once the initial
setup is completed. Just because you write to a memory-mapped file
rather than a shared memory segment doesn't mean that you're
necessarily doing disk I/O. On systems that support it, you could
also choose to map a named POSIX shm rather than a disk file. Either
way, there might be a little more overhead at startup but that doesn't
seem so bad; presumably the amount of work that the worker is doing is
large compared to the overhead of a few system calls, or you're
probably in trouble anyway, since our process startup overhead is
pretty substantial already. The only time it seems like the overhead
would be annoying is if a process is going to use this system, but
only lightly. Doing the extra setup just to send one or two messages
might suck. But maybe that just means this isn't the right mechanism
for those cases (e.g. the existing XID-wraparound logic should still
use signal multiplexing rather than this system). I see the value of
this as being primarily for streaming big chunks of data, not so much
for sending individual, very short messages.

On
the other hand, for processes that only send and receive messages
occasionally, this might just be overkill (and overhead).  You'd be
just as well off wrapping the access to the file in an LWLock: the
reader takes the lock, reads the data, marks it read, and releases the
lock.  The writer takes the lock, writes data, and releases the lock.

The current approach uses plain spinlocks, which are more efficient. Note
that both appending to and removing from the queue are write operations,
from the point of view of the queue. So I don't think LWLocks buy you
anything here, either.

I agree that this might not be useful. We don't really have all the
message types defined yet, though, so it's hard to say.

I understand the need to limit the amount of data in flight, but I don't
think that sending any type of message should ever block. Messages are
atomic in that regard. Either they are ready to be delivered (in entirety)
or not. Thus the sender needs to hold back the message, if the recipient is
overloaded. (Also note that currently imessages are bound to a maximum size
of around 8 KB).

That's functionally equivalent to blocking, isn't it? I think that's
just a question of what API you want to expose.

It might be interesting to note that I've just implemented some kind of
streaming mechanism *atop* of imessages for Postgres-R. A data stream gets
fragmented into single messages. As you pointed out, there should be some
kind of congestion control. However, in my case, that needs to cover the
inter-node connection as well, not just imessages. So I think the solution
to that problem needs to be found on a higher level. I.e. in the Postgres-R
case, I want to limit the *overall* amount of recovery data that's pending
for a certain node. Not just the amount that's pending on a certain stream
or within the imessages system.

For replication, that might be the case, but for parallel query,
per-queue seems about right. At any rate, no design we've discussed
will let individual queues grow without bound.

Think of imessages as the IP between processes, while streaming of data
needs something akin to TCP on top of it. (OTOH, this comparison is lacking,
because imessages guarantee reliable and ordered delivery of messages).

You probably need this, but 8KB seems like a pretty small chunk size.
I think one of the advantages of a per-backend area is that you don't
need to worry so much about fragmentation. If you only need in-order
message delivery, you can just use the whole thing as a big ring
buffer. There's no padding or sophisticated allocation needed. You
just need a pointer to the last byte read (P1), the last byte allowed
to be read (P2), and the last byte allocated (P3). Writers take a
spinlock, advance P3, release the spinlock, write the message, take
the spinlock, advance P2, release the spinlock, and signal the reader.
Readers take the spinlock, read P1 and P2, release the spinlock, read
the data, take the spinlock, advance P1, and release the spinlock.

You might still want to fragment chunks of data to avoid problems if,
say, two writers are streaming data to a single reader. In that case,
if the messages were too large compared to the amount of buffer space
available, you might get poor utilization, or even starvation. But I
would think you wouldn't need to worry about that until the message
size got fairly high.

BTW: why do you think the data heavy messages are sensitive to concurrency
problems? I found the control messages to be rather more sensitive, as state
changes and timing for those control messages are trickier to deal with.

Well, what I was thinking about is the fact that data messages are
bigger. If I'm writing a 16-byte message once a minute and the reader
and I block each other until the message is fully read or written,
it's not really that big of a deal. If the same thing happens when
we're trying to continuously stream tuple data from one process to
another, it halves the throughput; we expect both processes to be
reading/writing almost constantly.

So I kind of wonder whether we ought to have two separate systems, one
for data and one for control, with somewhat different characteristics.
I notice that one of your bg worker patches is for OOO-messages.  I
apologize again for not having read through it, but how much does that
resemble separating the control and data channels?

It's something that resides within the coordinator process exclusively and
doesn't have much to do with imessages.

Oh, OK.

As is evident, all of these decisions are rather Postgres-R centric.
However, I still think the simplicity and the level of generalization of
imessages, dynamic shared memory and to some extent even the background
worker infrastructure makes these components potentially re-usable.

I think unicast messaging is really useful and I really want it, but
the requirement that it be done through dynamic shared memory
allocations feels very uncomfortable to me (as you've no doubt
gathered).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#15Robert Haas
robertmhaas@gmail.com
In reply to: Markus Wanner (#13)
Re: bg worker: patch 1 of 6 - permanent process

On Thu, Aug 26, 2010 at 3:40 PM, Markus Wanner <markus@bluegap.ch> wrote:

On 08/26/2010 09:22 PM, Tom Lane wrote:

Not having to have a hard limit on the space for unconsumed messages?

Ah, I see. However, spilling to disk is unwanted for the current use cases
of imessages. Instead the sender needs to be able to deal with
out-of-(that-specific-part-of-shared)-memory conditions.

Shared memory can be paged out, too, if it's not being used enough to
keep the OS from deciding to evict it. And I/O to a mmap()'d file or
shared memory region can remain in RAM.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#16Markus Wanner
markus@bluegap.ch
In reply to: Robert Haas (#14)
Re: bg worker: patch 1 of 6 - permanent process

Hi,

On 08/26/2010 11:57 PM, Robert Haas wrote:

It wouldn't require you to preallocate a big chunk of shared memory

Agreed, you wouldn't have to allocate it in advance. We would still want
a configurable upper limit. So this can be seen as another approach for
an implementation of a dynamic allocator. (Which should be separate from
the exact imessages implementation, just for the sake of modularization
already, IMO).

In addition, it means that maximum_message_queue_size_per_backend (or
whatever it's called) can be changed on-the-fly; that is, it can be
PGC_SIGHUP rather than PGC_POSTMASTER.

That's certainly a point. However, as you are proposing a solution to
just one subsystem (i.e. imessages), I don't find it half as convincing.

If you are saying it *should* be possible to resize shared memory in a
portable way, why not do it for *all* subsystems right away? I still
remember Tom saying it's not something that's doable in a portable way.
Why and how should it be possible on a per-backend basis? How portable
is mmap() really? Why don't we use it in Postgres as of now?

I certainly think that these are orthogonal issues: whether to use fixed
boundaries or to dynamically allocate the memory available is one thing,
dynamic resizing is another. If the later is possible, I'm certainly not
opposed to it. (But would still favor dynamic allocation).

As to efficiency, the process is not much different once the initial
setup is completed.

I fully agree to that.

I'm more concerned about ease of use for developers. Simply being able
to alloc() from shared memory makes things easier than having to invent
a separate allocation method for every subsystem, again and again (akin
to the argument that people are more used to the multi-threaded
programming model).

Doing the extra setup just to send one or two messages
might suck. But maybe that just means this isn't the right mechanism
for those cases (e.g. the existing XID-wraparound logic should still
use signal multiplexing rather than this system). I see the value of
this as being primarily for streaming big chunks of data, not so much
for sending individual, very short messages.

I agree that simple signals don't need a full imessage. But as soon as
you want to send some data (like which database to vacuum), or require
the delivery guarantee (i.e. no single message gets lost, as opposed to
signals), then imessages should be cheap enough.

The current approach uses plain spinlocks, which are more efficient. Note
that both appending to and removing from the queue are write operations,
from the point of view of the queue. So I don't think LWLocks buy you
anything here, either.

I agree that this might not be useful. We don't really have all the
message types defined yet, though, so it's hard to say.

What does the type of lock used have to do with message types? IMO it
doesn't matter what kind of message or what size you want to send. For
appending or removing a pointer to or from a message queue, a spinlock
seems to be just the right thing to use.

I understand the need to limit the amount of data in flight, but I don't
think that sending any type of message should ever block. Messages are
atomic in that regard. Either they are ready to be delivered (in entirety)
or not. Thus the sender needs to hold back the message, if the recipient is
overloaded. (Also note that currently imessages are bound to a maximum size
of around 8 KB).

That's functionally equivalent to blocking, isn't it? I think that's
just a question of what API you want to expose.

Hm.. well, yeah, depends on what level you are arguing. The imessages
API can be used in a completely non-blocking fashion. So a process can
theoretically do other work while waiting for messages.

For parallel querying, the helper/worker backends would probably need to
block, if the origin backend is not ready to accept more data, yes.
However, making it accept and process another job in the mean time seems
hard to do. But not an imessages problem per se. (While with the above
streaming layer I've mentioned, that would not be possible, because that
blocks).

For replication, that might be the case, but for parallel query,
per-queue seems about right. At any rate, no design we've discussed
will let individual queues grow without bound.

Extend parallel querying to multiple nodes and you are back at the same
requirement.

However, it's certainly something that can be done atop imessages. I'm
unsure if doing it as part of imessages is a good thing or not. Given
the above requirement, I don't currently think so. Using multiple queues
with different priorities, as you proposed, would probably make it more
feasible.

You probably need this, but 8KB seems like a pretty small chunk size.

For node-internal messaging, I probably agree. Would need benchmarking,
as it's a compromise between latency and overhead, IMO.

I've chosen 8KB so these messages (together with some GCS and other
transport headers) presumably fit into ethernet jumbo frames. I'd argue
that you'd want even smaller chunk sizes for 1500 byte MTUs, because I
don't expect the GCS to do a better job at fragmenting than we can do
in the upper layer (i.e. without copying data and w/o additional latency
when reassembling the packet). But again, maybe that should be
benchmarked, first.

I think one of the advantages of a per-backend area is that you don't
need to worry so much about fragmentation. If you only need in-order
message delivery, you can just use the whole thing as a big ring
buffer.

Hm.. interesting idea. It's similar to my initial implementation, except
that I had only a single ring-buffer for all backends.

There's no padding or sophisticated allocation needed. You
just need a pointer to the last byte read (P1), the last byte allowed
to be read (P2), and the last byte allocated (P3). Writers take a
spinlock, advance P3, release the spinlock, write the message, take
the spinlock, advance P2, release the spinlock, and signal the reader.

That would block parallel writers (i.e. only one process can write to
the queue at any time).

Readers take the spinlock, read P1 and P2, release the spinlock, read
the data, take the spinlock, advance P1, and release the spinlock.

It would require copying data in case a process only needs to forward
the message. That's a quick pointer dequeue and enqueue exercise ATM.

You might still want to fragment chunks of data to avoid problems if,
say, two writers are streaming data to a single reader. In that case,
if the messages were too large compared to the amount of buffer space
available, you might get poor utilization, or even starvation. But I
would think you wouldn't need to worry about that until the message
size got fairly high.

Some of the writers in Postgres-R allocate the chunk for the message in
shared memory way before they send the message. I.e. during a write
operation of a transaction that needs to be replicated, the backend
allocates space for a message at the start of the operation, but only
fills it with change set data during processing. That can possibly take
quite a while.

Decoupling memory allocation from message queue management allows to do
this without having to copy the data. The same holds true for forwarding
a message.

Well, what I was thinking about is the fact that data messages are
bigger. If I'm writing a 16-byte message once a minute and the reader
and I block each other until the message is fully read or written,
it's not really that big of a deal. If the same thing happens when
we're trying to continuously stream tuple data from one process to
another, it halves the throughput; we expect both processes to be
reading/writing almost constantly.

Agreed. Unlike the proposed ring-buffer approach, the separate allocator
approach doesn't have that problem, because writing itself is fully
parallelized, even to the same recipient.

I think unicast messaging is really useful and I really want it, but
the requirement that it be done through dynamic shared memory
allocations feels very uncomfortable to me (as you've no doubt
gathered).

Well, I on the other hand am utterly uncomfortable with having a
separate solution for memory allocation per sub-system (and it
definitely is an inherent problem in lots of our subsystems). Given the
ubiquity of dynamic memory allocators, I don't really understand your
discomfort.

Thanks for discussing, I always enjoy respectful disagreement.

Regards

Markus Wanner

#17Robert Haas
robertmhaas@gmail.com
In reply to: Markus Wanner (#16)
Re: bg worker: patch 1 of 6 - permanent process

On Fri, Aug 27, 2010 at 2:17 PM, Markus Wanner <markus@bluegap.ch> wrote:

In addition, it means that maximum_message_queue_size_per_backend (or
whatever it's called) can be changed on-the-fly; that is, it can be
PGC_SIGHUP rather than PGC_POSTMASTER.

That's certainly a point. However, as you are proposing a solution to just
one subsystem (i.e. imessages), I don't find it half as convincing.

What other subsystems are you imagining servicing with a dynamic
allocator? If there were a big demand for this functionality, we
probably would have been forced to implement it already, but that's
not the case. We've already discussed the fact that there are massive
problems with using it for something like shared_buffers, which is by
far the largest consumer of shared memory.

If you are saying it *should* be possible to resize shared memory in a
portable way, why not do it for *all* subsystems right away? I still
remember Tom saying it's not something that's doable in a portable way.

I think it would be great if we could bring some more flexibility to
our memory management. There are really two layers of problems here.
One is resizing the segment itself, and one is resizing structures
within the segment. As far as I can tell, there is no portable API
that can be used to resize the shm itself. For so long as that
remains the case, I am of the opinion that any meaningful resizing of
the objects within the shm is basically unworkable. So we need to
solve that problem first.

There are a couple of possible solutions, which have been discussed
here in the past. One very appealing option is to use POSIX shm
rather than sysv shm. AFAICT, it is possible to portably resize a
POSIX shm using ftruncate(), though I am not sure to what extent this
is supported on Windows. One significant advantage of using POSIX shm
is that the default limits for POSIX shm on many operating systems are
much higher than the corresponding limits for sysv shm; in fact, some
people have expressed the opinion that it might be worth making the
switch for that reason alone, since it is no secret that a default
value of 32MB or less for shared_buffers is not enough to get
reasonable performance on many modern systems. I believe, however,
that Tom Lane thinks we need to get a bit more out of it than that to
make it worthwhile. One obstacle to making the switch is that POSIX
shm does not provide a way to fetch the number of processes attached
to the shared memory segment, which is a critical part of our
infrastructure to prevent accidentally running multiple postmasters on
the same data directory at the same time. Consequently, it seems hard
to see how we can make that switch completely. At a minimum, we'll
probably need to maintain a small sysv shm for interlock purposes.

OK, so let's suppose we use POSIX shm for most of the shared memory
segment, and keep only our fixed-size data structures in the sysv shm.
Then what? Well, then we can potentially resize it. Because we are
using a process-based model, this will require some careful
gymnastics. Let's say we're growing the shm. The backend that is
initiating the operation will call ftruncate() and then signal
all of the other backends (using a sinval message or a multiplexed
signal or some such mechanism) to unmap and remap the shared memory
segment. Any failure to remap the shared memory segment is at least a
FATAL for that backend, and very likely a PANIC, so this had better
not be something we plan to do routinely - for example, we wouldn't
want to do this as a way of adapting to changing load conditions. It
would probably be acceptable to do it in a situation such as a
postgresql.conf reload, to accommodate a change in the server
parameter that can't otherwise be changed without a restart, since the
worst case scenario is, well, we have to restart anyway. Once all
that's done, it's safe to start allocating memory from the newly added
portion of the shm. Conversely, if we want to shrink the shm, the
considerations are similar, but we have to do everything in the
opposite order. First, we must ensure that the portion of the shm
we're about to release is unused. Then, we tell all the backends to
unmap and remap it. Once we've confirmed that they have done so, we
ftruncate() it to the new size.

Next, we have to think about how we're going to resize data structures
within this expandable shm. Many of these structures are not things
that we can easily move without bringing the system to a halt. For
example, it's difficult to see how you could change the base address
of shared buffers without ceasing all system activity, at which point
there's not really much advantage over just forcing a restart.
Similarly with LWLocks or the ProcArray. And if you can't move them,
then how will you grow them if (as will likely be the case) there's
something immediately following them in memory. One possible solution
is to divide up these data structures into "slabs". For example, we
might imagine allocating shared_buffers in 1GB chunks. To make this
work, we'd need to change the memory layout so that each chunk would
include all of the miscellaneous stuff that we need to do bookkeeping
for that chunk, such as the LWLocks and buffer descriptors. That
doesn't seem completely impossible, but there would be some
performance penalty, because you could no longer index into shared
buffers from a single base offset. Instead, you'd need to determine
which chunk contains the buffer you want, look up the base address for
that chunk, and then index into the chunk. Maybe that overhead
wouldn't be significant (or maybe it would); at any rate, it's not
completely free. There's also the problem of handling the partial
chunk at the end, especially if that happens to be the only chunk.

I think the problems for other arrays are similar, or more severe. I
can't see, for example, how you could resize the ProcArray using this
approach. If you want to deallocate a chunk of shared buffers, it's
not impossible to imagine an algorithm for relocating any dirty
buffers in the segment to be deallocated into the remaining available
space, and then chucking the ones that are not dirty. It might not be
real cheap, but that's not the same thing as not possible. On the
other hand, changing the backend ID of a process in flight seems
intractable. Maybe it's not. Or maybe there is some other approach
to resizing these data structures that can work, but it's not real
clear to me what it is.

So basically my feeling is that reworking our memory allocation in
general, while possibly worthwhile, is a whole lot of work. If we
focus on getting imessages done in the most direct fashion possible,
it seems like the sort of things that could get done in six months to
a year. If we take the approach of reworking our whole approach to
memory allocation first, I think it will take several years. Assuming
the problems discussed above aren't totally intractable, I'd be in
favor of solving them, because I think we can get some collateral
benefits out of it that would be nice to have. However, it's
definitely a much larger project.

Why
and how should it be possible on a per-backend basis?

If we're designing a completely new subsystem, we have a lot more
design flexibility, because we needn't worry about interactions with
the existing users of shared memory. Resizing an arena that is only
used for imessages is a lot more straightforward than resizing the
main shared memory arena. If you can't remap the main shared memory
chunk, you won't be able to properly clean up your state while
exiting, and so a PANIC is forced. But if you can't remap the
imessages chunk, and particularly if it only contains messages that
were addressed to you, then you should be able to get by with FATAL,
which is certainly a good thing from a system robustness point of
view. And you might not even need to remap it. The main reason
(although perhaps not the only reason) that someone would likely want
to vary a global allocation for parallel query or replication is if
they changed from "not using that feature" to "using it", or perhaps
from "using it" to "using it more heavily". If the allocations are
per-backend and can be made on the fly, that problem goes away.

As long as we keep the shared memory area used for imessages/dynamic
allocation separate from, and independent of, the main shm, we can
still gain many of the same advantages - in particular, not PANICing
if a remap fails, and being able to resize the thing on the fly.
However, I believe that the implementation will be more complex if the
area is not per-backend. Resizing is almost certainly a necessity in
this case, for the reasons discussed above, and that will have to be
done by having all backends unmap and remap the area in a coordinated
fashion, so it will be more disruptive than unmapping and remapping a
message queue for a single backend, where you only need to worry about
the readers and writers for that particular queue. Also, you now have
to worry about fragmentation: a simple ring buffer is great if you're
processing messages on a FIFO basis, but when you have multiple
streams of messages with different destinations, it's probably not a
great solution.

How portable is
mmap() really? Why don't we use it in Postgres as of now?

I believe that mmap() is very portable, though there are other people
on this list who know more about exotic, crufty platforms than I do.
I discussed the question of why it's not used for our current shared
memory segment above - no nattch interlock.

As to efficiency, the process is not much different once the initial
setup is completed.

I fully agree to that.

I'm more concerned about ease of use for developers. Simply being able to
alloc() from shared memory makes things easier than having to invent a
separate allocation method for every subsystem, again and again (akin to
the argument that people are more used to the multi-threaded programming
model).

This goes back to my points further up: what else do you think this
could be used for? I'm much less optimistic about this being reusable
than you are, and I'd like to hear some concrete examples of other use
cases.

The current approach uses plain spinlocks, which are more efficient. Note
that both appending to and removing from the queue are write operations,
from the point of view of the queue. So I don't think LWLocks
buy you anything here, either.

I agree that this might not be useful.  We don't really have all the
message types defined yet, though, so it's hard to say.

What does the type of lock used have to do with message types? IMO it
doesn't matter what kind of message or what size you want to send. For
appending or removing a pointer to or from a message queue, a spinlock seems
to be just the right thing to use.

Well, it's certainly nice, if you can make it work. I haven't really
thought about all the cases, though. The main advantages of LWLocks
is that you can take them in either shared or exclusive mode, and that
you can hold them for more than a handful of instructions. If we're
trying to design a really *simple* system for message passing, LWLocks
might be just right. Take the lock, read or write the message,
release the lock. But it seems like that's not really the case we're
trying to optimize for, so this may be a dead-end.

You probably need this, but 8KB seems like a pretty small chunk size.

For node-internal messaging, I probably agree. Would need benchmarking, as
it's a compromise between latency and overhead, IMO.

I've chosen 8KB so these messages (together with some GCS and other
transport headers) presumably fit into ethernet jumbo frames. I'd argue that
you'd want even smaller chunk sizes for 1500 byte MTUs, because I don't
expect the GCS to do a better job at fragmenting than we can do in the
upper layer (i.e. without copying data and w/o additional latency when
reassembling the packet). But again, maybe that should be benchmarked,
first.

Yeah, probably. I think designing something that works efficiently
over a network is a somewhat different problem than designing
something that works on an individual node, and we probably shouldn't
let the designs influence each other too much.

There's no padding or sophisticated allocation needed.  You
just need a pointer to the last byte read (P1), the last byte allowed
to be read (P2), and the last byte allocated (P3).  Writers take a
spinlock, advance P3, release the spinlock, write the message, take
the spinlock, advance P2, release the spinlock, and signal the reader.

That would block parallel writers (i.e. only one process can write to the
queue at any time).

I feel like there's probably some variant of this idea that works
around that problem. The problem is that when a worker finishes
writing a message, he needs to know whether to advance P2 only over
his own message or also over some subsequent message that has been
fully written in the meantime. I don't know exactly how to solve that
problem off the top of my head, but it seems like it might be
possible.

Readers take the spinlock, read P1 and P2, release the spinlock, read
the data, take the spinlock, advance P1, and release the spinlock.

It would require copying data in case a process only needs to forward the
message. That's a quick pointer dequeue and enqueue exercise ATM.

If we need to do that, that's a compelling argument for having a
single messaging area rather than one per backend. But I'm not sure I
see why we would need that sort of capability. Why wouldn't you just
arrange for the sender to deliver the message directly to the final
recipient?

You might still want to fragment chunks of data to avoid problems if,
say, two writers are streaming data to a single reader.  In that case,
if the messages were too large compared to the amount of buffer space
available, you might get poor utilization, or even starvation.  But I
would think you wouldn't need to worry about that until the message
size got fairly high.

Some of the writers in Postgres-R allocate the chunk for the message in
shared memory way before they send the message. I.e. during a write
operation of a transaction that needs to be replicated, the backend
allocates space for a message at the start of the operation, but only fills
it with change set data during processing. That can possibly take quite a
while.

So, they know in advance how large the message will be but not what
the contents will be? What are they doing?

I think unicast messaging is really useful and I really want it, but
the requirement that it be done through dynamic shared memory
allocations feels very uncomfortable to me (as you've no doubt
gathered).

Well, I on the other hand am utterly uncomfortable with having a separate
solution for memory allocation per sub-system (and it definitely is an
inherent problem in lots of our subsystems). Given the ubiquity of dynamic
memory allocators, I don't really understand your discomfort.

Well, the fact that something is commonly used doesn't mean it's right
for us. Tabula rasa, we might design the whole system differently,
but changing it now is not to be undertaken lightly. Hopefully the
above comments shed some light on my concerns. In short, (1) I don't
want to preallocate a big chunk of memory we might not use, (2) I fear
reducing the overall robustness of the system, and (3) I'm uncertain
what other systems would be able to leverage a dynamic allocator of the
sort you propose.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#18Markus Wanner
markus@bluegap.ch
In reply to: Robert Haas (#17)
Re: bg worker: patch 1 of 6 - permanent process

Hi,

On 08/27/2010 10:46 PM, Robert Haas wrote:

What other subsystems are you imagining servicing with a dynamic
allocator? If there were a big demand for this functionality, we
probably would have been forced to implement it already, but that's
not the case. We've already discussed the fact that there are massive
problems with using it for something like shared_buffers, which is by
far the largest consumer of shared memory.

Understood. I certainly plan to look into that for a better
understanding of the problems those pose for dynamically allocated memory.

I think it would be great if we could bring some more flexibility to
our memory management. There are really two layers of problems here.

Full ACK.

One is resizing the segment itself, and one is resizing structures
within the segment. As far as I can tell, there is no portable API
that can be used to resize the shm itself. For so long as that
remains the case, I am of the opinion that any meaningful resizing of
the objects within the shm is basically unworkable. So we need to
solve that problem first.

Why should resizing of the objects within the shmem be unworkable?
Don't my patches prove the exact opposite? Being able to resize
"objects" within the shm requires some kind of underlying dynamic
allocation. And I rather like to be in control of that allocator than
having to deal with two dozen different implementations on different
OSes and their libraries.

There are a couple of possible solutions, which have been discussed
here in the past.

I currently don't have much interest in dynamic resizing. Being able to
resize the overall amount of shared memory on the fly would be nice,
sure. But the total amount of RAM in a server changes rather
infrequently. Being able to use what's available more efficiently is
what I'm interested in. That doesn't need any kind of additional or
different OS level support. It's just a matter of making better use of
what's available - within Postgres itself.

Next, we have to think about how we're going to resize data structures
within this expandable shm.

Okay, that's where I'm getting interested.

Many of these structures are not things
that we can easily move without bringing the system to a halt. For
example, it's difficult to see how you could change the base address
of shared buffers without ceasing all system activity, at which point
there's not really much advantage over just forcing a restart.
Similarly with LWLocks or the ProcArray.

I guess that's what Bruce wanted to point out by saying our data
structures are mostly "contiguous", i.e. not dynamic lists or hash
tables, but plain simple arrays.

Maybe that's a subjective impression, but I seem to hear complaints
about their fixed size and inflexibility quite often. Try to imagine the
flexibility that dynamic lists could give us.

And if you can't move them,
then how will you grow them if (as will likely be the case) there's
something immediately following them in memory. One possible solution
is to divide up these data structures into "slabs". For example, we
might imagine allocating shared_buffers in 1GB chunks.

Why 1GB, and then do yet another layer of dynamic allocation within
that? The buffers are (by default) 8K, so allocate in chunks of 8K, or
a tiny bit more for all of the bookkeeping stuff.

To make this
work, we'd need to change the memory layout so that each chunk would
include all of the miscellaneous stuff that we need to do bookkeeping
for that chunk, such as the LWLocks and buffer descriptors. That
doesn't seem completely impossible, but there would be some
performance penalty, because you could no longer index into shared
buffers from a single base offset.

AFAICT we currently have four fixed-size blocks to manage shared
buffers: the buffer blocks themselves, the buffer descriptors, the
strategy status (for the freelist), and the buffer lookup table.

It's not obvious to me why these data structures should perform better
than a dynamically allocated layout. One could rather argue that
combining (some of) the bookkeeping stuff with the data itself would
lead to better locality, and thus perform better.

Instead, you'd need to determine
which chunk contains the buffer you want, look up the base address for
that chunk, and then index into the chunk. Maybe that overhead
wouldn't be significant (or maybe it would); at any rate, it's not
completely free. There's also the problem of handling the partial
chunk at the end, especially if that happens to be the only chunk.

This sounds way too complicated, yes. Use 8K chunks and most of the
problems vanish.

I think the problems for other arrays are similar, or more severe. I
can't see, for example, how you could resize the ProcArray using this
approach.

Try not to think in terms of resizing, but of dynamic allocation.
Actually being able to resize the ProcArray (and thus to alter
max_connections on the fly) would take a lot more work.

Just using the unoccupied space of the ProcArray for other subsystems
that need it more urgently could be done much more easily. Again, you'd
want to allocate a single PGPROC at a time.

(And yes, the benefits aren't as significant as for shared_buffers,
simply because PGPROC doesn't occupy that much memory).

If you want to deallocate a chunk of shared buffers, it's
not impossible to imagine an algorithm for relocating any dirty
buffers in the segment to be deallocated into the remaining available
space, and then chucking the ones that are not dirty.

Please use the dynamic allocator for that. Don't duplicate that again.
Those allocators are designed for efficiently allocating small chunks,
down to a few bytes.

It might not be
real cheap, but that's not the same thing as not possible. On the
other hand, changing the backend ID of a process in flight seems
intractable. Maybe it's not. Or maybe there is some other approach
to resizing these data structures that can work, but it's not real
clear to me what it is.

Changing to a dynamically allocated memory model certainly requires some
thought and lots of work. Yes. It's not for free.

So basically my feeling is that reworking our memory allocation in
general, while possibly worthwhile, is a whole lot of work.

Exactly.

If we
focus on getting imessages done in the most direct fashion possible,
it seems like the sort of things that could get done in six months to
a year.

Well, it works for Postgres-R as it is, so imessages already exists,
without a single additional month of work. And I don't intend to change
it back to something that can't use a dynamic allocator. I already ran
into too many problems that way; see below.

If we take the approach of reworking our whole approach to
memory allocation first, I think it will take several years. Assuming
the problems discussed above aren't totally intractable, I'd be in
favor of solving them, because I think we can get some collateral
benefits out of it that would be nice to have. However, it's
definitely a much larger project.

Agreed.

If the allocations are
per-backend and can be made on the fly, that problem goes away.

That might hold true for imessages, which simply lose their importance
once the recipient backend vanishes. But for other shared memory stuff,
that would rather complicate shared memory access.

As long as we keep the shared memory area used for imessages/dynamic
allocation separate from, and independent of, the main shm, we can
still gain many of the same advantages - in particular, not PANICing
if a remap fails, and being able to resize the thing on the fly.

Separate sub-system allocators, separate code, separate bugs, lots more
work. Please not. KISS.

However, I believe that the implementation will be more complex if the
area is not per-backend. Resizing is almost certainly a necessity in
this case, for the reasons discussed above

I disagree, and see the main point in making better use of the available
resources. Resizing loses much of its importance once you can
dynamically adjust the boundaries between the subsystems' shares of the
single, huge, fixed-size shmem chunk allocated at start.

and that will have to be
done by having all backends unmap and remap the area in a coordinated
fashion,

That's assuming resizing capability.

so it will be more disruptive than unmapping and remapping a
message queue for a single backend, where you only need to worry about
the readers and writers for that particular queue.

And that's assuming a separate allocation method for the imessages
sub-system.

Also, you now have
to worry about fragmentation: a simple ring buffer is great if you're
processing messages on a FIFO basis, but when you have multiple
streams of messages with different destinations, it's probably not a
great solution.

Exactly, that's where dynamic allocation shows its real advantages. No
silly ring buffers required.

This goes back to my points further up: what else do you think this
could be used for? I'm much less optimistic about this being reusable
than you are, and I'd like to hear some concrete examples of other use
cases.

Sure. And well understood. I'll take a stab at converting
shared_buffers.

Well, it's certainly nice, if you can make it work. I haven't really
thought about all the cases, though. The main advantage of LWLocks
is that you can take them in either shared or exclusive mode

As mentioned, the message queue only ever sees exclusive accesses (both
enqueue and dequeue modify it), so shared mode is just unneeded overhead.

and that
you can hold them for more than a handful of instructions.

Neither of the two operations needs more than a handful of instructions,
so that's plain overhead as well.

If we're
trying to design a really *simple* system for message passing, LWLocks
might be just right. Take the lock, read or write the message,
release the lock.

That's exactly how easy it is *with* the dynamic allocator: take the
(even simpler) spinlock, enqueue (or dequeue) the message, release the
lock again.

No locking is required for writing or reading the message itself.
Independent (and perfectly multi-process safe) alloc and free routines
handle memory management; they get called *before* writing the message
and *after* reading it.

Mingling memory allocation with queue management is a lot more
complicated to design and understand. And it's less efficient.


But it seems like that's not really the case we're
trying to optimize for, so this may be a dead-end.

You probably need this, but 8KB seems like a pretty small chunk size.

For node-internal messaging, I probably agree. Would need benchmarking, as
it's a compromise between latency and overhead, IMO.

I've chosen 8KB so these messages (together with some GCS and other
transport headers) presumably fit into Ethernet jumbo frames. I'd argue
that you'd want even smaller chunk sizes for 1500 byte MTUs, because I
don't expect the GCS to do a better job at fragmenting than we can do in
the upper layer (i.e. without copying data and without additional
latency when reassembling the packet). But again, maybe that should be
benchmarked first.

Yeah, probably. I think designing something that works efficiently
over a network is a somewhat different problem than designing
something that works on an individual node, and we probably shouldn't
let the designs influence each other too much.

There's no padding or sophisticated allocation needed. You
just need a pointer to the last byte read (P1), the last byte allowed
to be read (P2), and the last byte allocated (P3). Writers take a
spinlock, advance P3, release the spinlock, write the message, take
the spinlock, advance P2, release the spinlock, and signal the reader.

That would block parallel writers (i.e. only one process can write to the
queue at any time).

I feel like there's probably some variant of this idea that works
around that problem. The problem is that when a worker finishes
writing a message, he needs to know whether to advance P2 only over
his own message or also over some subsequent message that has been
fully written in the meantime. I don't know exactly how to solve that
problem off the top of my head, but it seems like it might be
possible.

Readers take the spinlock, read P1 and P2, release the spinlock, read
the data, take the spinlock, advance P1, and release the spinlock.

It would require copying data in case a process only needs to forward the
message. That's a quick pointer dequeue and enqueue exercise ATM.

If we need to do that, that's a compelling argument for having a
single messaging area rather than one per backend. But I'm not sure I
see why we would need that sort of capability. Why wouldn't you just
arrange for the sender to deliver the message directly to the final
recipient?

You might still want to fragment chunks of data to avoid problems if,
say, two writers are streaming data to a single reader. In that case,
if the messages were too large compared to the amount of buffer space
available, you might get poor utilization, or even starvation. But I
would think you wouldn't need to worry about that until the message
size got fairly high.

Some of the writers in Postgres-R allocate the chunk for the message in
shared memory way before they send the message. I.e. during a write
operation of a transaction that needs to be replicated, the backend
allocates space for a message at the start of the operation, but only fills
it with change set data during processing. That can possibly take quite a
while.

So, they know in advance how large the message will be but not what
the contents will be? What are they doing?

I think unicast messaging is really useful and I really want it, but
the requirement that it be done through dynamic shared memory
allocations feels very uncomfortable to me (as you've no doubt
gathered).

Well, I on the other hand am utterly uncomfortable with having a
separate memory allocation solution per sub-system (and memory
allocation definitely is a problem inherent to lots of our subsystems).
Given the ubiquity of dynamic memory allocators, I don't really
understand your discomfort.

Well, the fact that something is commonly used doesn't mean it's right
for us. Tabula rasa, we might design the whole system differently,
but changing it now is not to be undertaken lightly. Hopefully the
above comments shed some light on my concerns. In short, (1) I don't
want to preallocate a big chunk of memory we might not use, (2) I fear
reducing the overall robustness of the system, and (3) I'm uncertain
what other systems would be able to leverage a dynamic allocator of the
sort you propose.

#19Markus Wanner
markus@bluegap.ch
In reply to: Robert Haas (#17)
Re: bg worker: patch 1 of 6 - permanent process

(Sorry, need to disable Ctrl-Return, which quite often sends mails
earlier than I really want.. continuing my mail)

On 08/27/2010 10:46 PM, Robert Haas wrote:

Yeah, probably. I think designing something that works efficiently
over a network is a somewhat different problem than designing
something that works on an individual node, and we probably shouldn't
let the designs influence each other too much.

Agreed. Thus I've left out any kind of congestion avoidance stuff from
imessages so far.

There's no padding or sophisticated allocation needed. You
just need a pointer to the last byte read (P1), the last byte allowed
to be read (P2), and the last byte allocated (P3). Writers take a
spinlock, advance P3, release the spinlock, write the message, take
the spinlock, advance P2, release the spinlock, and signal the reader.

That would block parallel writers (i.e. only one process can write to the
queue at any time).

I feel like there's probably some variant of this idea that works
around that problem. The problem is that when a worker finishes
writing a message, he needs to know whether to advance P2 only over
his own message or also over some subsequent message that has been
fully written in the meantime. I don't know exactly how to solve that
problem off the top of my head, but it seems like it might be
possible.

I've tried pretty much that before. And failed. Because the
allocation-order (i.e. the time the message gets created in preparation
for writing to it) isn't necessarily the same as the sending-order (i.e.
when the process has finished writing and decides to send the message).

To satisfy the FIFO property WRT the sending order, you need to decouple
allocation from the ordering (i.e. the queuing logic).

(And yes, it took me a while to figure out what was wrong in
Postgres-R before I even noticed that design bug.)

Readers take the spinlock, read P1 and P2, release the spinlock, read
the data, take the spinlock, advance P1, and release the spinlock.

It would require copying data in case a process only needs to forward the
message. That's a quick pointer dequeue and enqueue exercise ATM.

If we need to do that, that's a compelling argument for having a
single messaging area rather than one per backend.

Absolutely, yes.

But I'm not sure I
see why we would need that sort of capability. Why wouldn't you just
arrange for the sender to deliver the message directly to the final
recipient?

A process can read and even change the data of the message before
forwarding it. Something the coordinator in Postgres-R does sometimes.
(As it is the interface to the GCS and thus to the rest of the nodes in
the cluster).

For parallel querying (on a single node) that's probably less important
a feature.

So, they know in advance how large the message will be but not what
the contents will be? What are they doing?

Filling the message until it's (mostly) full and then continuing with
the next one. At least that's how the streaming approach on top of
imessages works.

But yes, it's somewhat annoying to have to know the message size in
advance. I haven't implemented realloc so far, nor can I think of any
other solution. Note that separating allocation from queue ordering is
required anyway, for the above reasons.

Well, the fact that something is commonly used doesn't mean it's right
for us. Tabula rasa, we might design the whole system differently,
but changing it now is not to be undertaken lightly. Hopefully the
above comments shed some light on my concerns. In short, (1) I don't
want to preallocate a big chunk of memory we might not use,

Isn't that exactly what we do now for lots of sub-systems? That's what
I'd like to improve (i.e. reduce it all to a single big chunk).

(2) I fear
reducing the overall robustness of the system, and

Well, that applies to pretty much every new feature you add.

(3) I'm uncertain
what other systems would be able to leverage a dynamic allocator of the
sort you propose.

Okay, it's up to me to show evidence (or at least a PoC).

Regards

Markus Wanner

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Markus Wanner (#18)
Re: bg worker: patch 1 of 6 - permanent process

Markus Wanner <markus@bluegap.ch> writes:

AFAICT we currently have four fixed-size blocks to manage shared
buffers: the buffer blocks themselves, the buffer descriptors, the
strategy status (for the freelist), and the buffer lookup table.

It's not obvious to me how these data structures should perform better
than a dynamically allocated layout.

Let me just point out that awhile back we got a *measurable* performance
boost by eliminating a single indirect fetch from the buffer addressing
code path. We used to have an array of pointers pointing to the actual
buffers, and we removed that in favor of assuming the buffers were
laid out in a contiguous array, so that the address of buffer N could be
computed with a shift-and-add, eliminating the pointer fetch. I forget
exactly what the numbers were, but it was significant enough to make us
change it.

So I don't have any faith in untested assertions that we can convert
these data structures to use dynamic allocation with no penalty.
It's very difficult to see how you'd do that without introducing a
new layer of indirection, and our experience shows that that layer
will cost you.

regards, tom lane
