Changeset Extraction v7.0 (was logical changeset generation)
Hello Everyone,
I am pleased to announce the next version of the changeset
extraction feature.
There are more changes than I can remember, but here's what comes to my
tired mind:
* Initial userlevel docs (including an example session!).
* Generalization of the "replication slot" system to also work for
streaming rep, although the user interface for that is mostly missing.
* Don't remove WAL still required by a replication slot, be it a slot for
changeset extraction or streaming rep.
* New output plugin interface with one _PG_output_plugin_init dlsym()ed
function filling out the callbacks.
* Renaming of the init and _cleanup output plugins to startup and shutdown.
* Simplification of the prepare_write/write interface for output plugins
(no need to specify LSNs anymore).
* Renaming of the changeset extraction operations to
create_replication_slot/drop_replication_slot/start_replication
... logical.
* Moving the SQL changeset functions from a contrib module into core.
* Addition of peeking functions for changeset extraction.
* Revised error messages.
* Revised comments.
* ...
I've followed Robert's wishes with generalizing the replication slot
interface to not only work for changeset generation, but also streaming
rep - not sure whether that was the right choice; it's been more work
than I expected and blocked things a bit, but we're there now.
There's no client-side support included except for the pg_receivexlog
hack attached as the last patch, but I also have tested it via streaming
rep and it mostly works (minus a hot_standby_feedback bug, will report
tomorrow).
What I think is missing:
* The user docs need more work, even though we're in a much better state
than before.
* Replication slots are stored in binary files. I think it might make
sense to store them as text files instead, for easier
extensibility. Especially since we want to use them for streaming rep,
I am pretty sure new attributes will soon come. I don't think it's
critical enough performance-wise to store them in binary.
* Contrary to what Robert and I'd discussed I've named the SQL functions
outputting changes decoding_slot_(get|peek)_[binary_]changes instead of
decoding_stream_* - I can change that, but the SQL functions don't
actually support streaming, so I thought that might be
confusing. Opinions?
* Robert complained earlier about the way historical catalog snapshots
are forced in tqual.c - I don't really know yet what the better way
would be here.
* Some functionality probably needs to move between the patches - it's a
bit hard to see where the best boundaries are.
The sources are in my git tree at:
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summary
branch xlog-decoding-rebasing-remapping.
The last two patches are *not* intended to be actually applied, but are
useful for testing.
If you want to test, you'll need a clean initdb, set wal_level=logical
and max_replication_slots>0. There's an example SQL session showing how
things can be used....
Testing, Review, Questions welcome!
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
This 0001 patch, to log running transactions more frequently, has been
pending for a long time now, and I haven't heard any objections, so
I've gone ahead and committed that part.
...Robert
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Review of patch 0002:
- I think you should just regard ReplicationSlotCtlLock as protecting
the "name" and "active" flags of every slot. ReplicationSlotCreate()
would then not need to mess with the spinlocks at all, and
ReplicationSlotAcquire and ReplicationSlotDrop get a bit simpler too I
think. Functions like ReplicationSlotsComputeRequiredXmin() can
acquire this lock in shared mode and only lock the slots that are
actually active, instead of all of them.
- If you address /* FIXME: apply sanity checking to slot name */, then
I think that also addresses /* XXX: do we want to use truncate
identifier instead? */. In other words, let's just error out if the
name is too long. I'm not sure what other sanity checking is needed
here; maybe just restrict it to 7-bit characters to avoid problems
with encodings for individual databases varying.
- ReplicationSlotAcquire probably needs to ignore slots that are not active.
- ReplicationSlotAcquire should be tweaked so that the code that holds
the spinlock is more self-contained. If you adopt the above-proposed
recasting of ReplicationSlotCtlLock, then the part that holds the
spinlock can probably look like this: SpinLockAcquire(&slot->mutex);
was_active = slot->active; slot->active = true;
SpinLockRelease(&slot->mutex), which looks quite a bit safer.
- If there's a coding rule that slot->database can't be changed while
the slot is active, then the check to make sure that the user isn't
trying to bind to a slot with a mis-matching database could be done
before the code described in the previous point, avoiding the need to
go back and release the resource.
- I think the critical section in ReplicationSlotDrop is bogus. If
DeleteSlot() fails, we scarcely need to PANIC. The slot just isn't
gone.
- cancel_before_shmem_exit is only guaranteed to remove the
most-recently-added callback.
- Why does ReplicationSlotsComputeRequiredXmin() need to hold
ProcArrayLock at all?
- ReplicationSlotsComputeRequiredXmin scarcely needs to hold the
spinlock while it does all of those gyrations. It can just acquire
the spinlock, copy the three fields needed into local variables, and
release the spinlock. The rest can be worked out afterwards.
Similarly in ReplicationSlotsComputeRequiredLSN.
- A comment in KillSlot wonders whether locking is required. I say
yes. It's safe to take lwlocks and spinlocks during shmem exit, and
failing to do so seems like a recipe for subtle corner-case bugs.
- pg_get_replication_slots() wonders what permissions we require. I
don't know that any special permissions are needed here; the data
we're exposing doesn't appear to be sensitive. Unless I'm missing
something?
- PG_STAT_GET_LOGICAL_DECODING_SLOTS_COLS has a leftover "logical" in its name.
- There seems to be no interface to acquire or release slots from
either SQL or the replication protocol, nor any way for a client of
this code to update its slot details. The value of
catalog_xmin/data_xmin vs. effective_catalog_xmin/effective_data_xmin
is left to the imagination.
...Robert
Hi,
On 2014-01-15 13:28:25 -0500, Robert Haas wrote:
- I think you should just regard ReplicationSlotCtlLock as protecting
the "name" and "active" flags of every slot. ReplicationSlotCreate()
would then not need to mess with the spinlocks at all, and
ReplicationSlotAcquire and ReplicationSlotDrop get a bit simpler too I
think. Functions like ReplicationSlotsComputeRequiredXmin() can
acquire this lock in shared mode and only lock the slots that are
actually active, instead of all of them.
I first thought you meant that we should get rid of the spinlock, but
after rereading I think you just mean that ->name, ->active, ->in_use
are only allowed to change while holding the lwlock exclusively so we
don't need to spinlock in those cases? If so, yes, that works for me.
- If you address /* FIXME: apply sanity checking to slot name */, then
I think that also addresses /* XXX: do we want to use truncate
identifier instead? */. In other words, let's just error out if the
name is too long. I'm not sure what other sanity checking is needed
here; maybe just restrict it to 7-bit characters to avoid problems
with encodings for individual databases varying.
Yea, erroring out seems like a good idea. But I think we need to
restrict slot names a bit more than that, given they are used as
filenames... We could instead name the files using the slot's offset,
but I'd prefer to not go that route.
- ReplicationSlotAcquire probably needs to ignore slots that are not active.
Not sure what you mean? If the slot isn't in_use we'll skip it in the loop.
- If there's a coding rule that slot->database can't be changed while
the slot is active, then the check to make sure that the user isn't
trying to bind to a slot with a mis-matching database could be done
before the code described in the previous point, avoiding the need to
go back and release the resource.
I don't think slot->database should be allowed to change at all...
- I think the critical section in ReplicationSlotDrop is bogus. If
DeleteSlot() fails, we scarcely need to PANIC. The slot just isn't
gone.
Well, if delete slot fails, we don't really know at which point it
failed which means that the on-disk state might not correspond to the
in-memory state. I don't see a point in adding code trying to handle
that case correctly...
- cancel_before_shmem_exit is only guaranteed to remove the
most-recently-added callback.
Yea :(. I think that's safe for the current usages but seems mighty
fragile... Not sure what to do about it. Just register KillSlot once and
keep it registered?
- Why does ReplicationSlotsComputeRequiredXmin() need to hold
ProcArrayLock at all?
There's reasoning, but I just noticed that its basis might be flawed
anyway :(.
When starting changeset extraction in a new slot we need to guarantee
that we only start decoding records we know the catalog tuples haven't
been removed for.
So, when creating the slot I've so far done a GetOldestXmin() and used
that to check against xl_running_xact->oldestRunningXid. But
GetOldestXmin() can go backwards...
I'll think a bit and try to simplify this.
- ReplicationSlotsComputeRequiredXmin scarcely needs to hold the
spinlock while it does all of those gyrations. It can just acquire
the spinlock, copy the three fields needed into local variables, and
release the spinlock. The rest can be worked out afterwards.
Similarly in ReplicationSlotsComputeRequiredLSN.
Yea, will change.
- A comment in KillSlot wonders whether locking is required. I say
yes. It's safe to take lwlocks and spinlocks during shmem exit, and
failing to do so seems like a recipe for subtle corner-case bugs.
I agree that it's safe to use spinlocks, but lwlocks? What if we are
erroring out while holding an lwlock? Code that runs inside a
TransactionCommand is protected against that, but if not, ProcKill(),
which invokes LWLockReleaseAll(), runs pretty late in the teardown
process...
- pg_get_replication_slots() wonders what permissions we require. I
don't know that any special permissions are needed here; the data
we're exposing doesn't appear to be sensitive. Unless I'm missing
something?
I don't see a problem either, but it seems others have -
pg_stat_replication only displays minor amounts of information if one
doesn't have the replication privilege... Not sure what the reasoning
there is, and whether it applies here as well.
- There seems to be no interface to acquire or release slots from
either SQL or the replication protocol, nor any way for a client of
this code to update its slot details.
I don't think either ever needs to do that - START_REPLICATION SLOT slot
...; and decoding_slot_*changes will acquire/release for them while
active. What would the use case be to allow them to be acquired from SQL?
The slot details get updated by the respective replication code. For
streaming rep, that should happen via reply and feedback messages. For
changeset extraction it happens when LogicalConfirmReceivedLocation() is
called; the walsender interface does that using reply messages, the SQL
interface calls it when finished (unless you use the _peek_ functions).
The value of
catalog_xmin/data_xmin vs. effective_catalog_xmin/effective_data_xmin
is left to the imagination.
There's a comment about them in a following patch. Basically the reason
is that for changeset extraction we cannot adjust the in-memory value
before we know the changed slot status is safely synced to
disk. Otherwise a client could restart streaming at a LSN where the
corresponding catalog details are gone since it's not prevented by the
slot anymore.
That could be done by just holding some lock forbidding the global xmin
value from being recomputed while writing to disk, but that seems awfully
heavy-handed. So the protocol is:
1) update ->catalog_xmin to the new xmin,
2) sync slot to disk
3) set ->effective_catalog_xmin = ->catalog_xmin
4) ReplicationSlotsComputeRequiredXmin()
Since ComputeRequiredXmin() only looks at effective_catalog_xmin that
guarantees that the global xmin horizon doesn't increase before the
slot has been synced to disk. If we crash after 2,
StartupReplicationSlots() simply sets effective_catalog_xmin =
catalog_xmin, we know it's safely on disk now since we've just read it
from there.
Thanks for committing 0001!
Regards,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jan 15, 2014 at 3:39 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-01-15 13:28:25 -0500, Robert Haas wrote:
- I think you should just regard ReplicationSlotCtlLock as protecting
the "name" and "active" flags of every slot. ReplicationSlotCreate()
would then not need to mess with the spinlocks at all, and
ReplicationSlotAcquire and ReplicationSlotDrop get a bit simpler too I
think. Functions like ReplicationSlotsComputeRequiredXmin() can
acquire this lock in shared mode and only lock the slots that are
actually active, instead of all of them.

I first thought you meant that we should get rid of the spinlock, but
after rereading I think you just mean that ->name, ->active, ->in_use
are only allowed to change while holding the lwlock exclusively so we
don't need to spinlock in those cases? If so, yes, that works for me.
Yeah, that's about what I had in mind.
- If you address /* FIXME: apply sanity checking to slot name */, then
I think that also addresses /* XXX: do we want to use truncate
identifier instead? */. In other words, let's just error out if the
name is too long. I'm not sure what other sanity checking is needed
here; maybe just restrict it to 7-bit characters to avoid problems
with encodings for individual databases varying.

Yea, erroring out seems like a good idea. But I think we need to
restrict slot names a bit more than that, given they are used as
filenames... We could instead name the files using the slot's offset,
but I'd prefer to not go that route.
OK. Well, add some code, then. :-)
- ReplicationSlotAcquire probably needs to ignore slots that are not active.
Not sure what you mean? If the slot isn't in_use we'll skip it in the loop.
active != in_use.
I suppose your point is that the slot can't be in_use if it's not also
active. Maybe it would be better to get rid of active/in_use and have
three states: REPLSLOT_CONNECTED, REPLSLOT_NOT_CONNECTED,
REPLSLOT_FREE. Or something like that.
- If there's a coding rule that slot->database can't be changed while
the slot is active, then the check to make sure that the user isn't
trying to bind to a slot with a mis-matching database could be done
before the code described in the previous point, avoiding the need to
go back and release the resource.

I don't think slot->database should be allowed to change at all...
Well, it can if the slot is dropped and a new one created.
- I think the critical section in ReplicationSlotDrop is bogus. If
DeleteSlot() fails, we scarcely need to PANIC. The slot just isn't
gone.

Well, if delete slot fails, we don't really know at which point it
failed which means that the on-disk state might not correspond to the
in-memory state. I don't see a point in adding code trying to handle
that case correctly...
Deleting the slot should be an atomic operation. There's some
critical point before which the slot will be picked up by recovery and
after which it won't. You either did that operation, or not, and can
adjust the in-memory state accordingly.
- cancel_before_shmem_exit is only guaranteed to remove the
most-recently-added callback.

Yea :(. I think that's safe for the current usages but seems mighty
fragile... Not sure what to do about it. Just register KillSlot once and
keep it registered?
Yep. Use a module-private flag to decide whether it needs to do anything.
- A comment in KillSlot wonders whether locking is required. I say
yes. It's safe to take lwlocks and spinlocks during shmem exit, and
failing to do so seems like a recipe for subtle corner-case bugs.

I agree that it's safe to use spinlocks, but lwlocks? What if we are
erroring out while holding an lwlock? Code that runs inside a
TransactionCommand is protected against that, but if not ProcKill()
which invokes LWLockReleaseAll() runs pretty late in the teardown
process...
Hmm. I guess it'd be fine to decide that a connected slot can be
marked not-connected without the lock. I think you'd want a rule that
a slot can't be freed except when it's not-connected; otherwise, you
might end up marking the slot not-connected after someone else had
already recycled it for an unrelated purpose (drop slot, create new
slot).
- There seems to be no interface to acquire or release slots from
either SQL or the replication protocol, nor any way for a client of
this code to update its slot details.

I don't think either ever needs to do that - START_REPLICATION SLOT slot
...; and decoding_slot_*changes will acquire/release for them while
active. What would the use case be to allow them to be acquired from SQL?
My point isn't so much about SQL as that with just this patch I don't
see any way for anyone to ever acquire a slot for anything, ever. So
I think there's a piece missing, or three.
The slot details get updated by the respective replication code. For
streaming rep, that should happen via reply and feedback messages. For
changeset extraction it happens when LogicalConfirmReceivedLocation() is
called; the walsender interface does that using reply messages, the SQL
interface calls it when finished (unless you use the _peek_ functions).
Right, but where is this code? I don't see this updating the reply
and feedback message processing code to touch slots. Did I miss that?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2014-01-16 09:34:51 -0500, Robert Haas wrote:
- ReplicationSlotAcquire probably needs to ignore slots that are not active.
Not sure what you mean? If the slot isn't in_use we'll skip it in the loop.
active != in_use.
I suppose your point is that the slot can't be in_use if it's not also
active.
Yes. There's asserts to that end...
Maybe it would be better to get rid of active/in_use and have
three states: REPLSLOT_CONNECTED, REPLSLOT_NOT_CONNECTED,
REPLSLOT_FREE. Or something like that.
Hm. Color me unenthusiastic. If you feel strongly I can change it, but
otherwise not.
- If there's a coding rule that slot->database can't be changed while
the slot is active, then the check to make sure that the user isn't
trying to bind to a slot with a mis-matching database could be done
before the code described in the previous point, avoiding the need to
go back and release the resource.

I don't think slot->database should be allowed to change at all...
Well, it can if the slot is dropped and a new one created.
Well. That obviously requires the lwlock to be acquired...
- I think the critical section in ReplicationSlotDrop is bogus. If
DeleteSlot() fails, we scarcely need to PANIC. The slot just isn't
gone.

Well, if delete slot fails, we don't really know at which point it
failed which means that the on-disk state might not correspond to the
in-memory state. I don't see a point in adding code trying to handle
that case correctly...

Deleting the slot should be an atomic operation. There's some
critical point before which the slot will be picked up by recovery and
after which it won't. You either did that operation, or not, and can
adjust the in-memory state accordingly.
I am not sure I understand that point. We can either update the
in-memory bit before performing the on-disk operations or
afterwards. Either way, there's a way to be inconsistent if the disk
operation fails somewhere in between (it might fail but still have
deleted the file/directory!). The normal way to handle that in other
places is PANICing when we don't know so we recover from the on-disk
state.
I really don't see the problem here? Code doesn't get more robust by
doing s/PANIC/ERROR/, rather the contrary. It takes extra smarts to only
ERROR, often that's not warranted.
- A comment in KillSlot wonders whether locking is required. I say
yes. It's safe to take lwlocks and spinlocks during shmem exit, and
failing to do so seems like a recipe for subtle corner-case bugs.

I agree that it's safe to use spinlocks, but lwlocks? What if we are
erroring out while holding an lwlock? Code that runs inside a
TransactionCommand is protected against that, but if not ProcKill()
which invokes LWLockReleaseAll() runs pretty late in the teardown
process...

Hmm. I guess it'd be fine to decide that a connected slot can be
marked not-connected without the lock.
I now acquire the spinlock since that has to work, or we have much worse
problems... That guarantees that other backends see the value as well.
I think you'd want a rule that
a slot can't be freed except when it's not-connected; otherwise, you
might end up marking the slot not-connected after someone else had
already recycled it for an unrelated purpose (drop slot, create new
slot).
Yea, that rule is there. Otherwise we'd get in great trouble.
- There seems to be no interface to acquire or release slots from
either SQL or the replication protocol, nor any way for a client of
this code to update its slot details.

I don't think either ever needs to do that - START_REPLICATION SLOT slot
...; and decoding_slot_*changes will acquire/release for them while
active. What would the use case be to allow them to be acquired from SQL?

My point isn't so much about SQL as that with just this patch I don't
see any way for anyone to ever acquire a slot for anything, ever. So
I think there's a piece missing, or three.
The slot is acquired by code using the slot. So when START_REPLICATION
SLOT ... (in contrast to a START_REPLICATION without SLOT) is sent,
walsender.c does a ReplicationSlotAcquire(cmd->slotname) in
StartReplication() and releases it after it has finished.
The slot details get updated by the respective replication code. For
streaming rep, that should happen via reply and feedback
messages. For changeset extraction it happens when
LogicalConfirmReceivedLocation() is called; the walsender interface
does that using reply messages, the SQL interface calls it when
finished (unless you use the _peek_ functions).

Right, but where is this code? I don't see this updating the reply
and feedback message processing code to touch slots. Did I miss that?
It's in "wal_decoding: logical changeset extraction walsender interface"
currently :(. Splitting the streaming replication part of that patch off
isn't easy...
Greetings,
Andres Freund
On 01/16/2014 02:28 AM, Robert Haas wrote:
- If you address /* FIXME: apply sanity checking to slot name */, then
I think that also addresses /* XXX: do we want to use truncate
identifier instead? */. In other words, let's just error out if the
name is too long. I'm not sure what other sanity checking is needed
here; maybe just restrict it to 7-bit characters to avoid problems
with encodings for individual databases varying.
It's a common misunderstanding that restricting to 7-bit solves encoding
issues.
Thanks to the joy that is SHIFT_JIS, we must also disallow the backslash
and tilde characters.
Anybody who actually uses SHIFT_JIS as an operational encoding, rather
than as an input/output encoding, is into pain and suffering. Personally
I'd be quite happy to see it supported as client_encoding, but forbidden
as a server-side encoding. That's not the case right now - so since we
support it, we'd better guard against its quirks.
Slot names can't be regular identifiers, because they might contain
characters not valid in another DB's encoding. So let's just restrict
them to [a-zA-Z0-9_ -] and be done with it.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jan 16, 2014 at 10:15 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
Anybody who actually uses SHIFT_JIS as an operational encoding, rather
than as an input/output encoding, is into pain and suffering. Personally
I'd be quite happy to see it supported as client_encoding, but forbidden
as a server-side encoding. That's not the case right now - so since we
support it, we'd better guard against its quirks.
I think that *is* the case right now. pg_wchar.h sayeth:
/* followings are for client encoding only */
PG_SJIS, /* Shift JIS
(Winindows-932) */
PG_BIG5, /* Big5 (Windows-950) */
PG_GBK, /* GBK (Windows-936) */
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jan 16, 2014 at 9:54 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Maybe it would be better to get rid of active/in_use and have
three states: REPLSLOT_CONNECTED, REPLSLOT_NOT_CONNECTED,
REPLSLOT_FREE. Or something like that.

Hm. Color me unenthusiastic. If you feel strongly I can change it, but
otherwise not.
I found the active/in_use distinction confusing; I thought one
three-state flag rather than two Booleans might be clearer. But I
might be able to just suck it up.
- If there's a coding rule that slot->database can't be changed while
the slot is active, then the check to make sure that the user isn't
trying to bind to a slot with a mis-matching database could be done
before the code described in the previous point, avoiding the need to
go back and release the resource.

I don't think slot->database should be allowed to change at all...
Well, it can if the slot is dropped and a new one created.
Well. That obviously requires the lwlock to be acquired...
Right, so the point of this comment originally was you had some logic
that could be moved sooner to avoid having to undo so much on a
failure.
- I think the critical section in ReplicationSlotDrop is bogus. If
DeleteSlot() fails, we scarcely need to PANIC. The slot just isn't
gone.

Well, if delete slot fails, we don't really know at which point it
failed which means that the on-disk state might not correspond to the
in-memory state. I don't see a point in adding code trying to handle
that case correctly...

Deleting the slot should be an atomic operation. There's some
critical point before which the slot will be picked up by recovery and
after which it won't. You either did that operation, or not, and can
adjust the in-memory state accordingly.

I am not sure I understand that point. We can either update the
in-memory bit before performing the on-disk operations or
afterwards. Either way, there's a way to be inconsistent if the disk
operation fails somewhere in between (it might fail but still have
deleted the file/directory!). The normal way to handle that in other
places is PANICing when we don't know so we recover from the on-disk
state.
I really don't see the problem here? Code doesn't get more robust by
doing s/PANIC/ERROR/, rather the contrary. It takes extra smarts to only
ERROR, often that's not warranted.
People get cranky when the database PANICs because of a filesystem
failure. We should avoid that, especially when it's trivial to do so.
The update to shared memory should be done second and should be set
up to be no-fail.
The slot details get updates by the respective replication code. For
streaming rep, that should happen via reply and feedback
messages. For changeset extraction it happens when
LogicalConfirmReceivedLocation() is called; the walsender interface
does that using reply messages, the SQL interface calls it when
finished (unless you use the _peek_ functions).

Right, but where is this code? I don't see this updating the reply
and feedback message processing code to touch slots. Did I miss that?

It's in "wal_decoding: logical changeset extraction walsender interface"
currently :(. Splitting the streaming replication part of that patch off
isn't easy...
Ack. I was hoping to work through these patches one at a time, but
that's not going to work if they are interdependent to that degree.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 01/18/2014 09:31 PM, Robert Haas wrote:
On Thu, Jan 16, 2014 at 10:15 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
Anybody who actually uses SHIFT_JIS as an operational encoding, rather
than as an input/output encoding, is into pain and suffering. Personally
I'd be quite happy to see it supported as client_encoding, but forbidden
as a server-side encoding. That's not the case right now - so since we
support it, we'd better guard against its quirks.

I think that *is* the case right now. pg_wchar.h sayeth:
/* followings are for client encoding only */
PG_SJIS, /* Shift JIS
(Winindows-932) */
PG_BIG5, /* Big5 (Windows-950) */
PG_GBK, /* GBK (Windows-936) */
Perfect - that makes ASCII-only just fine, IMO.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Jan 16, 2014 at 10:15 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
Anybody who actually uses SHIFT_JIS as an operational encoding, rather
than as an input/output encoding, is into pain and suffering. Personally
I'd be quite happy to see it supported as client_encoding, but forbidden
as a server-side encoding. That's not the case right now - so since we
support it, we'd better guard against its quirks.
I think that *is* the case right now.
SHIFT_JIS is not and never will be allowed as a server encoding,
precisely because it has multi-byte characters of which some bytes could
be taken for ASCII. The same is true of our other client-only encodings.
regards, tom lane
On 01/18/2014 02:31 PM, Robert Haas wrote:
On Thu, Jan 16, 2014 at 10:15 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
Anybody who actually uses SHIFT_JIS as an operational encoding, rather
than as an input/output encoding, is into pain and suffering. Personally
I'd be quite happy to see it supported as client_encoding, but forbidden
as a server-side encoding. That's not the case right now - so since we
support it, we'd better guard against its quirks.

I think that *is* the case right now. pg_wchar.h sayeth:
/* followings are for client encoding only */
PG_SJIS, /* Shift JIS
(Winindows-932) */
while you have that file open: s/Winindows-932/Windows-932/ maybe?
Stefan
On 2014-01-18 08:35:47 -0500, Robert Haas wrote:
I am not sure I understand that point. We can either update the
in-memory bit before performing the on-disk operations or
afterwards. Either way, there's a way to be inconsistent if the disk
operation fails somewhere in between (it might fail but still have
deleted the file/directory!). The normal way to handle that in other
places is PANICing when we don't know so we recover from the on-disk
state.
I really don't see the problem here? Code doesn't get more robust by
doing s/PANIC/ERROR/, rather the contrary. It takes extra smarts to only
ERROR, often that's not warranted.
People get cranky when the database PANICs because of a filesystem
failure. We should avoid that, especially when it's trivial to do so.
The update to shared memory should be done second and should be set
up to be no-fail.
I don't see how that would help. If we fail during unlink/rmdir, we
don't really know at which point we failed. Just keeping the slot in
memory won't help us in any way - we'll continue to reserve resources
while the slot is half-gone.
I don't think trying to handle errors we don't understand and we don't
routinely expect actually improves robustness. It just leads to harder
to diagnose errors. It's not like the cases here are likely to be caused
by anything but a severe admin failure like removing the write
permissions of the postgres directory while the server is running. Or do
you see more valid causes?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jan 22, 2014 at 9:48 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-01-18 08:35:47 -0500, Robert Haas wrote:
I am not sure I understand that point. We can either update the
in-memory bit before performing the on-disk operations or
afterwards. Either way, there's a way to be inconsistent if the disk
operation fails somewhere in between (it might fail but still have
deleted the file/directory!). The normal way to handle that in other
places is PANICing when we don't know so we recover from the on-disk
state.
I really don't see the problem here? Code doesn't get more robust by
doing s/PANIC/ERROR/, rather the contrary. It takes extra smarts to only
ERROR, often that's not warranted.
People get cranky when the database PANICs because of a filesystem
failure. We should avoid that, especially when it's trivial to do so.
The update to shared memory should be done second and should be set
up to be no-fail.
I don't see how that would help. If we fail during unlink/rmdir, we
don't really know at which point we failed.
This doesn't make sense to me. unlink/rmdir are atomic operations.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
Attached is v7.1 of the patchset with (among others) the following
changes:
* rebase to master
* split the slot support for streaming replication into a separate
patch, early in the series
* slot names are now limited to /[a-z0-9_]{1,NAMEDATALEN-1}/
* computation of the initial xmin for changeset extraction is now done
with an extra routine getting rid of races around GetOldestXmin()
going forwards and then backwards and getting rid of an additional
parameter to GetOldestXmin().
* slot locking is rejiggered according to Robert's suggestions
* comment improvements
* sgml documentation improvements
* ...
I think the sgml documentation is in reasonable shape now; I'd
appreciate somebody else having a look. I think a bit more effort is
required in protocol.sgml.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2014-01-22 10:14:27 -0500, Robert Haas wrote:
On Wed, Jan 22, 2014 at 9:48 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-01-18 08:35:47 -0500, Robert Haas wrote:
I am not sure I understand that point. We can either update the
in-memory bit before performing the on-disk operations or
afterwards. Either way, there's a way to be inconsistent if the disk
operation fails somewhere in between (it might fail but still have
deleted the file/directory!). The normal way to handle that in other
places is PANICing when we don't know so we recover from the on-disk
state.
I really don't see the problem here? Code doesn't get more robust by
doing s/PANIC/ERROR/, rather the contrary. It takes extra smarts to only
ERROR, often that's not warranted.
People get cranky when the database PANICs because of a filesystem
failure. We should avoid that, especially when it's trivial to do so.
The update to shared memory should be done second and should be set
up to be no-fail.
I don't see how that would help. If we fail during unlink/rmdir, we
don't really know at which point we failed.
This doesn't make sense to me. unlink/rmdir are atomic operations.
Yes, individual operations should be, but you cannot be sure whether a
rename()/unlink() will survive a crash until the directory is
fsync()ed. So, what is one going to do if the unlink succeeded, but the
fsync didn't?
Deletion currently works like:
    if (rename(path, tmppath) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not rename \"%s\" to \"%s\": %m",
                        path, tmppath)));

    /* make sure no partial state is visible after a crash */
    fsync_fname(tmppath, false);
    fsync_fname("pg_replslot", true);

    if (!rmtree(tmppath, true))
    {
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not remove directory \"%s\": %m",
                        tmppath)));
    }
If we fail between the rename() and the fsync_fname() we don't really
know which state we are in. We'd also have to add code to handle
incomplete slot directories, which currently only exists for startup, to
other places.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jan 22, 2014 at 10:48 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Yes, individual operations should be, but you cannot be sure whether a
rename()/unlink() will survive a crash until the directory is
fsync()ed. So, what is one going to do if the unlink succeeded, but the
fsync didn't?
Well, apparently, one is going to PANIC and reinitialize the system.
I presume that upon reinitialization we'll decide that the slot is
gone, and thus won't recreate it in shared memory. Of course, if the
entire system suffers a hard power failure after that and before the
directory is successfully fsync'd, then the slot could reappear on the
next startup. Which is also exactly what would happen if we removed
the slot from shared memory after doing the unlink, and then the
system suffered a hard power failure before the directory contents
made it to disk. Except that we also panicked.
In the case of shared buffers, the way we handle fsync failures is by
not allowing the system to checkpoint until all of the fsyncs succeed.
If there's an OS-level reset before that happens, WAL replay will
perform the same buffer modifications over again and the next
checkpoint will again try to flush them to disk and will not complete
unless it does. That forms a closed system where we never advance the
redo pointer over the covering WAL record until the changes it covers
are on the disk. But I don't think this code has any similar
interlock; if it does, I missed it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2014-01-22 13:00:44 -0500, Robert Haas wrote:
Well, apparently, one is going to PANIC and reinitialize the system.
I presume that upon reinitialization we'll decide that the slot is
gone, and thus won't recreate it in shared memory.
Yea, and if it's half-gone we'll continue deletion. And since yesterday
evening we'll even fsync things during startup to handle scenarios
similar to 20140122162115.GL21170@alap3.anarazel.de .
Of course, if the entire system suffers a hard power failure after that and before the
directory is successfully fsync'd, then the slot could reappear on the
next startup. Which is also exactly what would happen if we removed
the slot from shared memory after doing the unlink, and then the
system suffered a hard power failure before the directory contents
made it to disk. Except that we also panicked.
Yes, but that could only happen as long as no relevant data has been
lost since we hold relevant locks during this.
In the case of shared buffers, the way we handle fsync failures is by
not allowing the system to checkpoint until all of the fsyncs succeed.
I don't think shared buffers fsyncs are the apt comparison. It's more
something like UpdateControlFile(). Which PANICs.
I really don't get why you fight PANICs in general that much. There are
some nasty PANICs in postgres which can happen in legitimate situations,
which should be made to fail more gracefully, but this surely isn't one
of them. We're doing rename(), unlink() and rmdir(). That's it.
We should concentrate on the ones that legitimately can happen, not the
ones created by an admin running a chmod -R 000 . ; rm -rf $PGDATA or
mount -o remount,ro /. We don't increase reliability one bit by adding
codepaths that will never get tested.
If there's an OS-level reset before that happens, WAL replay will
perform the same buffer modifications over again and the next
checkpoint will again try to flush them to disk and will not complete
unless it does. That forms a closed system where we never advance the
redo pointer over the covering WAL record until the changes it covers
are on the disk. But I don't think this code has any similar
interlock; if it does, I missed it.
No, it doesn't (until the first rename() at least), but the number of
failure scenarios is far smaller.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jan 23, 2014 at 7:05 AM, Andres Freund <andres@2ndquadrant.com> wrote:
I don't think shared buffers fsyncs are the apt comparison. It's more
something like UpdateControlFile(). Which PANICs.
I really don't get why you fight PANICs in general that much. There are
some nasty PANICs in postgres which can happen in legitimate situations,
which should be made to fail more gracefully, but this surely isn't one
of them. We're doing rename(), unlink() and rmdir(). That's it.
We should concentrate on the ones that legitimately can happen, not the
ones created by an admin running a chmod -R 000 . ; rm -rf $PGDATA or
mount -o remount,ro /. We don't increase reliability one bit by adding
codepaths that will never get tested.
Sorry, I don't buy it. Lots of people I know have stories that go
like this "$HORRIBLE happened, and PostgreSQL kept on running, and it
didn't even lose my data!", where $HORRIBLE may be variously that the
disk filled up, that disk writes started failing with I/O errors, that
somebody changed the permissions on the data directory inadvertently,
that the entire data directory got removed, and so on. I've been
through some of those scenarios myself, and the care and effort that's
been put into failure modes has saved my bacon more than a few times,
too. We *do* increase reliability by worrying about what will happen
even in code paths that very rarely get exercised. It's certainly
true that our bug count is higher there than for the parts of
our code that get exercised more regularly, but it's also lower than
it would be if we didn't make the effort, and the dividend that we get
from that effort is that we have a well-deserved reputation for
reliability.
I think it's completely unacceptable for the failure of routine
filesystem operations to result in a PANIC. I grant you that we have
some existing cases where that can happen (like UpdateControlFile),
but that doesn't mean we should add more. Right this very minute
there is massive bellyaching on a nearby thread caused by the fact
that a full disk condition while writing WAL can PANIC the server,
while on this thread at the very same time you're arguing that adding
more ways for a full disk to cause PANICs won't inconvenience anyone.
The other thread is right, and your argument here is wrong. We have
been able to - and have taken the time to - fix comparable problems in
other cases, and we should do the same thing here.
As for why I fight PANICs so much in general, there are two reasons.
First, I believe that to be project policy. I welcome correction if I
have misinterpreted our stance in that area. Second, I have
encountered a few situations where customers had production servers
that repeatedly PANICked due to some bug or other. If I've ever
encountered angrier customers, I can't remember when. A PANIC is no
big deal when it happens on your development box, but when it happens
on a machine with 100 users connected to it, it's a big deal,
especially if a single underlying cause makes it happen over and over
again.
I think we should be devoting our time to figuring how to improve
this, not whether to improve it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2014-01-23 11:50:57 -0500, Robert Haas wrote:
On Thu, Jan 23, 2014 at 7:05 AM, Andres Freund <andres@2ndquadrant.com> wrote:
I don't think shared buffers fsyncs are the apt comparison. It's more
something like UpdateControlFile(). Which PANICs.
I really don't get why you fight PANICs in general that much. There are
some nasty PANICs in postgres which can happen in legitimate situations,
which should be made to fail more gracefully, but this surely isn't one
of them. We're doing rename(), unlink() and rmdir(). That's it.
We should concentrate on the ones that legitimately can happen, not the
ones created by an admin running a chmod -R 000 . ; rm -rf $PGDATA or
mount -o remount,ro /. We don't increase reliability one bit by adding
codepaths that will never get tested.
Sorry, I don't buy it. Lots of people I know have stories that go
like this "$HORRIBLE happened, and PostgreSQL kept on running, and it
didn't even lose my data!", where $HORRIBLE may be variously that the
disk filled up, that disk writes started failing with I/O errors, that
somebody changed the permissions on the data directory inadvertently,
that the entire data directory got removed, and so on.
Especially the "not losing data" part is, imo, because postgres is
conservative about continuing in situations it doesn't know anything
about - most prominently the cluster-wide restart after a segfault.
I've been
through some of those scenarios myself, and the care and effort that's
been put into failure modes has saved my bacon more than a few times,
too. We *do* increase reliability by worrying about what will happen
even in code paths that very rarely get exercised.
A part of thinking about them *is* restricting what happens in those
cases by keeping the possible states to worry about to a minimum.
Just slapping on an ERROR instead of a PANIC can make things much
worse. Not releasing space until a restart - without a chance to do
anything about it, because we failed to properly release the in-memory
slot - will just make the problem bigger, because now the cleanup might
take a week (VACUUM FULLing the entire cluster?).
I think it's completely unacceptable for the failure of routine
filesystem operations to result in a PANIC. I grant you that we have
some existing cases where that can happen (like UpdateControlFile),
but that doesn't mean we should add more. Right this very minute
there is massive bellyaching on a nearby thread caused by the fact
that a full disk condition while writing WAL can PANIC the server,
while on this thread at the very same time you're arguing that adding
more ways for a full disk to cause PANICs won't inconvenience anyone.
A full disk won't cause any of the problems for the case we're
discussing, will it? We're just doing rename(), unlink(), rmdir() here,
all should succeed while the FS is full (afair rename() does on all
common FSs because inodes are kept separately).
The other thread is right, and your argument here is wrong. We have
been able to - and have taken the time to - fix comparable problems in
other cases, and we should do the same thing here.
I don't think the WAL case is comparable at all. ENOSPC is something
expected that can happen during normal operation, doesn't involve
operator malice, and is reasonably easy to test. unlink() or fsync()
randomly failing is not.
In fact, isn't the consequence out of that thread that we need a
significant amount of extra complexity to handle the case? We shouldn't
spend that effort for cases that don't deserve it because they are too
unlikely in practice.
And yes, there's not too many other places PANICing - because most can
rely on WAL handling those tricky cases for them...
Second, I have
encountered a few situations where customers had production servers
that repeatedly PANICked due to some bug or other. If I've ever
encountered angrier customers, I can't remember when. A PANIC is no
big deal when it happens on your development box, but when it happens
on a machine with 100 users connected to it, it's a big deal,
especially if a single underlying cause makes it happen over and over
again.
Sure. But blindly continuing and then, possibly quite a bit later,
losing data or causing an outage that takes a long while to recover
from isn't any better.
I think we should be devoting our time to figuring how to improve
this, not whether to improve it.
If you'd argue that creating a new slot should fail gracefully, ok, I
can relatively easily be convinced of that. But trying to handle
failures in the midst of deletion, in cases that won't happen in
reality, is just inviting trouble imo.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services