Sync Rep: First Thoughts on Code

Started by Simon Riggs | over 17 years ago | 178 messages | hackers
#1Simon Riggs
simon@2ndQuadrant.com

Breaking down of patch into sections works very well for review. Should
allow us to get different reviewers on different parts of the code -
review wranglers please take note: Dave, Josh.

Can you confirm that all the docs on the Wiki page are up to date? There
are a few minor discrepancies that make me think it isn't.

Examples: "For example, to make a single multi-statement transaction
replication asynchronously when the default is the opposite, issue SET
LOCAL synchronous_commit TO OFF within the transaction."
Do we mean synchronous_replication in this sentence? I think you've
copied the text and not changed all of the necessary parts - please
re-read the whole section (probably the whole Wiki, actually).

"wal_writer_delay" - do we mean wal_sender_delay? Is there some ability
to measure the amount of data to be sent and avoid the delay altogether,
when the server is sufficiently busy?

The reaction to replication_timeout may need to be configurable. I might
not want to keep on processing if the information didn't reach the
standby. I would prefer in many cases that the transactions that were
waiting for walsender would abort, but the walsender kept processing.
How can we restart the walsender if it shuts down? Do we want a maximum
wait for a transaction and a maximum wait for the server? Do we report
stats on how long the replication has been taking? If the average rep
time is close to rep timeout then we will be fragile, so we need some
way to notice this and produce warnings. Or at least provide info to an
external monitoring system.

How do we specify the user we use to connect to primary?

Definitely need more explanatory comments/README-style docs.

For example, 03_libpq seems simple and self-contained. I'm not sure why
we have a state called PGASYNC_REPLICATION; I was hoping that would be
dynamic, but I'm not sure where to look for that. It would be useful to
have a very long comment within the code to explain how the replication
messages work, and note on each function who the intended client and
server is.

02_pqcomm: What does HAVE_POLL mean? Do we need to worry about periodic
renegotiation of keys in be-secure.c? Not sure I understand why so many
new functions in there.

04_recovery_conf is a change I agree with, though I think it may not
work with EXEC_BACKEND for Windows.

05... I need some commentary to explain this better.

06 and 07 are large and will take substantial review time. So we must
get the overall architecture done first and then check the code that
implements that.

08 - I think I get this, but some docs will help to confirm.

09 pg_standby changes: so more changes are coming there? OK. Can we
refer to those two options as failover and switchover? There's no need
to change definitions that many Postgres people already use. This change
can be done without making any change to server behaviour, so this
change can have benefit to 8.2 and 8.3 people also.

01_signal_handling: I've looked at the LWlock acquires and releases in
the patch and am fairly happy, except for the ProcArrayLock acquire
during this sub-patch. Do we really need to do things this way? Is the
actual state important? Could we just do this with a counter which
cycles? So callers increment counter atomically and the reader just
polls to see if anybody has incremented? Or could we protect that part
of the proc with a different lock? Touching ProcArrayLock is bad news.
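
The counter idea above can be sketched roughly as follows. This is only an illustration in Python, not code from the patch: the class name `WakeupCounter` and its methods are hypothetical, and a `threading.Lock` stands in for whatever atomic increment the real code would use.

```python
import threading


class WakeupCounter:
    """Wrap-around counter: writers bump it, a single reader polls for change."""

    MODULUS = 2 ** 32  # the counter cycles, like an unsigned 32-bit integer

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()  # stands in for an atomic increment
        self._last_seen = 0            # private to the single reader

    def notify(self):
        """Called by any writer: increment and wrap around."""
        with self._lock:
            self._value = (self._value + 1) % self.MODULUS

    def poll(self):
        """Called by the reader: True if anyone notified since the last poll."""
        current = self._value
        changed = current != self._last_seen
        self._last_seen = current
        return changed
```

The reader never takes the lock and only learns that something changed, not what; that is the trade-off behind the question of whether the actual state matters. A wake-up would be missed only if exactly `MODULUS` notifications arrived between two polls.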

Anyway, feeling very positive about this. Hope we can get this reviewed
and committed in next 3-4 weeks.

I have many clues as to how to structure my own work also. Thanks.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#2Fujii Masao
masao.fujii@gmail.com
In reply to: Simon Riggs (#1)
Re: Sync Rep: First Thoughts on Code

Hi, Simon.

Thanks for taking many hours to review the code!!

On Mon, Dec 1, 2008 at 8:42 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Can you confirm that all the docs on the Wiki page are up to date? There
are a few minor discrepancies that make me think it isn't.

Documentation is ongoing. Sorry for my slow progress.

BTW, I'm going to add and change the sgml files listed on wiki.
http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#Documentation_Plan

Examples: "For example, to make a single multi-statement transaction
replication asynchronously when the default is the opposite, issue SET
LOCAL synchronous_commit TO OFF within the transaction."
Do we mean synchronous_replication in this sentence? I think you've
copied the text and not changed all of the necessary parts - please
re-read the whole section (probably the whole Wiki, actually).

Oops! It's just a typo. Sorry for the confusion.
I will revise this section.

"wal_writer_delay" - do we mean wal_sender_delay?

Yes. I will fix it.

Is there some ability
to measure the amount of data to be sent and avoid the delay altogether,
when the server is sufficiently busy?

Why is the former ability required?

The latter is possible, I think. We can guarantee that the WAL is sent (more
precisely, that send(2) is called) at least once per wal_sender_delay. Of
course, this depends on the kernel's scheduler.

The reaction to replication_timeout may need to be configurable. I might
not want to keep on processing if the information didn't reach the
standby.

OK. I will add a new GUC variable (PGC_SIGHUP) to specify the reaction to
the timeout.

I would prefer in many cases that the transactions that were
waiting for walsender would abort, but the walsender kept processing.

Isn't it dangerous to abort the transaction while replication continues when
the timeout occurs? I think that WAL consistency between the two servers
might be broken, because WAL writing and sending are done concurrently,
and the backend might already have written the WAL to disk on the primary
while waiting for walsender.

How can we restart the walsender if it shuts down?

Just restart the standby (with walreceiver). The standby connects to
the postmaster on the primary, and the postmaster then forks a new walsender.

Do we want a maximum
wait for a transaction and a maximum wait for the server?

ISTM that these features are too much.

Do we report
stats on how long the replication has been taking? If the average rep
time is close to rep timeout then we will be fragile, so we need some
way to notice this and produce warnings. Or at least provide info to an
external monitoring system.

Sounds good. How about log_min_duration_replication? If the rep time
exceeds it, we produce a warning (or log entry), like log_min_duration_xx.

How do we specify the user we use to connect to primary?

Yes, I need to add a new option to recovery.conf to specify the user
name. Thanks for reminding me!

Definitely need more explanatory comments/README-style docs.

Completely agreed ;-)
I will write a README together with the other documents.

For example, 03_libpq seems simple and self-contained. I'm not sure why
we have a state called PGASYNC_REPLICATION; I was hoping that would be
dynamic, but I'm not sure where to look for that. It would be useful to
have a very long comment within the code to explain how the replication
messages work, and note on each function who the intended client and
server is.

OK. I will re-consider whether PGASYNC_REPLICATION is removable, and
write the comment about it.

02_pqcomm: What does HAVE_POLL mean?

It identifies whether poll(2) is available on the platform. We use poll(2)
if it's defined, otherwise select(2). There is similar code in pqSocketPoll()
in fe-misc.c.
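
The same poll-or-select fallback can be shown in a few lines. This is a sketch in Python rather than the C of the patch; `wait_readable` is a hypothetical helper, and the `hasattr` check only plays the role that the HAVE_POLL macro plays at compile time.

```python
import select
import socket


def wait_readable(sock, timeout_s):
    """Return True if sock becomes readable within timeout_s seconds."""
    if hasattr(select, "poll"):  # analogous to HAVE_POLL being defined
        poller = select.poll()
        poller.register(sock, select.POLLIN)
        events = poller.poll(int(timeout_s * 1000))  # poll() takes milliseconds
        return len(events) > 0
    # Fallback, analogous to the select(2) branch
    readable, _, _ = select.select([sock], [], [], timeout_s)
    return len(readable) > 0
```

For example, with `a, b = socket.socketpair()`, `wait_readable(a, 0.05)` returns False until something is sent on `b`.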

Do we need to worry about periodic
renegotiation of keys in be-secure.c?

What do you mean by "keys"?

Not sure I understand why so many
new functions in there.

It's because walsender waits for the reply from the standby and the
request from the backend concurrently. So, we need poll(2) or select(2)
to make walsender wait for them, and some functions for non-blocking
receiving.

04_recovery_conf is a change I agree with, though I think it may not
work with EXEC_BACKEND for Windows.

OK. I will examine and fix it.

05... I need some commentary to explain this better.

06 and 07 are large and will take substantial review time. So we must
get the overall architecture done first and then check the code that
implements that.

08 - I think I get this, but some docs will help to confirm.

Yes. I need more documentation.

09 pg_standby changes: so more changes are coming there? OK. Can we
refer to those two options as failover and switchover?

Do you mean a failover trigger and a switchover trigger? ISTM that those
names might not suit the features.

Naming always bothers me, and the current name "commit/abort trigger"
might cause confusion. Is there a more suitable name?

There's no need
to change definitions that many Postgres people already use. This change
can be done without making any change to server behaviour, so this
change can have benefit to 8.2 and 8.3 people also.

Agreed.

01_signal_handling: I've looked at the LWlock acquires and releases in
the patch and am fairly happy, except for the ProcArrayLock acquire
during this sub-patch. Do we really need to do things this way? Is the
actual state important? Could we just do this with a counter which
cycles? So callers increment counter atomically and the reader just
polls to see if anybody has incremented? Or could we protect that part
of the proc with a different lock? Touching ProcArrayLock is bad news.

Agreed. I will add a new lock for proc.signalFlags.

Anyway, feeling very positive about this. Hope we can get this reviewed
and committed in next 3-4 weeks.

I have many clues as to how to structure my own work also. Thanks.

Thanks again!

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#3Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#2)
Re: Sync Rep: First Thoughts on Code

On Tue, 2008-12-02 at 21:37 +0900, Fujii Masao wrote:

Thanks for taking many hours to review the code!!

On Mon, Dec 1, 2008 at 8:42 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Can you confirm that all the docs on the Wiki page are up to date? There
are a few minor discrepancies that make me think it isn't.

Documentation is ongoing. Sorry for my slow progress.

BTW, I'm going to add and change the sgml files listed on wiki.
http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#Documentation_Plan

I'm patient, I know it takes time. Happy to spend hours on the review,
but I want to do that knowing I agree with the higher level features and
architecture first.

This was just a first review, I expect to spend more time on it yet.

The reaction to replication_timeout may need to be configurable. I might
not want to keep on processing if the information didn't reach the
standby.

OK. I will add new GUC variable (PGC_SIGHUP) to specify the reaction for
the timeout.

I would prefer in many cases that the transactions that were
waiting for walsender would abort, but the walsender kept processing.

Is it dangerous to abort the transaction with replication continued when
the timeout occurs? I think that the WAL consistency between two servers
might be broken. Because the WAL writing and sending are done concurrently,
and the backend might already write the WAL to disk on the primary when
waiting for walsender.

The issue I see is that we might want to keep wal_sender_delay small so
that transaction times are not increased. But we also want
wal_sender_delay high so that replication never breaks. It seems better
to have the action on wal_sender_delay configurable if we have an
unsteady network (like the internet). Marcus made some comments on line
dropping that seem relevant here; we should listen to his experience.

Hmmm, dangerous? Well assuming we're linking commits with replication
sends then it sounds it. We might end up committing to disk and then
deciding to abort instead. But remember we don't remove the xid from
procarray or mark the result in clog until the flush is over, so it is
possible. But I think we should discuss this in more detail when the
main patch is committed.

Do we report
stats on how long the replication has been taking? If the average rep
time is close to rep timeout then we will be fragile, so we need some
way to notice this and produce warnings. Or at least provide info to an
external monitoring system.

Sounds good. How about log_min_duration_replication? If the rep time
is greater than it, we produce warning (or log) like log_min_duration_xx.

Maybe; let's put in something that logs if >50% (?) of timeout. Make that
configurable with a #define and see if we need that to be configurable
with a GUC later.
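
As a rough sketch of that check (Python for brevity; the names `WARN_FRACTION` and `check_replication_time` are hypothetical, and 0.5 is just the 50% floated above):

```python
import logging

WARN_FRACTION = 0.5  # plays the role of the proposed #define


def check_replication_time(elapsed_ms, timeout_ms,
                           logger=logging.getLogger("replication")):
    """Warn when one replication wait used more than WARN_FRACTION of the timeout.

    Returns True if a warning was logged.
    """
    if elapsed_ms > WARN_FRACTION * timeout_ms:
        logger.warning("replication took %d ms (timeout is %d ms)",
                       elapsed_ms, timeout_ms)
        return True
    return False
```

Keeping the threshold as a single constant makes it easy to promote to a run-time setting later if that turns out to be needed.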

Do we need to worry about periodic
renegotiation of keys in be-secure.c?

What is "keys" you mean?

See the notes in that file for explanation.

I wondered whether it might be a perf problem for us?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#4Jeff Davis
pgsql@j-davis.com
In reply to: Simon Riggs (#3)
Re: Sync Rep: First Thoughts on Code

On Tue, 2008-12-02 at 13:09 +0000, Simon Riggs wrote:

Is it dangerous to abort the transaction with replication continued when
the timeout occurs? I think that the WAL consistency between two servers
might be broken. Because the WAL writing and sending are done concurrently,
and the backend might already write the WAL to disk on the primary when
waiting for walsender.

The issue I see is that we might want to keep wal_sender_delay small so
that transaction times are not increased. But we also want
wal_sender_delay high so that replication never breaks. It seems better
to have the action on wal_sender_delay configurable if we have an
unsteady network (like the internet). Marcus made some comments on line
dropping that seem relevant here; we should listen to his experience.

Hmmm, dangerous? Well assuming we're linking commits with replication
sends then it sounds it. We might end up committing to disk and then
deciding to abort instead. But remember we don't remove the xid from
procarray or mark the result in clog until the flush is over, so it is
possible. But I think we should discuss this in more detail when the
main patch is committed.

What is the "it" in "it is possible"? It seems like there's still a
problem window in there.

Even if that could be made safe, in the event of a real network failure,
you'd just wait the full timeout every transaction, because it still
thinks it's replicating.

If the timeout is exceeded, it seems more reasonable to abandon the
slave until you could re-sync it and continue processing as normal. As
you pointed out, that's not necessarily an expensive operation because
you can use something like rsync. The process of re-syncing might be
made easier (or perhaps less costly), of course.

If we want to still allow processing to happen after a timeout, it seems
reasonable to have a configurable option to allow/disallow non-read-only
transactions when out of sync.

Regards,
Jeff Davis

#5Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Davis (#4)
Re: Sync Rep: First Thoughts on Code

On Tue, 2008-12-02 at 11:08 -0800, Jeff Davis wrote:

On Tue, 2008-12-02 at 13:09 +0000, Simon Riggs wrote:

Is it dangerous to abort the transaction with replication continued when
the timeout occurs? I think that the WAL consistency between two servers
might be broken. Because the WAL writing and sending are done concurrently,
and the backend might already write the WAL to disk on the primary when
waiting for walsender.

The issue I see is that we might want to keep wal_sender_delay small so
that transaction times are not increased. But we also want
wal_sender_delay high so that replication never breaks. It seems better
to have the action on wal_sender_delay configurable if we have an
unsteady network (like the internet). Marcus made some comments on line
dropping that seem relevant here; we should listen to his experience.

Hmmm, dangerous? Well assuming we're linking commits with replication
sends then it sounds it. We might end up committing to disk and then
deciding to abort instead. But remember we don't remove the xid from
procarray or mark the result in clog until the flush is over, so it is
possible. But I think we should discuss this in more detail when the
main patch is committed.

What is the "it" in "it is possible"? It seems like there's still a
problem window in there.

Marking a transaction aborted after we have written a commit record, but
before we have removed it from proc array and marked in clog. We'd need
a special kind of WAL record to do that.

Even if that could be made safe, in the event of a real network failure,
you'd just wait the full timeout every transaction, because it still
thinks it's replicating.

True, but I did suggest having two timeouts.

There is considerable reason to reduce the timeout as well as reason to
increase it - at the same time.

Anyway, let's wait for some user experience following commit.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#6Josh Berkus
josh@agliodbs.com
In reply to: Simon Riggs (#1)
Re: Sync Rep: First Thoughts on Code

Breaking down of patch into sections works very well for review. Should
allow us to get different reviewers on different parts of the code -
review wranglers please take note: Dave, Josh.

Fujii-san, could you break the patch up into several parts? We have quite
a few junior reviewers who are idle right now.

--
--Josh

Josh Berkus
PostgreSQL
San Francisco

#7Josh Berkus
josh@agliodbs.com
In reply to: Jeff Davis (#4)
Re: Sync Rep: First Thoughts on Code

Jeff,

Even if that could be made safe, in the event of a real network failure,
you'd just wait the full timeout every transaction, because it still
thinks it's replicating.

Hmmm. I'd suggest that if we get timeouts for more than 10xTimeout value
in a row, that replication stops. Unfortunately, we should probably make
that *another* configuration setting.
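
One way to read this suggestion is "give up after more than N consecutive timeouts". A minimal sketch under that assumption follows; `TimeoutTracker` and `max_consecutive` are hypothetical names, not anything in the patch.

```python
class TimeoutTracker:
    """Counts consecutive replication timeouts and decides when to give up."""

    def __init__(self, max_consecutive=10):
        self.max_consecutive = max_consecutive  # hypothetical setting
        self.consecutive = 0
        self.replicating = True

    def record(self, timed_out):
        """Record one commit wait; return whether replication continues."""
        if not self.replicating:
            return False
        if timed_out:
            self.consecutive += 1
            if self.consecutive > self.max_consecutive:
                self.replicating = False
        else:
            self.consecutive = 0  # any success resets the run
        return self.replicating
```

Once `replicating` turns False it stays False; re-enabling it would correspond to re-syncing the standby.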

--
--Josh

Josh Berkus
PostgreSQL
San Francisco

#8Fujii Masao
masao.fujii@gmail.com
In reply to: Josh Berkus (#6)
Re: Sync Rep: First Thoughts on Code

Hi,

On Wed, Dec 3, 2008 at 6:03 AM, Josh Berkus <josh@agliodbs.com> wrote:

Breaking down of patch into sections works very well for review. Should
allow us to get different reviewers on different parts of the code -
review wranglers please take note: Dave, Josh.

Fujii-san, could you break the patch up into several parts? We have quite
a few junior reviewers who are idle right now.

Yes, I divided the patch into 9 pieces. Do I need to divide it further?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#9Josh Berkus
josh@agliodbs.com
In reply to: Fujii Masao (#8)
Re: Sync Rep: First Thoughts on Code

Fujii-san,

Yes, I divided the patch into 9 pieces. Do I need to divide it further?

That's plenty. Where do reviewers find the 9 pieces?

--
Josh Berkus
PostgreSQL
San Francisco

#10Fujii Masao
masao.fujii@gmail.com
In reply to: Josh Berkus (#9)
Re: Sync Rep: First Thoughts on Code

Hi,

On Wed, Dec 3, 2008 at 3:21 PM, Josh Berkus <josh@agliodbs.com> wrote:

Fujii-san,

Yes, I divided the patch into 9 pieces. Do I need to divide it further?

That's plenty. Where do reviewers find the 9 pieces?

The latest patch set (v4) is on the wiki.
http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#Patch_set

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#11Fujii Masao
masao.fujii@gmail.com
In reply to: Simon Riggs (#3)
Re: Sync Rep: First Thoughts on Code

Hello,

On Tue, Dec 2, 2008 at 10:09 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

The reaction to replication_timeout may need to be configurable. I might
not want to keep on processing if the information didn't reach the
standby.

OK. I will add new GUC variable (PGC_SIGHUP) to specify the reaction for
the timeout.

I would prefer in many cases that the transactions that were
waiting for walsender would abort, but the walsender kept processing.

Is it dangerous to abort the transaction with replication continued when
the timeout occurs? I think that the WAL consistency between two servers
might be broken. Because the WAL writing and sending are done concurrently,
and the backend might already write the WAL to disk on the primary when
waiting for walsender.

The issue I see is that we might want to keep wal_sender_delay small so
that transaction times are not increased. But we also want
wal_sender_delay high so that replication never breaks.

Are you assuming only the async case? In the sync case, since walsender is
woken by a signal from the backend, we don't need to keep the delay
so small. Also, wal_sender_delay is unrelated to the erroneous termination
of replication.
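
The wait described here (wake on a backend's signal, otherwise after wal_sender_delay) can be sketched like this. It is a Python illustration only: `wait_for_work` is a hypothetical name, and a `threading.Event` stands in for the signal a backend would send.

```python
import threading


def wait_for_work(wakeup, wal_sender_delay_s):
    """Sleep until a backend signals or wal_sender_delay elapses.

    Returns "signal" if woken by a backend, "timeout" otherwise.
    """
    signalled = wakeup.wait(timeout=wal_sender_delay_s)
    wakeup.clear()  # re-arm for the next round
    return "signal" if signalled else "timeout"
```

In the sync case the committing backend would call `wakeup.set()`, so the delay only bounds how long unsent WAL can sit when nobody signals.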

It seems better
to have the action on wal_sender_delay configurable if we have an
unsteady network (like the internet). Marcus made some comments on line
dropping that seem relevant here; we should listen to his experience.

OK, I will look for his comments. Please let me know which thread has
the comments if you know.

Hmmm, dangerous? Well assuming we're linking commits with replication
sends then it sounds it. We might end up committing to disk and then
deciding to abort instead. But remember we don't remove the xid from
procarray or mark the result in clog until the flush is over, so it is
possible. But I think we should discuss this in more detail when the
main patch is committed.

If the transaction is aborted while the backend is waiting for replication,
the transaction commit command returns a failure indication to the client.
But the transaction commit record might already be written on the primary
and the standby. As you say, it may not be dangerous as long as the primary
is alive. But when we recover the failed primary, the transaction is marked
as "success" in clog because of the commit record. Is that safe?

And in that case, the transaction is treated as "success" on the standby
and is visible to read-only queries. On the other hand, it's invisible on
the primary. Isn't that dangerous?
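
A toy model of the concern, assuming recovery derives transaction status purely from the WAL records it replays. The `replay` function and the tuple format are hypothetical simplifications, not PostgreSQL structures.

```python
def replay(wal_records):
    """Rebuild transaction status from WAL alone, as crash recovery would."""
    clog = {}
    for kind, xid in wal_records:
        if kind == "commit":
            clog[xid] = "committed"
        elif kind == "abort":
            clog[xid] = "aborted"
    return clog


# The commit record for xid 42 reached disk before the timeout fired,
# so the client was told the commit failed.
wal = [("commit", 42)]
told_client = "failed"
recovered = replay(wal)  # recovery only sees the commit record
```

Here `recovered[42]` ends up as "committed" even though the client was told the commit failed, which is the inconsistency in question.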

Do we need to worry about periodic
renegotiation of keys in be-secure.c?

What is "keys" you mean?

See the notes in that file for explanation.

Thanks! I will check it.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#12Fujii Masao
masao.fujii@gmail.com
In reply to: Jeff Davis (#4)
Re: Sync Rep: First Thoughts on Code

Hi,

On Wed, Dec 3, 2008 at 4:08 AM, Jeff Davis <pgsql@j-davis.com> wrote:

Even if that could be made safe, in the event of a real network failure,
you'd just wait the full timeout every transaction, because it still
thinks it's replicating.

If walsender detects a real network failure, the transaction doesn't need to
wait for the timeout. Configuring keepalive options would help walsender
detect it. Of course, keepalive on Linux might not work as expected, though.
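
For reference, enabling TCP keepalive on a socket looks roughly like this. It is a Python sketch, not the patch's code: `enable_keepalive` and its default values are assumptions, and the three tuning options are platform-specific, so the helper skips any that the platform does not expose.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalive; tune probe timing where the platform allows."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These options are platform-specific (all three exist on Linux);
    # skip any that this Python build does not define.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
```

With these example values a dead peer would be noticed after roughly idle_s + interval_s * probes seconds, so the settings need to be chosen with the replication timeout in mind.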

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#13Fujii Masao
masao.fujii@gmail.com
In reply to: Simon Riggs (#3)
Re: Sync Rep: First Thoughts on Code

Hi,

On Tue, Dec 2, 2008 at 10:09 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Tue, 2008-12-02 at 21:37 +0900, Fujii Masao wrote:

Thanks for taking many hours to review the code!!

On Mon, Dec 1, 2008 at 8:42 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Can you confirm that all the docs on the Wiki page are up to date? There
are a few minor discrepancies that make me think it isn't.

Documentation is ongoing. Sorry for my slow progress.

BTW, I'm going to add and change the sgml files listed on wiki.
http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#Documentation_Plan

I'm patient, I know it takes time. Happy to spend hours on the review,
but I want to do that knowing I agree with the higher level features and
architecture first.

Since I thought that figures would be more intelligible for some people
than my poor English, I illustrated the architecture first.
http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#Detailed_Design

Are there any other parts which should be illustrated for review?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#14Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#13)
Re: Sync Rep: First Thoughts on Code

On Wed, 2008-12-03 at 21:37 +0900, Fujii Masao wrote:

Since I thought that the figure was more intelligible for some people
than my poor English, I illustrated the architecture first.
http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#Detailed_Design

Are there any other parts which should be illustrated for review?

Those are very useful, thanks.

Some questions to check my understanding (expected answers in brackets)

* Diagram on p.2 has two Archives. We have just one (yes)

* We send data continuously, whether or not we are in sync/async? (yes)
So the only difference between sync/async is whether we wait when we
flush the commit? (yes)

* If we have synchronous_commit = off do we ignore
synchronous_replication = on (yes)

* If two transactions commit almost simultaneously and one is sync and
the other async then only the sync backend will wait? (Yes)

Do we definitely need the archiver to move the files written by
walreceiver to archive and then move them back out again? Seems like we
can streamline that part in many (all?) cases.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#15Fujii Masao
masao.fujii@gmail.com
In reply to: Simon Riggs (#14)
Re: Sync Rep: First Thoughts on Code

Hi,

On Wed, Dec 3, 2008 at 11:33 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I'm patient, I know it takes time. Happy to spend hours on the review,
but I want to do that knowing I agree with the higher level features and
architecture first.

I wrote up the features and restrictions of Synch Rep. Please check
them together with the architecture figures.
http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#User_Overview

Some questions to check my understanding (expected answers in brackets)

* Diagram on p.2 has two Archives. We have just one (yes)

No, we need an archive on both the primary and the standby. The primary needs
an archive because a base backup is required when starting the standby.
Meanwhile, the standby needs an archive in order to cooperate with pg_standby.

If the directory that pg_standby checks is the same as the directory
where walreceiver writes the WAL, a partially written WAL file might be
restored by pg_standby, and continuous recovery would fail. So we have
to separate the directories, and I assigned pg_xlog and archive to them.

Another idea: walreceiver writes the WAL to a file with a temporary name
and renames it to the formal name when it fills. That way, pg_standby doesn't
restore a partially written WAL file. But failover becomes more difficult
because the unrenamed WAL file remains.
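
The temporary-name idea is the usual write-then-rename pattern. A minimal sketch in Python, where `write_segment` and the `.partial` suffix are hypothetical and not taken from the patch:

```python
import os


def write_segment(directory, segment_name, chunks):
    """Write WAL chunks under a temporary name, then rename once complete."""
    tmp_path = os.path.join(directory, segment_name + ".partial")  # hypothetical suffix
    final_path = os.path.join(directory, segment_name)
    with open(tmp_path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())         # make the contents durable before publishing
    os.rename(tmp_path, final_path)  # atomic on POSIX within one filesystem
    return final_path
```

A reader that only looks for the formal name never sees a half-filled file. The drawback mentioned above also shows up here: if the writer stops before the rename, the newest data is left under the temporary name and failover has to deal with it.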

Do you have any other good ideas?

* We send data continuously, whether or not we are in sync/async? (yes)

Yes.

So the only difference between sync/async is whether we wait when we
flush the commit? (yes)

Yes.
Also, in the async case, the backend basically doesn't send the wakeup signal
to walsender.

* If we have synchronous_commit = off do we ignore
synchronous_replication = on (yes)

No, we can configure them independently. synchronous_commit covers
only local writing of the WAL. If synch_*commit* should cover both local
writing and replication, I'd like to add a new GUC which covers only local
writing (synchronous_local_write?).

* If two transactions commit almost simultaneously and one is sync and
the other async then only the sync backend will wait? (Yes)

Yes.

Do we definitely need the archiver to move the files written by
walreceiver to archive and then move them back out again?

Yes, it's because of cooperating with pg_standby.

Seems like we
can streamline that part in many (all?) cases.

Agreed. But I thought that such streamlining was a TODO for next time.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#16Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#11)
Re: Sync Rep: First Thoughts on Code

Hi,

On Wed, Dec 3, 2008 at 3:38 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

Do we need to worry about periodic
renegotiation of keys in be-secure.c?

What is "keys" you mean?

See the notes in that file for explanation.

Thanks! I would check it.

The key is used only when we use SSL for the replication
connection. As far as I can tell, secure_write() renegotiates
the key if needed. Since walsender calls secure_write() when
sending the WAL to the standby, the key is renegotiated
periodically. So I think that we don't need to worry about the
key becoming obsolete. Am I missing something?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#17Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#16)
Re: Sync Rep: First Thoughts on Code

On Thu, 2008-12-04 at 17:57 +0900, Fujii Masao wrote:

On Wed, Dec 3, 2008 at 3:38 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

Do we need to worry about periodic
renegotiation of keys in be-secure.c?

What is "keys" you mean?

See the notes in that file for explanation.

Thanks! I would check it.

The key is used only when we use SSL for the connection of
replication. As far as I examined, secure_write() renegotiates
the key if needed. Since walsender calls secure_write() when
sending the WAL to the standby, the key is renegotiated
periodically. So, I think that we don't need to worry about the
obsolescence of the key.

Understood. Is the periodic renegotiation of keys something that would
interfere with the performance or robustness of replication? Is the
delay likely to affect sync rep? I'm just checking we've thought about
it.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#18Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#15)
Re: Sync Rep: First Thoughts on Code

On Thu, 2008-12-04 at 16:10 +0900, Fujii Masao wrote:

* Diagram on p.2 has two Archives. We have just one (yes)

No, we need archive in both the primary and standby. The primary needs
archive because a base backup is required when starting the standby.
Meanwhile, the standby needs archive for cooperating with pg_standby.

If the directory where pg_standby checks is the same as the directory
where walreceiver writes the WAL, the halfway WAL file might be
restored by pg_standby, and continuous recovery would fail. So, we have
to separate the directories, and I assigned pg_xlog and archive to them.

Another idea; walreceiver writes the WAL to the file with temporary name,
and rename it to the formal name when it fills. So, pg_standby doesn't
restore a halfway WAL file. But it's more difficult to perform the failover
because the unrenamed WAL file remains.

WAL sending is either via archiver or via streaming. We must switch
cleanly from one mode to the other and not half-way through a WAL file.

When WAL sending is about to begin, issue an xlog switch. Then tell the
archiver to shut down once it has got to the last file. All files after
that point are streamed. So there need be no conflict in filenames.

We must avoid having two archives, because people will configure this
incorrectly.

* If we have synchronous_commit = off do we ignore
synchronous_replication = on (yes)

No, we can configure them independently. synchronous_commit covers
only local writing of the WAL. If synch_*commit* should cover both local
writing and replication, I'd like to add new GUC which covers only local
writing (synchronous_local_write?).

The only sensible settings are
synchronous_commit = on, synchronous_replication = on
synchronous_commit = on, synchronous_replication = off
synchronous_commit = off, synchronous_replication = off

This doesn't make any sense: (does it??)
synchronous_commit = off, synchronous_replication = on
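
For reference, the four combinations can be enumerated with a short Python
sketch. It encodes only the list above, which is one side of an open question
in this thread; the label "questioned" and the function names are assumptions
made for the example.

```python
from itertools import product

# The three combinations listed above as sensible, as
# (synchronous_commit, synchronous_replication) pairs.
SENSIBLE = {("on", "on"), ("on", "off"), ("off", "off")}


def classify(synchronous_commit, synchronous_replication):
    """Label a combination according to the list above."""
    pair = (synchronous_commit, synchronous_replication)
    return "sensible" if pair in SENSIBLE else "questioned"


def all_combinations():
    """Map every on/off pair to its label."""
    return {pair: classify(*pair) for pair in product(("on", "off"), repeat=2)}
```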

Do we definitely need the archiver to move the files written by
walreceiver to archive and then move them back out again?

Yes, that is needed to cooperate with pg_standby.

It seems very easy to make this happen the way we want. We could make
pg_standby look into pg_xlog also, for example.

I was expecting you to have walreceiver and startup share an end of WAL
address via shared memory, so that startup never tries to read past end.
That way we would be able to begin reading a WAL file *before* it was
filled. Waiting until a file fills means we still have to have
archive_timeout set to ensure we switch regularly.
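
A minimal Python sketch of that shared end-of-WAL idea follows. Threads and a
lock stand in for separate processes and shared memory, and every name
(`SharedWalEnd`, `receiver_write`, `startup_read`) is hypothetical; it shows
only the "never read past the published end" rule.

```python
import threading


class SharedWalEnd:
    """Stand-in for a shared-memory 'end of received WAL' position."""

    def __init__(self):
        self._lock = threading.Lock()
        self._end = 0

    def advance(self, new_end):
        # The published end only ever moves forward.
        with self._lock:
            if new_end > self._end:
                self._end = new_end

    def get(self):
        with self._lock:
            return self._end


class WalFile:
    def __init__(self, size):
        self.buf = bytearray(size)  # preallocated, like a segment not yet filled


def receiver_write(wal, shared, offset, data):
    """Receiver side: store the bytes first, then publish the new end."""
    if offset + len(data) > len(wal.buf):
        raise ValueError("write would run past the end of the segment")
    wal.buf[offset:offset + len(data)] = data
    shared.advance(offset + len(data))


def startup_read(wal, shared, replayed_upto):
    """Startup side: read from the last replayed position up to the published end."""
    end = shared.get()
    return bytes(wal.buf[replayed_upto:end]), end
```

With this shape the reader can consume a segment while it is still being
filled, since it is bounded by the published position rather than by the file
being complete.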

We need the existing mechanisms for the start of replication (base
backup, etc.) but we don't need them after that point.


#19Magnus Hagander
magnus@hagander.net
In reply to: Simon Riggs (#17)
Re: Sync Rep: First Thoughts on Code

Simon Riggs wrote:

On Thu, 2008-12-04 at 17:57 +0900, Fujii Masao wrote:

On Wed, Dec 3, 2008 at 3:38 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

Do we need to worry about periodic
renegotiation of keys in be-secure.c?

Which "keys" do you mean?

See the notes in that file for explanation.

Thanks! I will check it.

The key is used only when we use SSL for the replication connection.
As far as I could tell, secure_write() renegotiates the key if needed.
Since walsender calls secure_write() when sending the WAL to the
standby, the key is renegotiated periodically. So I think that we
don't need to worry about the obsolescence of the key.

Understood. Is the periodic renegotiation of keys something that would
interfere with the performance or robustness of replication? Is the
delay likely to affect sync rep? I'm just checking we've thought about
it.

It will certainly add an extra piece of delay. But if you are worried
about performance for it, you are likely not running SSL. Plus, if you
don't renegotiate the key, you gamble with security.

If it does have a negative effect on the robustness of the replication,
we should just recommend against using it - or refuse to use - not
disable renegotiation.

/Magnus

#20Simon Riggs
simon@2ndQuadrant.com
In reply to: Magnus Hagander (#19)
Re: Sync Rep: First Thoughts on Code

On Thu, 2008-12-04 at 12:41 +0100, Magnus Hagander wrote:

Understood. Is the periodic renegotiation of keys something that would
interfere with the performance or robustness of replication? Is the
delay likely to affect sync rep? I'm just checking we've thought about
it.

It will certainly add an extra piece of delay. But if you are worried
about performance for it, you are likely not running SSL. Plus, if you
don't renegotiate the key, you gamble with security.

If it does have a negative effect on the robustness of the replication,
we should just recommend against using it - or refuse to use - not
disable renegotiation.

I didn't mean to imply renegotiation might be optional. I just wanted to
check whether there is anything to worry about as a result of it; there
may not be. *If* it took a long time, I would not want sync commits to
wait for it.

