pg_rewind fails on Windows where tablespaces are used

Started by Chris Travers, almost 2 years ago, 18 messages, bugs
#1 Chris Travers
chris.travers@stormatics.tech

Hi,

Setup is PostgreSQL on Windows with a tablespace on a separate drive. When
I go to run pg_rewind it consistently fails with the following error:

pg_rewind: servers diverged at WAL location 39B/7EC6F60 on timeline 2
pg_rewind: rewinding from last common checkpoint at 39B/7E8E3F8 on timeline 2
pg_rewind: error: file "pg_tblspc/34244696" is of different type in source and target

The file is confirmed to be a JUNCTION to the correct location on both the
source and target. So the error looks like a problem interacting with
Windows and detecting JUNCTION types in this case.

I came across the following which looks like it would fix this problem but
don't have a proper build environment. Please consider backporting the fix
at least as far as Postgres 15 as this bug fix does apply to non-in-place
tablespaces on Windows. The thread is
https://postgrespro.com/list/thread-id/2657122

Best Regards,
Chris Travers

#2 Michael Paquier
michael@paquier.xyz
In reply to: Chris Travers (#1)
Re: pg_rewind fails on Windows where tablespaces are used

On Wed, May 08, 2024 at 03:02:21PM +0700, Chris Travers wrote:

Setup is PostgreSQL on Windows with a tablespace on a separate drive. When
I go to run pg_rewind it consistently fails with the following error:

(Chris poked me regarding this issue last week in Vancouver.)

pg_rewind: servers diverged at WAL location 39B/7EC6F60 on timeline 2
pg_rewind: rewinding from last common checkpoint at 39B/7E8E3F8 on timeline 2
pg_rewind: error: file "pg_tblspc/34244696" is of different type in source and target

The file is confirmed to be a JUNCTION to the correct location on both the
source and target. So the error looks like a problem interacting with
Windows and detecting JUNCTION types in this case.

I am not completely sure I follow here. Aren't you making use of an
in-place tablespace here? Could you provide more details about the
structure of the data folders, because these are on separate hosts,
right? When rewinding from a live server, readlink() returns an
absolute path for a junction point, meaning that the result would not
be influenced by bf227926d22b as we would always handle such an entry
with FILE_TYPE_SYMLINK. On Windows, the link creation would be
covered by pgsymlink(), which would create the link as a junction
point.
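
To make the check behind the reported error concrete, here is a minimal Python sketch (not pg_rewind's actual C code). It assumes each side classifies a path into a coarse file type and that the run aborts when the two sides disagree. Only FILE_TYPE_SYMLINK is named in this thread; `FileType`, `classify_mode` and `check_same_type` are hypothetical names, and the scenario at the end (one side seeing a link, the other a plain directory) is one plausible way such a mismatch could arise, not a confirmed diagnosis.

```python
import enum
import stat


class FileType(enum.Enum):
    # Only FILE_TYPE_SYMLINK is named in the thread; the other labels are
    # illustrative assumptions.
    REGULAR = "FILE_TYPE_REGULAR"
    DIRECTORY = "FILE_TYPE_DIRECTORY"
    SYMLINK = "FILE_TYPE_SYMLINK"


def classify_mode(mode: int) -> FileType:
    """Map an lstat()-style st_mode value to a coarse file type."""
    if stat.S_ISLNK(mode):
        return FileType.SYMLINK
    if stat.S_ISDIR(mode):
        return FileType.DIRECTORY
    if stat.S_ISREG(mode):
        return FileType.REGULAR
    raise ValueError(f"unsupported file mode: {mode:o}")


def check_same_type(path: str, source: FileType, target: FileType) -> None:
    """Raise if the two sides disagree, using the wording of the reported error."""
    if source is not target:
        raise RuntimeError(
            f'file "{path}" is of different type in source and target'
        )


# Hypothetical scenario: the source reports the junction as a link, while the
# target's lstat() emulation reports the same entry as a plain directory.
source_type = classify_mode(stat.S_IFLNK | 0o777)
target_type = classify_mode(stat.S_IFDIR | 0o755)

try:
    check_same_type("pg_tblspc/34244696", source_type, target_type)
except RuntimeError as err:
    print(f"pg_rewind: error: {err}")
```

The point of the sketch is that the comparison itself is trivial: if the two sides classify the same junction differently, the check fails even though both junctions point at the correct location.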

Note that I do not object to a backpatch of bf227926d22b. I held off
only for the sake of caution, as in-place tablespaces are a developer
feature. If you use it for tests of your own on stable branches,
well, why not.

I came across the following which looks like it would fix this problem but
don't have a proper build environment. Please consider backporting the fix
at least as far as Postgres 15 as this bug fix does apply to non-in-place
tablespaces on Windows. The thread is
https://postgrespro.com/list/thread-id/2657122

I'd suggest using the postgresql.org reference. This refers to
commit bf227926d22b, for the following thread:
/messages/by-id/2b79d2a8-b2d5-4bd7-a15b-31e485100980.xiyuan.zr@alibaba-inc.com

Thanks,
--
Michael

#3 Andrew Dunstan
andrew@dunslane.net
In reply to: Michael Paquier (#2)
Re: pg_rewind fails on Windows where tablespaces are used

On 2024-06-04 Tu 12:53 AM, Michael Paquier wrote:

On Wed, May 08, 2024 at 03:02:21PM +0700, Chris Travers wrote:

Setup is PostgreSQL on Windows with a tablespace on a separate drive. When
I go to run pg_rewind it consistently fails with the following error:

(Chris poked me regarding this issue last week in Vancouver.)

pg_rewind: servers diverged at WAL location 39B/7EC6F60 on timeline 2
pg_rewind: rewinding from last common checkpoint at 39B/7E8E3F8 on timeline 2
pg_rewind: error: file "pg_tblspc/34244696" is of different type in source and target

The file is confirmed to be a JUNCTION to the correct location on both the
source and target. So the error looks like a problem interacting with
Windows and detecting JUNCTION types in this case.

An EDB customer has encountered the same issue. They are not using
in-place tablespaces.

I have reproduced this problem on release 15, not using an in-place
tablespace.

The solution I came up with was to backpatch commits c5cb8f3b, 387803d8
and 5fc88c5d53.

I don't think we need to do anything relating to in-place tablespaces.
These are documented as a developer only option and not for production.

The only question in my mind is whether those patches should be
backpatched. It's a couple of hundred lines, and I think it's safe, but
I'd welcome other opinions. If we are going to backpatch them we should
also look at adding tests for use of tablespaces with pg_rewind on the
back branches. Ideally we'd get this done in time for the next
maintenance release.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#4 Michael Paquier
michael@paquier.xyz
In reply to: Andrew Dunstan (#3)
Re: pg_rewind fails on Windows where tablespaces are used

On Tue, Jul 09, 2024 at 12:01:17PM -0400, Andrew Dunstan wrote:

The solution I came up with was to backpatch commits c5cb8f3b, 387803d8 and
5fc88c5d53.

The lstat() wrapper for Windows, noted.

I don't think we need to do anything relating to in-place tablespaces. These
are documented as a developer only option and not for production.

Okay, cool.

The only question in my mind is whether those patches should be
backpatched.

It's a couple of hundred lines, and I think it's safe, but I'd welcome other
opinions. If we are going to backpatch them we should also look at adding
tests for use of tablespaces with pg_rewind on the back branches.
Ideally we'd get this done in time for the next maintenance release.

Seeing that the commits all go down to v16, meaning that these have
brewed across 3 minor releases already, I'd like to assume that we
would have already heard about problems related to them. So that
seems like a rather safe thing to do at this stage.
--
Michael

#5 Chris Travers
chris.travers@stormatics.tech
In reply to: Michael Paquier (#4)
Re: pg_rewind fails on Windows where tablespaces are used

Sorry for the late reply. I had apparently buried it.

On Wed, Jul 10, 2024 at 6:12 AM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Jul 09, 2024 at 12:01:17PM -0400, Andrew Dunstan wrote:

The solution I came up with was to backpatch commits c5cb8f3b, 387803d8
and 5fc88c5d53.

The lstat() wrapper for Windows, noted.

I don't think we need to do anything relating to in-place tablespaces.
These are documented as a developer only option and not for production.

Okay, cool.

Yeah, in-place tablespaces are not used. They did have to be briefly
enabled due to another issue, probably with the same wrapper, but they were
never used. They were disabled again shortly after.

The only question in my mind is whether those patches should be
backpatched.

It's a couple of hundred lines, and I think it's safe, but I'd welcome
other opinions. If we are going to backpatch them we should also look at
adding tests for use of tablespaces with pg_rewind on the back branches.
Ideally we'd get this done in time for the next maintenance release.

Seeing that the commits all go down to v16, meaning that these have
brewed across 3 minor releases already, I'd like to assume that we
would have already heard about problems related to them. So that
seems like a rather safe thing to do at this stage.

Just as some added context, I have noticed that manually moving a
tablespace and creating the junction with mklink /d sometimes causes
PostgreSQL to decide this must be an in-place tablespace even though dir
clearly shows it as a junction. I don't have the resources to determine if
this is limited to some builds of Windows or patch levels, but the problem
goes away after a pg_basebackup rebuild, so I don't think it is urgent.
This is more of a note that there seem to be some issues in this area on
Windows, at least in 15. I don't know if that affects the discussion but
it is worth noting.

Michael

#6 Alexandra Wang
alexandra.wang.oss@gmail.com
In reply to: Michael Paquier (#4)
Re: pg_rewind fails on Windows where tablespaces are used

Hi,

I’ve attached fixes in 4 tarballs for branches 12 to 15.

On Wed, Oct 23, 2024 at 9:38 AM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Jul 09, 2024 at 12:01:17PM -0400, Andrew Dunstan wrote:

The solution I came up with was to backpatch commits c5cb8f3b, 387803d8 and
5fc88c5d53.

The lstat() wrapper for Windows, noted.

In addition to the three commits from Andrew’s solution, I also
backpatched commit f357233c. Without this commit, DROP TABLESPACE for
a non-in-place tablespace would fail with the following error:

drop tablespace regress_tblspacewith;
ERROR: 42501: could not remove symbolic link "pg_tblspc/83531": Permission denied
LOCATION: destroy_tablespace_directories, tablespace.c:829

I encountered this issue while working on the fix for branch 14 and
running the tablespace regress test. This simple test is not covered
in branch 15’s regress tests, as we started setting
allow_in_place_tablespaces = true since commit d6d317db.

The only question in my mind is whether those patches should be
backpatched.

It's a couple of hundred lines, and I think it's safe, but I'd welcome other
opinions. If we are going to backpatch them we should also look at adding
tests for use of tablespaces with pg_rewind on the back branches.
Ideally we'd get this done in time for the next maintenance release.

Seeing that the commits all go down to v16, meaning that these have
brewed across 3 minor releases already, I'd like to assume that we
would have already heard about problems related to them. So that
seems like a rather safe thing to do at this stage.

I also had to backpatch additional commits for branches 12 to 14, as
follows:

branch 14: e2f0f8ed, af9e6331, and the commits for branch 15
[f357233c, c5cb8f3b, 387803d8, and 5fc88c5d53].
branches 12 & 13: bed90759, 54fb8c7d, de8feb1f, 101c37cd, and the
commits for branch 14.
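
Since each older branch needs everything the newer branch needs plus extras, the cumulative sets can be spelled out with a small Python sketch. The commit hashes are exactly those listed above; the function name `commits_for` and the list ordering are assumptions for illustration only and do not represent an application order.

```python
# Commit hashes as listed in this message; list order is NOT an apply order.
BRANCH_15 = ["f357233c", "c5cb8f3b", "387803d8", "5fc88c5d53"]
BRANCH_14_EXTRA = ["e2f0f8ed", "af9e6331"]
BRANCH_12_13_EXTRA = ["bed90759", "54fb8c7d", "de8feb1f", "101c37cd"]


def commits_for(branch: int) -> list:
    """Return the full set of commits proposed for a given back branch."""
    if branch == 15:
        return list(BRANCH_15)
    if branch == 14:
        return BRANCH_14_EXTRA + BRANCH_15
    if branch in (12, 13):
        return BRANCH_12_13_EXTRA + BRANCH_14_EXTRA + BRANCH_15
    raise ValueError(f"no backpatch set described for branch {branch}")


for branch in (15, 14, 13, 12):
    commits = commits_for(branch)
    print(f"branch {branch}: {len(commits)} commits: {', '.join(commits)}")
```

This makes the growth visible: four commits for branch 15, six for branch 14, and ten for branches 12 and 13.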

With these additional commits for branches 12 to 14, I’m not sure if
it’s worth backpatching, or should we backpatch only to branch 15?

I agree with Andrew that if we decide to backpatch these changes, we
should add tests for the pg_rewind issue. Would the pg_rewind TAP
tests be the place for them?

Best,
Alex

Attachments:

tsfixes15.tgz (application/gzip)
tsfixes14.tgz (application/gzip)
tsfixes13.tgz (application/gzip)
tsfixes12.tgz (application/gzip)

#7 Michael Paquier
michael@paquier.xyz
In reply to: Alexandra Wang (#6)
Re: pg_rewind fails on Windows where tablespaces are used

On Wed, Oct 23, 2024 at 11:19:14AM -0500, Alexandra Wang wrote:

I encountered this issue while working on the fix for branch 14 and
running the tablespace regress test. This simple test is not covered
in branch 15’s regress tests, as we started setting
allow_in_place_tablespaces = true since commit d6d317db.

Yes, for the reasons stated in this commit because we rely on
everything to be on the same host, and tablespace paths would overlap
across the primary and its replica.

I also had to backpatch additional commits for branches 12 to 14, as
follows:

branch 14: e2f0f8ed, af9e6331, and the commits for branch 15
[f357233c, c5cb8f3b, 387803d8, and 5fc88c5d53].
branches 12 & 13: bed90759, 54fb8c7d, de8feb1f, 101c37cd, and the
commits for branch 14.

With these additional commits for branches 12 to 14, I’m not sure if
it’s worth backpatching, or should we backpatch only to branch 15?

12 is going to be EOL in a couple of days, so I'd rather leave it out.
If it were down to me, I'd leave out 13 and 14 as well, based on
e2f0f8ed25 to let the beast sleep there. Perhaps others have a
different opinion, though.

I agree with Andrew that if we decide to backpatch these changes, we
should add tests for the pg_rewind issue. Would the pg_rewind TAP
tests be the place for them?

Looking at your patch set, it strikes me as not really necessary to
involve multiple nodes to check the lstat() emulation, the failure
originating not really from pg_rewind, but from the contents of
src/port/. Your DROP TABLESPACE case is telling us that, at least, so
we could use a single node with tablespace manipulations to get
basically to the same coverage? Perhaps pg_checksums is one piece to
look at, as it relies on a single node. And that would be less costly.
--
Michael

#8 Andrew Dunstan
andrew@dunslane.net
In reply to: Michael Paquier (#7)
Re: pg_rewind fails on Windows where tablespaces are used

On 2024-10-23 We 7:03 PM, Michael Paquier wrote:

On Wed, Oct 23, 2024 at 11:19:14AM -0500, Alexandra Wang wrote:

I encountered this issue while working on the fix for branch 14 and
running the tablespace regress test. This simple test is not covered
in branch 15’s regress tests, as we started setting
allow_in_place_tablespaces = true since commit d6d317db.

Yes, for the reasons stated in this commit because we rely on
everything to be on the same host, and tablespace paths would overlap
across the primary and its replica.

I also had to backpatch additional commits for branches 12 to 14, as
follows:

branch 14: e2f0f8ed, af9e6331, and the commits for branch 15
[f357233c, c5cb8f3b, 387803d8, and 5fc88c5d53].
branches 12 & 13: bed90759, 54fb8c7d, de8feb1f, 101c37cd, and the
commits for branch 14.

With these additional commits for branches 12 to 14, I’m not sure if
it’s worth backpatching, or should we backpatch only to branch 15?

12 is going to be EOL in a couple of days, so I'd rather leave it out.
If it were down to me, I'd leave out 13 and 14 as well, based on
e2f0f8ed25 to let the beast sleep there. Perhaps others have a
different opinion, though.

Well, it seems like it's clearly a bug. I'm never happy leaving bugs
unfixed. As for 12, what's the point of putting out one last release if
it's not to fix bugs?

EDB's customer will probably be happy if we just fix 15, but I would
rather take a broader view and fix it for other possible users too.

I'm traveling for a few days but it was my intention to work on these
when I am back.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#9 Michael Paquier
michael@paquier.xyz
In reply to: Andrew Dunstan (#8)
Re: pg_rewind fails on Windows where tablespaces are used

On Mon, Oct 28, 2024 at 03:00:12PM -0400, Andrew Dunstan wrote:

Well, it seems like it's clearly a bug. I'm never happy leaving bugs
unfixed. As for 12, what's the point of putting out one last release if it's
not to fix bugs?

There is always a risk of breaking something that worked previously,
and we would be out of options to address these once the branch is
EOL'd. The risk/reward ratio for v12 is really different, so I'd
advise some caution particularly with this area of the code.

EDB's customer will probably be happy if we just fix 15, but I would rather
take a broader view and fix it for other possible users too.

I'm traveling for a few days but it was my intention to work on these when I
am back.

Cool. Thanks!
--
Michael

#10 Andrew Dunstan
andrew@dunslane.net
In reply to: Michael Paquier (#9)
Re: pg_rewind fails on Windows where tablespaces are used

On Tue, Oct 29, 2024 at 9:48 AM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Oct 28, 2024 at 03:00:12PM -0400, Andrew Dunstan wrote:

Well, it seems like it's clearly a bug. I'm never happy leaving bugs
unfixed. As for 12, what's the point of putting out one last release if
it's not to fix bugs?

There is always a risk of breaking something that worked previously,
and we would be out of options to address these once the branch is
EOL'd. The risk/reward ratio for v12 is really different, so I'd
advise some caution particularly with this area of the code.

EDB's customer will probably be happy if we just fix 15, but I would
rather take a broader view and fix it for other possible users too.

I'm traveling for a few days but it was my intention to work on these
when I am back.

I didn't push the fixes for release 12, like you requested, but I pushed
the rest.

Unfortunately for various reasons I don't currently have a Windows test
instance, but these were previously tested by Alexandra and me so I'm
pretty confident they will be ok.

When I get my hands on a Windows machine again I will work on adding some
Windows tests for pg_rewind with tablespaces.

cheers

andrew

#11 Michael Paquier
michael@paquier.xyz
In reply to: Andrew Dunstan (#10)
Re: pg_rewind fails on Windows where tablespaces are used

On Fri, Nov 08, 2024 at 09:53:40AM +1030, Andrew Dunstan wrote:

I didn't push the fixes for release 12, like you requested, but I pushed
the rest.

Unfortunately for various reasons I don't currently have a Windows test
instance, but these were previously tested by Alexandra and me so I'm
pretty confident they will be ok.

When I get my hands on a Windows machine again I will work on adding some
Windows tests for pg_rewind with tablespaces.

OK. Thanks for the update.
--
Michael

#12 Thomas Munro
thomas.munro@gmail.com
In reply to: Michael Paquier (#11)
Re: pg_rewind fails on Windows where tablespaces are used

It might be worth checking if 4517358e and f71007fb should be
back-patched too. There was a brief discussion[1], but no one with
Windows-testing capabilities was around and it didn't seem too
serious, and then there was the whole re-wrap and it seemed best to
keep out of the way of that. But now that the coast is clear...

[1]: /messages/by-id/CAD5tBcKnE3C1hycBYZYtYpNssQR_e+u2=CmDhGRFvDMEg3onRg@mail.gmail.com

#13 Andrew Dunstan
andrew@dunslane.net
In reply to: Thomas Munro (#12)
Re: pg_rewind fails on Windows where tablespaces are used

On 2024-11-26 Tu 8:58 PM, Thomas Munro wrote:

It might be worth checking if 4517358e and f71007fb should be
back-patched too. There was a brief discussion[1], but no one with
Windows-testing capabilities was around and it didn't seem too
serious, and then there was the whole re-wrap and it seemed best to
keep out of the way of that. But now that the coast is clear...

[1] /messages/by-id/CAD5tBcKnE3C1hycBYZYtYpNssQR_e+u2=CmDhGRFvDMEg3onRg@mail.gmail.com

Yes, it's on my TODO list. Have been waiting for a) the releases to
settle down b) getting availability of Windows resources again and c)
being back at my home location. All three are now met, so will work on it.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#14 Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#13)
Re: pg_rewind fails on Windows where tablespaces are used

On 2024-11-27 We 7:52 AM, Andrew Dunstan wrote:

On 2024-11-26 Tu 8:58 PM, Thomas Munro wrote:

It might be worth checking if 4517358e and f71007fb should be
back-patched too.  There was a brief discussion[1], but no one with
Windows-testing capabilities was around and it didn't seem too
serious, and then there was the whole re-wrap and it seemed best to
keep out of the way of that.  But now that the coast is clear...

[1]
/messages/by-id/CAD5tBcKnE3C1hycBYZYtYpNssQR_e+u2=CmDhGRFvDMEg3onRg@mail.gmail.com

Yes, it's on my TODO list. Have been waiting for a) the releases to
settle down b) getting availability of Windows resources again and c)
being back at my home location. All three are now met, so will work on
it.

Those patches didn't actually include any tests. I guess the best test
would be to create a chain of several junction points and then run
initdb on the leaf of the chain?

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#15 Thomas Munro
thomas.munro@gmail.com
In reply to: Andrew Dunstan (#14)
Re: pg_rewind fails on Windows where tablespaces are used

On Thu, Jan 9, 2025 at 3:45 AM Andrew Dunstan <andrew@dunslane.net> wrote:

Those patches didn't actually include any tests. I guess the best test
would be to create a chain of several junction points and then run
initdb on the leaf of the chain?

Yeah I think the three interesting cases were initdb when run under
junctions like these:

1. Volume GUID format: mklink /J foo \\?\Volume{12341234-1234...},
expected to break without patch
2. Chain: mklink /J C:\\aaa1 C:\\aaa2, mklink /J C:\\aaa2 C:\\aaa3,
expected to break without patch
3. Chain of length > 8, expected to fail with ELOOP once the patch is applied.

(Syntax may be off, I just googled it but don't have Windows to try).

The way to get decent tests for this stuff and all the rest of the
wrappers would probably be to develop this test suite further:

/messages/by-id/CA+hUKG+ajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN+icg@mail.gmail.com
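
As a rough illustration of cases 2 and 3, the following Python sketch models junction resolution over an in-memory mapping rather than a real filesystem, so it runs anywhere. It is not the code from the patches: `resolve_chain`, `make_chain` and `MAX_LINK_DEPTH` are hypothetical names, and the only behaviour taken from this message is that a chain longer than 8 is expected to fail with ELOOP.

```python
import errno
from typing import Dict

# Assumption taken from case 3: chains longer than 8 links must fail.
MAX_LINK_DEPTH = 8


def resolve_chain(links: Dict[str, str], start: str,
                  max_depth: int = MAX_LINK_DEPTH) -> str:
    """Follow link -> target entries until a non-link path is reached."""
    current = start
    hops = 0
    while current in links:
        if hops >= max_depth:
            # Mirror the ELOOP failure expected for over-long chains.
            raise OSError(errno.ELOOP, "too many levels of junction points", start)
        current = links[current]
        hops += 1
    return current


def make_chain(length: int) -> Dict[str, str]:
    """Build C:\\aaa1 -> C:\\aaa2 -> ... with the given number of links."""
    return {f"C:\\aaa{i}": f"C:\\aaa{i + 1}" for i in range(1, length + 1)}


# Case 2: a short chain resolves to its final directory.
print(resolve_chain(make_chain(2), "C:\\aaa1"))

# Case 3: a chain of nine links exceeds the limit and raises ELOOP.
try:
    resolve_chain(make_chain(9), "C:\\aaa1")
except OSError as err:
    print("failed with ELOOP:", err.errno == errno.ELOOP)
```

A real test would have to create the junctions with mklink /J on Windows and run initdb under them; the model above only shows the pass/fail boundary being described.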

#16 Andrew Dunstan
andrew@dunslane.net
In reply to: Thomas Munro (#15)
Re: pg_rewind fails on Windows where tablespaces are used

On 2025-01-08 We 10:38 PM, Thomas Munro wrote:

On Thu, Jan 9, 2025 at 3:45 AM Andrew Dunstan <andrew@dunslane.net> wrote:

Those patches didn't actually include any tests. I guess the best test
would be to create a chain of several junction points and then run
initdb on the leaf of the chain?

Yeah I think the three interesting cases were initdb when run under
junctions like these:

1. Volume GUID format: mklink /J foo \\?\Volume{12341234-1234...},
expected to break without patch
2. Chain: mklink /J C:\\aaa1 C:\\aaa2, mklink /J C:\\aaa2 C:\\aaa3,
expected to break without patch
3. Chain of length > 8, expected to fail with ELOOP once the patch is applied.

(Syntax may be off, I just googled it but don't have Windows to try).

The way to get decent tests for this stuff and all the rest of the
wrappers would probably be to develop this test suite further:

/messages/by-id/CA+hUKG+ajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN+icg@mail.gmail.com

I can confirm that these two patches apply cleanly to releases 13 .. 15,
that builds are also clean, and that tests 1..3 above pass/fail as
expected. Tested on a WS2019 instance.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#17 Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#16)
Re: pg_rewind fails on Windows where tablespaces are used

On 2025-01-14 Tu 4:12 PM, Andrew Dunstan wrote:

On 2025-01-08 We 10:38 PM, Thomas Munro wrote:

On Thu, Jan 9, 2025 at 3:45 AM Andrew Dunstan <andrew@dunslane.net>
wrote:

Those patches didn't actually include any tests. I guess the best test
would be to create a chain of several junction points and then run
initdb on the leaf of the chain?

Yeah I think the three interesting cases were initdb when run under
junctions like these:

1.  Volume GUID format: mklink /J foo \\?\Volume{12341234-1234...},
expected to break without patch
2.  Chain: mklink /J C:\\aaa1 C:\\aaa2, mklink /J C:\\aaa2 C:\\aaa3,
expected to break without patch
3.  Chain of length > 8, expected to fail with ELOOP once the patch
is applied.

(Syntax may be off, I just googled it but don't have Windows to try).

The way to get decent tests for this stuff and all the rest of the
wrappers would probably be to develop this test suite further:

/messages/by-id/CA+hUKG+ajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN+icg@mail.gmail.com

I can confirm that these two patches apply cleanly to releases 13 ..
15, that builds are also clean, and that tests 1..3 above pass/fail as
expected. Tested on a WS2019 instance.

I have pushed the backpatch of these two patches into release 13 .. 15.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#18 Michael Paquier
michael@paquier.xyz
In reply to: Andrew Dunstan (#17)
Re: pg_rewind fails on Windows where tablespaces are used

On Sat, Jan 18, 2025 at 09:37:07AM -0500, Andrew Dunstan wrote:

I have pushed the backpatch of these two patches into release 13 .. 15.

Just saw that, cool. Thanks!
--
Michael