pg_resetwal regression: could not upgrade after 1d863c2504

Started by Hayato Kuroda (Fujitsu) over 2 years ago · 2 messages · hackers
#1 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com

Dear hackers,
(CC: Peter Eisentraut - committer of the problematic commit)

While developing a pg_upgrade patch, I found a possible regression in pg_resetwal.
It appears to have been introduced by 1d863c2504.

Is it really a regression, or am I missing something?

# Phenomenon

pg_resetwal cannot be run with a relative data directory path. This works at 7273945,
but fails at 1d863.

At 1d863:

```
$ pg_resetwal -n data_N1/
pg_resetwal: error: could not read permissions of directory "data_N1/": No such file or directory
```

At 7273945:

```
$ pg_resetwal -n data_N1/
Current pg_control values:

pg_control version number: 1300
Catalog version number: 202309251
...
```

# Environment

The attached script was run on RHEL 7.9 with gcc 8.3.1.
I used the meson build system with the following options:

```
meson setup -Dcassert=true -Ddebug=true -Dc_args="-ggdb -O0 -g3 -fno-omit-frame-pointer"
```

# My analysis

I found that the following part of GetDataDirectoryCreatePerm() returns false,
and this is the cause of the error.

```
    /*
     * If an error occurs getting the mode then return false.  The caller is
     * responsible for generating an error, if appropriate, indicating that we
     * were unable to access the data directory.
     */
    if (stat(dataDir, &statBuf) == -1)
        return false;
```

I also found that DataDir in main() holds a relative path.
As a result, the subsequent stat() call cannot find the given location, because
the process has already changed its working directory into the data directory,
so the relative path is now resolved from inside that directory.

```
(gdb) break chdir
Breakpoint 1 at 0x4016f0
(gdb) run -n data_N1

...
Breakpoint 1, 0x00007ffff78e1390 in chdir () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-326.el7_9.x86_64
(gdb) print DataDir
$1 = 0x7fffffffe25c "data_N1"
(gdb) frame 1
#1 0x00000000004028d7 in main (argc=3, argv=0x7fffffffdf58) at ../postgres/src/bin/pg_resetwal/pg_resetwal.c:348
348 if (chdir(DataDir) < 0)
(gdb) print DataDir
$2 = 0x7fffffffe25c "data_N1"
```

# How to fix

One possible approach is to call chdir() more than once. Please see the attached patch.
(I'm not sure whether the commit should be reverted.)

# Appendix - How I found this

I originally hit this issue when running the attached script.
It creates two clusters and runs pg_upgrade, which failed with the following output.
(I have also attached the complete output; please see result_*.out.)

```
Performing Consistency Checks
-----------------------------
Checking cluster versions ok
pg_resetwal: error: could not read permissions of directory "data_N1": No such file or directory
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

test.sh (application/octet-stream)
result_7273945ca.out (application/octet-stream)
result_1d863c2504.out (application/octet-stream)
fix.patch (application/octet-stream, +4 -0)
#2 Peter Eisentraut
peter_e@gmx.net
In reply to: Hayato Kuroda (Fujitsu) (#1)
Re: pg_resetwal regression: could not upgrade after 1d863c2504

On 29.09.23 09:39, Hayato Kuroda (Fujitsu) wrote:

> pg_resetwal with relative path cannot be executed. It could be done at 7273945,
> but could not at 1d863.

Ok, I have reverted the offending patch.