pg_upgrade: make the locale comparison more tolerant

Started by Pavel Raiskupabout 12 years ago6 messages

praiskup@redhat.com

about 12 years ago

1 attachment(s)

Hello pg-hackers!

I tried to look at the problem resulting in changes mentioned here:
/messages/by-id/20121002155857.GE30089@momjian.us

If the system locale is changed e.g. from en_US.utf8 to en_US.utf-8 before
upgrading the data stack for newer server, pg_upgrade fails. It is quite
awkward to ask users to change locale.conf contents just for upgrade
purposes; is there other known workaround?

I attached patch aimed to remove this hurdle. After its application, the
in-place upgrade finished successfully for me under mentioned conditions,
though I am not 100% sure about its correctness.

Thanks, Pavel

Attachments:

0001-pg_upgrade-tolerate-more-locale-spelling.patchtext/x-patch; charset=UTF-8; name=0001-pg_upgrade-tolerate-more-locale-spelling.patchDownload

>From e54180b146edba871a9a4389f7cada0df1674587 Mon Sep 17 00:00:00 2001
From: Pavel Raiskup <praiskup@redhat.com>
Date: Sat, 21 Dec 2013 01:27:01 +0100
Subject: [PATCH] pg_upgrade: more tolerating locale comparison

Locale strings specified like 'cs_CZ.utf8' and 'cs_CZ.UTF-8'
should be treat as equivalent.
This patch switches from case-insensitive comparison of the whole
locale strings to more tolerant comparison:  It tries separately
compare the right side of the locale string (after the dot
character; the encoding part).  That string is firstly decoded by
pg_valid_client_encoding() from  pg_wchar.h and then are compared
corresponding codes.  Remaining part of locale string is case
insensitive string comparison.  If the encoding detection is not
possible (e.g. no encoding part of locale string), keep the
complete case insensitive comparison.

This resolves yet another complication of in-place upgrading.

---
 contrib/pg_upgrade/check.c | 37 ++++++++++++++++++++++++++++++++++---
 1 file changed, 34 insertions(+), 3 deletions(-)

diff --git a/contrib/pg_upgrade/check.c b/contrib/pg_upgrade/check.c
index 411689a..697a16a 100644
--- a/contrib/pg_upgrade/check.c
+++ b/contrib/pg_upgrade/check.c
@@ -9,6 +9,7 @@
 
 #include "postgres_fe.h"
 
+#include "mb/pg_wchar.h"
 #include "pg_upgrade.h"
 
 
@@ -393,6 +394,36 @@ set_locale_and_encoding(ClusterInfo *cluster)
 	PQfinish(conn);
 }
 
+/*
+ * equivalent_locale()
+ *
+ * Best effort locale comparison.  Return false if we are not 100% sure the
+ * locale is equivalent.
+ */
+static bool
+equivalent_locale(const char *loca, const char *locb)
+{
+	int enca, encb;
+	const char *chara, *charb;
+	int lencmp;
+
+	if (!(chara = strrchr(loca, '.')) ||
+		!(charb = strrchr(locb, '.')))
+		/* locale string does not contain encoding part */
+		return (pg_strcasecmp(loca, locb) == 0);
+
+	chara++;
+	charb++;
+
+	enca = pg_valid_server_encoding(chara);
+	encb = pg_valid_server_encoding(charb);
+
+	if (enca < 0 || encb < 0 || enca != encb)
+		return (pg_strcasecmp(loca, locb) == 0);
+
+	lencmp = chara - loca;
+	return (pg_strncasecmp(loca, locb, lencmp) == 0);
+}
 
 /*
  * check_locale_and_encoding()
@@ -409,13 +440,13 @@ check_locale_and_encoding(ControlData *oldctrl,
 	 * They also often use inconsistent hyphenation, which we cannot fix, e.g.
 	 * UTF-8 vs. UTF8, so at least we display the mismatching values.
 	 */
-	if (pg_strcasecmp(oldctrl->lc_collate, newctrl->lc_collate) != 0)
+	if (!equivalent_locale(oldctrl->lc_collate, newctrl->lc_collate))
 		pg_fatal("lc_collate cluster values do not match:  old \"%s\", new \"%s\"\n",
 			   oldctrl->lc_collate, newctrl->lc_collate);
-	if (pg_strcasecmp(oldctrl->lc_ctype, newctrl->lc_ctype) != 0)
+	if (!equivalent_locale(oldctrl->lc_ctype, newctrl->lc_ctype))
 		pg_fatal("lc_ctype cluster values do not match:  old \"%s\", new \"%s\"\n",
 			   oldctrl->lc_ctype, newctrl->lc_ctype);
-	if (pg_strcasecmp(oldctrl->encoding, newctrl->encoding) != 0)
+	if (!equivalent_locale(oldctrl->encoding, newctrl->encoding))
 		pg_fatal("encoding cluster values do not match:  old \"%s\", new \"%s\"\n",
 			   oldctrl->encoding, newctrl->encoding);
 }
-- 
1.8.4.2

Pavel Raiskup

praiskup@redhat.com

about 12 years ago

In reply to: Pavel Raiskup (#1)

1 attachment(s)

Re: pg_upgrade: make the locale comparison more tolerating

I made the patch to be in context format and I'll add this issue to
CommitFest. Also, some code comments and style was changed to meet better
readability.

Attachments:

0001-pg_upgrade-make-the-locale-comparison-more-toleratin.patchtext/x-patch; charset=UTF-8; name=0001-pg_upgrade-make-the-locale-comparison-more-toleratin.patchDownload

>From 40c6cb64f3dd974c9ee48baf0900f01f62f65291 Mon Sep 17 00:00:00 2001
From: Pavel Raiskup <praiskup@redhat.com>
Date: Sat, 21 Dec 2013 01:27:01 +0100
Subject: [PATCH] pg_upgrade: make the locale comparison more tolerating

Locale strings specified like 'cs_CZ.utf8' and 'cs_CZ.UTF-8'
should be treat as equivalent.  Absence of taking these as
equivalents caused fail during major server upgrade (when the
server machine has a little different encoding then the not yet
actualized data stack).  Workaround for that was changing the
system locale to match the previous locale string.

Applying of this commit makes the comparison to be done in two
phases.  Firstly is compared the encoding part of the locale
string (if any) and then the rest of string.  Before the encoding
part is compared, it is decoded into precisely defined code from
'enum pg_enc'.  This should make the comparison more stable even
for completely different spelling of encoding (e.g. 'latin2' and
'iso 8859-2').

References:
3356208.RhzgiJ6fXA@nb.usersys.redhat.com
20121002155857.GE30089@momjian.us
---
 contrib/pg_upgrade/check.c | 37 ++++++++++++++++++++++++++++++++++---
 1 file changed, 34 insertions(+), 3 deletions(-)

diff --git a/contrib/pg_upgrade/check.c b/contrib/pg_upgrade/check.c
new file mode 100644
index a706708..e3b7cda
*** a/contrib/pg_upgrade/check.c
--- b/contrib/pg_upgrade/check.c
***************
*** 9,14 ****
--- 9,15 ----
  
  #include "postgres_fe.h"
  
+ #include "mb/pg_wchar.h"
  #include "pg_upgrade.h"
  
  
*************** set_locale_and_encoding(ClusterInfo *clu
*** 393,398 ****
--- 394,429 ----
  	PQfinish(conn);
  }
  
+ /*
+  * equivalent_locale()
+  *
+  * Best effort locale comparison.  Return false if we are not 100% sure the
+  * locale is equivalent.
+  */
+ static bool
+ equivalent_locale(const char *loca, const char *locb)
+ {
+ 	int enca, encb;
+ 	int lencmp;
+ 
+ 	const char *chara = strrchr(loca, '.');
+ 	const char *charb = strrchr(locb, '.');
+ 	if (!chara || !charb)
+ 		/* not both locale strings do contain encoding part */
+ 		return (pg_strcasecmp(loca, locb) == 0);
+ 
+ 	chara++;
+ 	charb++;
+ 
+ 	enca = pg_valid_server_encoding(chara);
+ 	encb = pg_valid_server_encoding(charb);
+ 
+ 	if (enca < 0 || encb < 0 || enca != encb)
+ 		return (pg_strcasecmp(loca, locb) == 0);
+ 
+ 	lencmp = chara - loca;
+ 	return (pg_strncasecmp(loca, locb, lencmp) == 0);
+ }
  
  /*
   * check_locale_and_encoding()
*************** check_locale_and_encoding(ControlData *o
*** 409,421 ****
  	 * They also often use inconsistent hyphenation, which we cannot fix, e.g.
  	 * UTF-8 vs. UTF8, so at least we display the mismatching values.
  	 */
! 	if (pg_strcasecmp(oldctrl->lc_collate, newctrl->lc_collate) != 0)
  		pg_fatal("lc_collate cluster values do not match:  old \"%s\", new \"%s\"\n",
  			   oldctrl->lc_collate, newctrl->lc_collate);
! 	if (pg_strcasecmp(oldctrl->lc_ctype, newctrl->lc_ctype) != 0)
  		pg_fatal("lc_ctype cluster values do not match:  old \"%s\", new \"%s\"\n",
  			   oldctrl->lc_ctype, newctrl->lc_ctype);
! 	if (pg_strcasecmp(oldctrl->encoding, newctrl->encoding) != 0)
  		pg_fatal("encoding cluster values do not match:  old \"%s\", new \"%s\"\n",
  			   oldctrl->encoding, newctrl->encoding);
  }
--- 440,452 ----
  	 * They also often use inconsistent hyphenation, which we cannot fix, e.g.
  	 * UTF-8 vs. UTF8, so at least we display the mismatching values.
  	 */
! 	if (!equivalent_locale(oldctrl->lc_collate, newctrl->lc_collate))
  		pg_fatal("lc_collate cluster values do not match:  old \"%s\", new \"%s\"\n",
  			   oldctrl->lc_collate, newctrl->lc_collate);
! 	if (!equivalent_locale(oldctrl->lc_ctype, newctrl->lc_ctype))
  		pg_fatal("lc_ctype cluster values do not match:  old \"%s\", new \"%s\"\n",
  			   oldctrl->lc_ctype, newctrl->lc_ctype);
! 	if (!equivalent_locale(oldctrl->encoding, newctrl->encoding))
  		pg_fatal("encoding cluster values do not match:  old \"%s\", new \"%s\"\n",
  			   oldctrl->encoding, newctrl->encoding);
  }
-- 
1.8.4.2

Rushabh Lathia

rushabh.lathia@gmail.com

almost 12 years ago

In reply to: Pavel Raiskup (#2)

Re: pg_upgrade: make the locale comparison more tolerating

Hello,

I started reviewing the patch. Go through the original mail thread to
understand
the need of fix and the actual problem.

/messages/by-id/20121002155857.GE30089@momjian.us

Patch is using pg_valid_server_encoding() to compare the encoding which
looks
more correct. Did code walk through and it looks good to me. I tried to test
the patch on CentOS and its working fine. I am not quite knowledgeable
about other OS so on that perspective would good to have others view.

Here are the comment about the patch:

.) Patch gets cleanly apply on master ( using patch -p1 )
.) compilation done successful
.) Code walk through and logic looks good
.) Manual testing worked good for me

To test the issue I set the old database locale to en_US.utf8 and for new
database
locale to en_US.utf-8. WIth this when I tried pg_upgrade it failed with
"lc_collate cluster
values do not match: old "en_US.utf8", new "en_US.UTF-8". With the patch
pg_upgrade
running fine.

Regards,

Pavel Raiskup

praiskup@redhat.com

almost 12 years ago

In reply to: Pavel Raiskup (#1)

1 attachment(s)

Re: pg_upgrade: make the locale comparison more tolerating

Rushabh, really sorry I have to re-create the patch and thanks a
lot for looking at it!

Looking at the patch once again, I see that there were at least two
problems. Firstly, I used the equivalent_locale function also on the
encoding values. Even if that should not cause bugs (as it should result
in strncasecmp anyway), it was not pretty..

The second problem was assuming that the locale specifier "A" is not
longer then locale specifier B. Comparisons like 'en_US.utf8' with
'en_US_.utf8' would result in success. Bug resulting from this mistake is
not real probably but it is not nice anyway..

Rather cleaning the patch once more, attached,
Pavel

Attachments:

0001-pg_upgrade-make-the-locale-comparison-more-toleratin.patchtext/x-patch; charset=UTF-8; name=0001-pg_upgrade-make-the-locale-comparison-more-toleratin.patchDownload

>From 35b9f600b592db24bb0e25d168bc5955087d65df Mon Sep 17 00:00:00 2001
From: Pavel Raiskup <praiskup@redhat.com>
Date: Sat, 21 Dec 2013 01:27:01 +0100
Subject: [PATCH] pg_upgrade: make the locale comparison more tolerating

Locale strings specified like 'cs_CZ.utf8' and 'cs_CZ.UTF-8'
should be treat as equivalent.  Absence of taking these as
equivalents caused fail during major server upgrade (when the
server machine has a little different encoding then the not yet
actualized data stack).  Workaround for that was changing the
system locale to match the previous locale string.

Applying of this commit makes the comparison to be done in two
phases.  Firstly is compared the encoding part of the locale
string (if any) and then the rest of string.  Before the encoding
part is compared, it is decoded into precisely defined code from
'enum pg_enc'.  This should make the comparison more stable even
for completely different spelling of encoding (e.g. 'latin2' and
'iso 8859-2').

References:
3356208.RhzgiJ6fXA@nb.usersys.redhat.com
20121002155857.GE30089@momjian.us
---
 contrib/pg_upgrade/check.c | 58 +++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 55 insertions(+), 3 deletions(-)

diff --git a/contrib/pg_upgrade/check.c b/contrib/pg_upgrade/check.c
new file mode 100644
index a706708..2adefb2
*** a/contrib/pg_upgrade/check.c
--- b/contrib/pg_upgrade/check.c
***************
*** 9,14 ****
--- 9,15 ----
  
  #include "postgres_fe.h"
  
+ #include "mb/pg_wchar.h"
  #include "pg_upgrade.h"
  
  
*************** set_locale_and_encoding(ClusterInfo *clu
*** 393,398 ****
--- 394,450 ----
  	PQfinish(conn);
  }
  
+ /*
+  * equivalent_encoding()
+  *
+  * Best effort encoding comparison.  Return true only if the encodings both
+  * are correctly spelled and when they are equivalent.
+  */
+ static bool
+ equivalent_encoding(const char *chara, const char *charb)
+ {
+ 	int			enca = pg_valid_server_encoding(chara);
+ 	int			encb = pg_valid_server_encoding(charb);
+ 
+ 	if (enca < 0 || encb < 0)
+ 		return false;
+ 
+ 	return (enca == encb);
+ }
+ 
+ /*
+  * equivalent_locale()
+  *
+  * Best effort locale comparison.  Return false if we are not 100% sure the
+  * locale is equivalent.
+  */
+ static bool
+ equivalent_locale(const char *loca, const char *locb)
+ {
+ 	int			lencmp;
+ 	const char *chara = strrchr(loca, '.');
+ 	const char *charb = strrchr(locb, '.');
+ 
+ 	if (!chara || !charb)
+ 		/* not both locale strings do contain encoding part */
+ 		return (pg_strcasecmp(loca, locb) == 0);
+ 	chara++;
+ 	charb++;
+ 
+ 	if (!equivalent_encoding(chara, charb))
+ 		return false;
+ 
+ 	/*
+ 	 * We know the encoding part is equivalent.  So now compare only the
+ 	 * locale identifier (e.g. en_US part of en_US.utf8).
+ 	 */
+ 
+ 	lencmp = chara - loca;
+ 	if (lencmp != charb - locb)
+ 		return false;
+ 
+ 	return (pg_strncasecmp(loca, locb, lencmp) == 0);
+ }
  
  /*
   * check_locale_and_encoding()
*************** check_locale_and_encoding(ControlData *o
*** 409,421 ****
  	 * They also often use inconsistent hyphenation, which we cannot fix, e.g.
  	 * UTF-8 vs. UTF8, so at least we display the mismatching values.
  	 */
! 	if (pg_strcasecmp(oldctrl->lc_collate, newctrl->lc_collate) != 0)
  		pg_fatal("lc_collate cluster values do not match:  old \"%s\", new \"%s\"\n",
  			   oldctrl->lc_collate, newctrl->lc_collate);
! 	if (pg_strcasecmp(oldctrl->lc_ctype, newctrl->lc_ctype) != 0)
  		pg_fatal("lc_ctype cluster values do not match:  old \"%s\", new \"%s\"\n",
  			   oldctrl->lc_ctype, newctrl->lc_ctype);
! 	if (pg_strcasecmp(oldctrl->encoding, newctrl->encoding) != 0)
  		pg_fatal("encoding cluster values do not match:  old \"%s\", new \"%s\"\n",
  			   oldctrl->encoding, newctrl->encoding);
  }
--- 461,473 ----
  	 * They also often use inconsistent hyphenation, which we cannot fix, e.g.
  	 * UTF-8 vs. UTF8, so at least we display the mismatching values.
  	 */
! 	if (!equivalent_locale(oldctrl->lc_collate, newctrl->lc_collate))
  		pg_fatal("lc_collate cluster values do not match:  old \"%s\", new \"%s\"\n",
  			   oldctrl->lc_collate, newctrl->lc_collate);
! 	if (!equivalent_locale(oldctrl->lc_ctype, newctrl->lc_ctype))
  		pg_fatal("lc_ctype cluster values do not match:  old \"%s\", new \"%s\"\n",
  			   oldctrl->lc_ctype, newctrl->lc_ctype);
! 	if (!equivalent_encoding(oldctrl->encoding, newctrl->encoding))
  		pg_fatal("encoding cluster values do not match:  old \"%s\", new \"%s\"\n",
  			   oldctrl->encoding, newctrl->encoding);
  }
-- 
1.8.5.3

Rushabh Lathia

rushabh.lathia@gmail.com

almost 12 years ago

In reply to: Pavel Raiskup (#4)

Re: pg_upgrade: make the locale comparison more tolerating

Hi Pavel,

I looked at your latest cleanup patch. Yes its looks more cleaner in term
equivalent_locale & equivalent_encoding separate functions.

I did run the test again on latest patch and all looks good.

Marking it as ready for committer.

Regards,

On Fri, Jan 24, 2014 at 9:49 PM, Pavel Raiskup <praiskup@redhat.com> wrote:

Rushabh, really sorry I have to re-create the patch and thanks a
lot for looking at it!

Looking at the patch once again, I see that there were at least two
problems. Firstly, I used the equivalent_locale function also on the
encoding values. Even if that should not cause bugs (as it should result
in strncasecmp anyway), it was not pretty..

The second problem was assuming that the locale specifier "A" is not
longer then locale specifier B. Comparisons like 'en_US.utf8' with
'en_US_.utf8' would result in success. Bug resulting from this mistake is
not real probably but it is not nice anyway..

Rather cleaning the patch once more, attached,
Pavel

--
Rushabh Lathia

Tom Lane

tgl@sss.pgh.pa.us

almost 12 years ago

In reply to: Pavel Raiskup (#4)

Re: pg_upgrade: make the locale comparison more tolerating

Pavel Raiskup <praiskup@redhat.com> writes:

[ 0001-pg_upgrade-make-the-locale-comparison-more-toleratin.patch ]

Committed with minor adjustments (mostly, wordsmithing the comments).

Although this was categorized as a bug fix, I'm not sure it's worth
taking the risk of back-patching. We've not seen very many reports
of problems of this type. Of course, you're free to carry it as a
patch in Red Hat's packages if you differ ...

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers