problem with query

Started by Roberto Scattiniover 12 years ago13 messagesgeneral

roberto.scattini@gmail.com

over 12 years ago

hi, today we discovered that this query doesn't return the expected values:

SELECT DISTINCT
p.id, p.apellido AS "Apellido", p.nombre AS "Nombre", pf.nombre AS
"Funcion", to_char(da.f_ingreso_pg, 'dd/mm/yyyy') AS "Fecha Ingreso PG",
e.estado AS "Estado", to_char(pe.fecha, 'dd/mm/yyyy') AS "Fecha Estado"
FROM personal.personas AS p
LEFT JOIN personal.personas_da_pg AS da ON p.id=da.id_persona
LEFT JOIN personal.personas_estado AS pe ON da.estado_id=pe.id
LEFT JOIN personal.estados AS e ON pe.id_estado=e.id
LEFT JOIN procu_departamento AS d ON d.id=da.id_departamento
LEFT JOIN procu_departamento AS dto ON left(d.c_organigrama,
4)||'000'=dto.c_organigrama
LEFT JOIN procu_departamento AS dir ON left(d.c_organigrama,
3)||'0000'=dir.c_organigrama
LEFT JOIN procu_departamento AS dg ON left(d.c_organigrama,
2)||'00000'=dg.c_organigrama
LEFT JOIN personal.funciones AS pf ON pf.id=da.funcion_id
LEFT JOIN personal.profesiones AS pp ON pp.id=p.id_profesion
WHERE p.apellido ilike '%nuñez%'
ORDER BY "Apellido"

the exact same query with ilike '%NUÑEZ%' works OK...

we are using postgresql 9.1 from ubuntu packages and the database encoding
is (sadly) SQL_ASCII

can anybody point me in the right direction to solve this issue?

--
Roberto Scattini

Chris Curvey

chris@chriscurvey.com

over 12 years ago

In reply to: Roberto Scattini (#1)

Re: problem with query

On Thu, Sep 12, 2013 at 4:34 PM, Roberto Scattini <
roberto.scattini@gmail.com> wrote:

hi, today we discovered that this query doesn't return the expected values:

SELECT DISTINCT
p.id, p.apellido AS "Apellido", p.nombre AS "Nombre", pf.nombre AS
"Funcion", to_char(da.f_ingreso_pg, 'dd/mm/yyyy') AS "Fecha Ingreso PG",
e.estado AS "Estado", to_char(pe.fecha, 'dd/mm/yyyy') AS "Fecha Estado"
FROM personal.personas AS p
LEFT JOIN personal.personas_da_pg AS da ON p.id=da.id_persona
LEFT JOIN personal.personas_estado AS pe ON da.estado_id=pe.id
LEFT JOIN personal.estados AS e ON pe.id_estado=e.id
LEFT JOIN procu_departamento AS d ON d.id=da.id_departamento
LEFT JOIN procu_departamento AS dto ON left(d.c_organigrama,
4)||'000'=dto.c_organigrama
LEFT JOIN procu_departamento AS dir ON left(d.c_organigrama,
3)||'0000'=dir.c_organigrama
LEFT JOIN procu_departamento AS dg ON left(d.c_organigrama,
2)||'00000'=dg.c_organigrama
LEFT JOIN personal.funciones AS pf ON pf.id=da.funcion_id
LEFT JOIN personal.profesiones AS pp ON pp.id=p.id_profesion
WHERE p.apellido ilike '%nuñez%'
ORDER BY "Apellido"

the exact same query with ilike '%NUÑEZ%' works OK...

we are using postgresql 9.1 from ubuntu packages and the database encoding
is (sadly) SQL_ASCII

can anybody point me in the right direction to solve this issue?

--
Roberto Scattini

If you could tells what you are expecting, and what you are actually
getting, that would be helpful.

--
The person who says it cannot be done should not interrupt the person who
is doing it. -- Chinese Proverb

Roberto Scattini

roberto.scattini@gmail.com

over 12 years ago

In reply to: Chris Curvey (#2)

Re: problem with query

On Thu, Sep 12, 2013 at 6:22 PM, Chris Curvey <chris@chriscurvey.com> wrote:

On Thu, Sep 12, 2013 at 4:34 PM, Roberto Scattini <
roberto.scattini@gmail.com> wrote:

we are using postgresql 9.1 from ubuntu packages and the database
encoding is (sadly) SQL_ASCII

can anybody point me in the right direction to solve this issue?

--
Roberto Scattini

If you could tells what you are expecting, and what you are actually
getting, that would be helpful.

hi chris, sorry, the query with ilike '%NUÑEZ%' is returning 4 rows and
the query with ilike '%nuñez%' is returning 0 rows.

a simplified version of the query with his results:

pgn=# set client_encoding = SQL_ASCII;
SET
pgon=# SELECT DISTINCT
p.id, p.apellido, p.nombre
FROM personal.personas AS p
WHERE p.apellido ilike '%nuñez%';
id | apellido | nombre
----+----------+--------
(0 rows)

thanks!

--
Roberto Scattini

Chris Curvey

chris@chriscurvey.com

over 12 years ago

In reply to: Roberto Scattini (#3)

Re: problem with query

On Thu, Sep 12, 2013 at 5:33 PM, Roberto Scattini <
roberto.scattini@gmail.com> wrote:

On Thu, Sep 12, 2013 at 6:22 PM, Chris Curvey <chris@chriscurvey.com>wrote:

On Thu, Sep 12, 2013 at 4:34 PM, Roberto Scattini <
roberto.scattini@gmail.com> wrote:

we are using postgresql 9.1 from ubuntu packages and the database
encoding is (sadly) SQL_ASCII

can anybody point me in the right direction to solve this issue?

--
Roberto Scattini

If you could tells what you are expecting, and what you are actually
getting, that would be helpful.

hi chris, sorry, the query with ilike '%NUÑEZ%' is returning 4 rows and
the query with ilike '%nuñez%' is returning 0 rows.

a simplified version of the query with his results:

pgn=# set client_encoding = SQL_ASCII;
SET
pgon=# SELECT DISTINCT
p.id, p.apellido, p.nombre
FROM personal.personas AS p
WHERE p.apellido ilike '%nuñez%';
id | apellido | nombre
----+----------+--------
(0 rows)

pgon=# SELECT DISTINCT
p.id, p.apellido, p.nombre
FROM personal.personas AS p
WHERE p.apellido ilike '%NUÑEZ%';
id | apellido | nombre
-------+----------------+-----------------
39489 | NUÑEZ | JUAN
39937 | PEREZ NUÑEZ | FRANCISCO
39498 | NUÑEZ ESQUIBEL | RICARDO RODOLFO
40220 | NUÑEZ VERA | MANUEL SANTIAGO
(4 rows)

Is this just a case-sentitvity issue? if personas.apellido is a varchar
field, then I think that's your trouble. (it would have to be citext in
order for "nunez" = "NUNEZ".)

Bosco Rama

postgres@boscorama.com

over 12 years ago

In reply to: Chris Curvey (#4)

Re: problem with query

On 09/12/13 14:49, Chris Curvey wrote:

Is this just a case-sentitvity issue? if personas.apellido is a varchar
field, then I think that's your trouble. (it would have to be citext in
order for "nunez" = "NUNEZ".)

He's using 'ilike' in his query, so this is more likely to be a
locale/charset/collation/encoding issue.

Roberto, what encodings are in use by the client and the server/DB?

HTH,
Bosco.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Roberto Scattini

roberto.scattini@gmail.com

over 12 years ago

In reply to: Chris Curvey (#4)

Re: problem with query

On Thu, Sep 12, 2013 at 6:49 PM, Chris Curvey <chris@chriscurvey.com> wrote:

Is this just a case-sentitvity issue? if personas.apellido is a varchar
field, then I think that's your trouble. (it would have to be citext in
order for "nunez" = "NUNEZ".)

yes, is just a case-sensitivity issue. yes, is a varchar field.
but what i dont understand is why the problem of "insensitivity" is only
with ñ (lower). i mean, if i remove my ñ (and subsecuent chars), both
querys return the same rows:

what makes the 'ñ' char special that makes the queries the same when it is
not there?

--
Roberto Scattini

Roberto Scattini

roberto.scattini@gmail.com

over 12 years ago

In reply to: Bosco Rama (#5)

Re: problem with query

On Thu, Sep 12, 2013 at 7:02 PM, Bosco Rama <postgres@boscorama.com> wrote:

On 09/12/13 14:49, Chris Curvey wrote:

Is this just a case-sentitvity issue? if personas.apellido is a varchar
field, then I think that's your trouble. (it would have to be citext in
order for "nunez" = "NUNEZ".)

He's using 'ilike' in his query, so this is more likely to be a
locale/charset/collation/encoding issue.

Roberto, what encodings are in use by the client and the server/DB?

hi bosco, in fact i believe that i have that problem... but i cant
undestand why and how to fix it.

the database has SQL_ASCII encoding, and my client... i am not sure, when i
connect directly with psql from localhost i must set client encoding to
SQL_ASCII or i receive an "invalid utf8 sequence" error...

but the problem is there, from psql, from pgadmin and from php/apache (my
webservers)

thanks

--
Roberto Scattini

Rodrigo Gonzalez

rjgonzale.lists@gmail.com

over 12 years ago

In reply to: Roberto Scattini (#7)

Re: problem with query

On Thu, 12 Sep 2013 19:07:04 -0300
Roberto Scattini <roberto.scattini@gmail.com> wrote:

On Thu, Sep 12, 2013 at 7:02 PM, Bosco Rama <postgres@boscorama.com>
wrote:

On 09/12/13 14:49, Chris Curvey wrote:

Is this just a case-sentitvity issue? if personas.apellido is a
varchar field, then I think that's your trouble. (it would have
to be citext in order for "nunez" = "NUNEZ".)

He's using 'ilike' in his query, so this is more likely to be a
locale/charset/collation/encoding issue.

Roberto, what encodings are in use by the client and the server/DB?

hi bosco, in fact i believe that i have that problem... but i cant
undestand why and how to fix it.

the database has SQL_ASCII encoding, and my client... i am not sure,
when i connect directly with psql from localhost i must set client
encoding to SQL_ASCII or i receive an "invalid utf8 sequence" error...

but the problem is there, from psql, from pgadmin and from php/apache
(my webservers)

Without being an expert....dont use sql_ascii, dump, recreate the DB as
utf8 and you will get what you want.

thanks

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Bosco Rama

postgres@boscorama.com

over 12 years ago

In reply to: Roberto Scattini (#7)

Re: problem with query

On 09/12/13 15:07, Roberto Scattini wrote:

hi bosco, in fact i believe that i have that problem... but i cant
undestand why and how to fix it.

the database has SQL_ASCII encoding, and my client... i am not sure, when i
connect directly with psql from localhost i must set client encoding to
SQL_ASCII or i receive an "invalid utf8 sequence" error...

but the problem is there, from psql, from pgadmin and from php/apache (my
webservers)

Well, the bytes that get encoded (in UTF8) for lowercase 'enye' is
x'c3b1' and for the uppercase version is x'c391' thus when they are
compared as straight bytes they will never be considered equal wrt
case-insensitivity.

The "invalid utf8 sequence" error is due to, as it says, an invalid
utf8 sequence in your text strings. This is due to the lack of any
enforcement provided by the SQL_ASCII encoding that was probably in
effect when the data was initially inserted.

To get to a point where you can use your data in its proper encoding
you will need to identify which strings are invalid and then dump
the DB and restore it into an appropriately initialized DB. You could
use the restore process to help you find the bad strings but that is
tedious as all hell. Better to scan you strings somehow and locate the
ones that are invalid and fix them prior to the dump.

HTH,
Bosco.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#10

John R Pierce

pierce@hogranch.com

over 12 years ago

In reply to: Roberto Scattini (#6)

Re: problem with query

On 9/12/2013 3:03 PM, Roberto Scattini wrote:

yes, is just a case-sensitivity issue. yes, is a varchar field.
but what i dont understand is why the problem of "insensitivity" is
only with ï¿½ (lower). i mean, if i remove my ï¿½ (and subsecuent chars),
both querys return the same rows:

SQLASCII means its just 8 bit octets, and only standard USASCII
characters are recognized, which doesn't include ï¿½ (what encoding is
that in, anyways?)

--
john r pierce 37N 122W
somewhere on the middle of the left coast

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#11

David G. Johnston

david.g.johnston@gmail.com

over 12 years ago

In reply to: Roberto Scattini (#6)

Re: problem with query

Roberto Scattini wrote

what makes the 'ñ' char special that makes the queries the same when it is
not there?

My knowledge here is a little rough around the edges but the following is
conceptually true:

For the most part legacy encodings (or non-encodings as this case
technically falls under) recognize only the explicit case-conversions for
the latin alphabet A-Z (mapped onto "a-z") without any accents.

To reasonably process strings/varchars/clobs that contain accented letters
it is necessary to use a more modern encoding - such as UTF-8/Unicode -
which contains the necessary logic to perform the additional conversions.

these should (not going to test it myself at this time) by
case-insensitively identical:

abcdëFGH
ABCDëfgh

since the "a-d, f-h" can be converted between and the one symbol that
cannot, "ë" is the same in both string....so it isn't that the ë breaks
things but rather that symbol has no upper-case alternative to equivalently
match against...just like numbers and symbols behave in the same situation.
To ASCII ë is just a symbol without any "letter of alphabet"
characteristics.

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/problem-with-query-tp5770637p5770653.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#12

Giuseppe Broccolo

giuseppe.broccolo@2ndquadrant.it

over 12 years ago

In reply to: Roberto Scattini (#1)

Re: problem with query

Il 12/09/2013 22:34, Roberto Scattini ha scritto:

hi, today we discovered that this query doesn't return the expected
values:

SELECT DISTINCT
p.id <http://p.id>, p.apellido AS "Apellido", p.nombre AS "Nombre",
pf.nombre AS "Funcion", to_char(da.f_ingreso_pg, 'dd/mm/yyyy') AS
"Fecha Ingreso PG", e.estado AS "Estado", to_char(pe.fecha,
'dd/mm/yyyy') AS "Fecha Estado"
FROM personal.personas AS p
LEFT JOIN personal.personas_da_pg AS da ON p.id
<http://p.id>=da.id_persona
LEFT JOIN personal.personas_estado AS pe ON da.estado_id=pe.id
<http://pe.id>
LEFT JOIN personal.estados AS e ON pe.id_estado=e.id <http://e.id>
LEFT JOIN procu_departamento AS d ON d.id
<http://d.id>=da.id_departamento
LEFT JOIN procu_departamento AS dto ON left(d.c_organigrama,
4)||'000'=dto.c_organigrama
LEFT JOIN procu_departamento AS dir ON left(d.c_organigrama,
3)||'0000'=dir.c_organigrama
LEFT JOIN procu_departamento AS dg ON left(d.c_organigrama,
2)||'00000'=dg.c_organigrama
LEFT JOIN personal.funciones AS pf ON pf.id <http://pf.id>=da.funcion_id
LEFT JOIN personal.profesiones AS pp ON pp.id
<http://pp.id>=p.id_profesion
WHERE p.apellido ilike '%nuï¿½ez%'
ORDER BY "Apellido"

The problem is the encoding: SQL_ASCII encodes only the first 128
characters, so 'ï¿½' and 'ï¿½' cannot be encoded in ASCII. If you insert
text 'ï¿½' or 'ï¿½' in a table inside a database (e.g. 'ascii_test') with
SQL_ASCII encoding, they will be seen as two distinct characters:

ascii_test=# SELECT length(E'ï¿½');
length
--------
2
(1 row)
ascii_test=# SELECT length(E'ï¿½');
length
--------
2
(1 row)
ascii_test=# SELECT 'ï¿½'::bytea;
bytea
--------
\xc3b1
(1 row)
ascii_test=# SELECT 'ï¿½'::bytea;
bytea
--------
\xc391
(1 row)

Hexadecimals 'c3', 'b1' and '91' does not decode anything in ASCII, even
if terminal show pairs as 'ï¿½' or 'ï¿½'. So ILIKE operator cannot apply
case insensitive pattern matching to data which does not encode any
string (in the SQL_ASCII encoding) and works as a normal LIKE.

Even if the client which insert data has 8-bit encoding (UTF8, where 'ï¿½'
and 'ï¿½' are decoded), SQL_ASCII database cannot convert strings anyway,
and problem remains.

The best way is to work with a UTF8 encoded database. Is there a
particular reason to work with SQL_ASCII encoding?

Giuseppe.

--
Giuseppe Broccolo - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
giuseppe.broccolo@2ndQuadrant.it | www.2ndQuadrant.it

#13

Roberto Scattini

roberto.scattini@gmail.com

over 12 years ago

In reply to: Giuseppe Broccolo (#12)

Re: problem with query

hi giuseppe,

On Fri, Sep 13, 2013 at 11:49 AM, Giuseppe Broccolo <
giuseppe.broccolo@2ndquadrant.it> wrote:

The problem is the encoding: SQL_ASCII encodes only the first 128
characters, so 'ñ' and 'Ñ' cannot be encoded in ASCII. If you insert text
'ñ' or 'Ñ' in a table inside a database (e.g. 'ascii_test') with SQL_ASCII
encoding, they will be seen as two distinct characters:

Hexadecimals 'c3', 'b1' and '91' does not decode anything in ASCII, even
if terminal show pairs as 'ñ' or 'Ñ'. So ILIKE operator cannot apply case
insensitive pattern matching to data which does not encode any string (in
the SQL_ASCII encoding) and works as a normal LIKE.

Even if the client which insert data has 8-bit encoding (UTF8, where 'ñ'
and 'Ñ' are decoded), SQL_ASCII database cannot convert strings anyway, and
problem remains.

The best way is to work with a UTF8 encoded database. Is there a
particular reason to work with SQL_ASCII encoding?

no, the only reason until now has been "If it ain't broke, don't fix it."...
so maybe it is time to start thinking in migrate the encoding to utf8...

is there any normal procedure to do this?

dump and then restore in a freshly initiated database should be enough?

thanks!

--
Roberto Scattini