Example non-Latin words for text search parser docs?

Started by Tom Lane over 18 years ago, 14 messages, docs
#1 Tom Lane
tgl@sss.pgh.pa.us

I'm afraid my English-centricity is showing, but I could use a little
help filling in the missing examples in the table here:
http://developer.postgresql.org/pgdocs/postgres/textsearch-parsers.html
I'm not sure of a suitable example all-non-ASCII-letters word, and
even less sure of how to represent it in SGML. (I remember we had
quite a bit of trouble dealing with accented letters in people's names,
for instance.)

regards, tom lane

#2 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#1)
Re: Example non-Latin words for text search parser docs?

Tom Lane wrote:

I'm afraid my English-centricity is showing, but I could use a little
help filling in the missing examples in the table here:
http://developer.postgresql.org/pgdocs/postgres/textsearch-parsers.html
I'm not sure of a suitable example all-non-ASCII-letters word,

It's easy to find an example -- I went to the English Wikipedia,
searched for "elephant", then clicked on the Russian link at the left.
It gives you "Слоновые", which I see on my terminal as a series of black
squares :-) so there's not a single Latin letter in it.

http://ru.wikipedia.org/wiki/%D0%A1%D0%BB%D0%BE%D0%BD%D0%BE%D0%B2%D1%8B%D0%B5

In that page they also mention the word "Слон" which looks like "Slon".

and even less sure of how to represent it in SGML. (I remember we had
quite a bit of trouble dealing with accented letters in people's
names, for instance.)

Yeah, that will prove difficult.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#2)
Re: Example non-Latin words for text search parser docs?

Alvaro Herrera <alvherre@commandprompt.com> writes:

Tom Lane wrote:

and even less sure of how to represent it in SGML. (I remember we had
quite a bit of trouble dealing with accented letters in people's
names, for instance.)

Yeah, that will prove difficult.

This problem largely goes away if we redefine the word categories as
under discussion in the -hackers thread: with any of the proposed
alternatives it'd be pretty easy to make up real words that are easily
representable in SGML.

regards, tom lane

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#2)
Re: Example non-Latin words for text search parser docs?

Alvaro Herrera <alvherre@commandprompt.com> writes:

Tom Lane wrote:

I'm afraid my English-centricity is showing, but I could use a little
help filling in the missing examples in the table here:
http://developer.postgresql.org/pgdocs/postgres/textsearch-parsers.html
I'm not sure of a suitable example all-non-ASCII-letters word,

It's easy to find an example -- I went to the English Wikipedia,
searched for "elephant", then clicked on the Russian link at the left.
It gives you "Слоновые", which I see on my terminal as a series of black
squares :-) so there's not a single Latin letter in it.

Given the just-applied changes in the definition of a "word", we no
longer need a totally-not-ASCII sample word. But I wonder if anyone
has a better idea than the f&oslash;&oslash; that I made up on the
spot...

regards, tom lane

#5 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#4)
Re: Example non-Latin words for text search parser docs?

Tom Lane wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

Tom Lane wrote:

I'm afraid my English-centricity is showing, but I could use a little
help filling in the missing examples in the table here:
http://developer.postgresql.org/pgdocs/postgres/textsearch-parsers.html
I'm not sure of a suitable example all-non-ASCII-letters word,

It's easy to find an example -- I went to the English Wikipedia,
searched for "elephant", then clicked on the Russian link at the left.
It gives you "Слоновые", which I see on my terminal as a series of black
squares :-) so there's not a single Latin letter in it.

Given the just-applied changes in the definition of a "word", we no
longer need a totally-not-ASCII sample word. But I wonder if anyone
has a better idea than the f&oslash;&oslash; that I made up on the
spot...

Actually I was wondering if we should use actual words. So instead of
"foo" we could use "elephant" for asciiword and "Éléphant" (French) for
word. And for the hword, "sous-espèces" (which appears on the French
Wikipedia) would do.

--
Alvaro Herrera http://www.flickr.com/photos/alvherre/
"La espina, desde que nace, ya pincha" (Proverbio africano)

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#5)
Re: Example non-Latin words for text search parser docs?

Alvaro Herrera <alvherre@commandprompt.com> writes:

Actually I was wondering if we should use actual words. So instead of
"foo" we could use "elephant" for asciiword and "Éléphant" (French) for
word. And for the hword, "sous-espèces" (which appears on the French
Wikipedia) would do.

Hmm ... I see a potential problem with that, which is that if someone
happened to be viewing the page on something that dropped the accents,
or even just made them too small to be easily readable, the examples
wouldn't make any sense at all.

I have no problem with "elephant" as a sample asciiword, but for the
sample non-ascii word I'd suggest something that (a) is clearly not
English and (b) as much as possible, everybody knows has an accent.
At least in large parts of the US, something like "mañana" would
work nicely.

Anyway, feel free to hack on it --- I'm getting a bit weary of looking
at that chapter.

regards, tom lane

#7 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#6)
Re: Example non-Latin words for text search parser docs?

Tom Lane wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

Actually I was wondering if we should use actual words. So instead of
"foo" we could use "elephant" for asciiword and "Éléphant" (French) for
word. And for the hword, "sous-espèces" (which appears on the French
Wikipedia) would do.

Hmm ... I see a potential problem with that, which is that if someone
happened to be viewing the page on something that dropped the accents,
or even just made them too small to be easily readable, the examples
wouldn't make any sense at all.

I have no problem with "elephant" as a sample asciiword, but for the
sample non-ascii word I'd suggest something that (a) is clearly not
English and (b) as much as possible, everybody knows has an accent.
At least in large parts of the US, something like "mañana" would
work nicely.

OK, I went with that. I also used real Spanish hyphenated words in the
hword examples. I also changed the domains foo.com to example.com, just
because I'm anal enough to do it.

The hword_asciipart I'm not 100% sure about. I used this:

militar in the context político-militar, or postgresql in the
context postgresql-beta1

What I wanted to emphasize here is that it's the "ascii-ness" of the
part that matters, not that of the complete token. The reason I'm not
sure about it is that it makes the table wider.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#7)
Re: Example non-Latin words for text search parser docs?

Alvaro Herrera <alvherre@commandprompt.com> writes:

The hword_asciipart I'm not 100% sure about. I used this:
militar in the context político-militar, or postgresql in the
context postgresql-beta1

Hmm ... I went and looked at the page on developer.postgresql.org,
and it's just as I feared: with slightly bleary morning eyes, the
accents over the i's are not obvious, and so you have to look *real*
close before you get the point of the examples. It doesn't help that
'politico' with no accent is exactly how the phrase would be spelled
in English, and so it's easy to not see the accent because you're not
expecting one. The other examples seem alright, but I think that one's
a bad choice.

regards, tom lane

#9 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#8)
Re: Example non-Latin words for text search parser docs?

Tom Lane wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

The hword_asciipart I'm not 100% sure about. I used this:
militar in the context político-militar, or postgresql in the
context postgresql-beta1

Hmm ... I went and looked at the page on developer.postgresql.org,
and it's just as I feared: with slightly bleary morning eyes, the
accents over the i's are not obvious, and so you have to look *real*
close before you get the point of the examples. It doesn't help that
'politico' with no accent is exactly how the phrase would be spelled
in English, and so it's easy to not see the accent because you're not
expecting one. The other examples seem alright, but I think that one's
a bad choice.

Damn. Ok, I'll search for a different example. We're making progress
nonetheless ;-)

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#10 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#9)
Re: Example non-Latin words for text search parser docs?

Alvaro Herrera wrote:

Tom Lane wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

The hword_asciipart I'm not 100% sure about. I used this:
militar in the context político-militar, or postgresql in the
context postgresql-beta1

Hmm ... I went and looked at the page on developer.postgresql.org,
and it's just as I feared: with slightly bleary morning eyes, the
accents over the i's are not obvious, and so you have to look *real*
close before you get the point of the examples. It doesn't help that
'politico' with no accent is exactly how the phrase would be spelled
in English, and so it's easy to not see the accent because you're not
expecting one. The other examples seem alright, but I think that one's
a bad choice.

Damn. Ok, I'll search for a different example. We're making progress
nonetheless ;-)

How about "lógico-matemática"?

(If that one doesn't work for you, maybe we should look into words in
another language, more different from English. Maybe Magnus can suggest
hyphenated words with weird letters).

--
Alvaro Herrera http://www.amazon.com/gp/registry/DXLWNGRJD34J
"La rebeldía es la virtud original del hombre" (Arthur Schopenhauer)

#11 Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#8)
Re: Example non-Latin words for text search parser docs?

On Thursday, 25 October 2007, Tom Lane wrote:

Hmm ... I went and looked at the page on developer.postgresql.org,
and it's just as I feared: with slightly bleary morning eyes, the
accents over the i's are not obvious, and so you have to look *real*
close before you get the point of the examples.

By that standard, you will have to use non-Latin letters, which might decrease
the usability of the examples much more. There are not likely to be any
Latin-looking letters that are not ASCII and do not resemble another Latin
letter.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

#12 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Eisentraut (#11)
Re: Example non-Latin words for text search parser docs?

Peter Eisentraut wrote:

On Thursday, 25 October 2007, Tom Lane wrote:

Hmm ... I went and looked at the page on developer.postgresql.org,
and it's just as I feared: with slightly bleary morning eyes, the
accents over the i's are not obvious, and so you have to look *real*
close before you get the point of the examples.

By that standard, you will have to use non-Latin letters, which might decrease
the usability of the examples much more. There are not likely to be any
Latin-looking letters that are not ASCII and do not resemble another Latin
letter.

I think it would suffice to use an accent over a vowel that's not an i.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#12)
Re: Example non-Latin words for text search parser docs?

Alvaro Herrera <alvherre@commandprompt.com> writes:

Peter Eisentraut wrote:

On Thursday, 25 October 2007, Tom Lane wrote:

Hmm ... I went and looked at the page on developer.postgresql.org,
and it's just as I feared: with slightly bleary morning eyes, the
accents over the i's are not obvious, and so you have to look *real*
close before you get the point of the examples.

By that standard, you will have to use non-Latin letters, which might decrease
the usability of the examples much more. There are not likely to be any
Latin-looking letters that are not ASCII and do not resemble another Latin
letter.

I think it would suffice to use an accent over a vowel that's not an i.

Yeah, that would help. But the real problem with político-militar
is that it looks way too much like the English equivalent --- my first
reaction was "huh, he forgot the 'y'". I'm after a word that *looks*
not-English. Alvaro's comment that maybe we need to look to something
besides Spanish seems on point.

regards, tom lane

#14 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#10)
Re: Example non-Latin words for text search parser docs?

Alvaro Herrera <alvherre@commandprompt.com> writes:

How about "lógico-matemática"?

Works for me.

regards, tom lane