tsearch filenames unlikes special symbols and numbers

Started by Pavel Stehule over 18 years ago, 42 messages
#1Pavel Stehule
pavel.stehule@gmail.com

Hello

I found a small bug

postgres=# CREATE TEXT SEARCH DICTIONARY cz1(TEMPLATE = ispell,
DictFile= 'cs_czutf');
ERROR: invalid text search configuration file name "cs_czutf"
postgres=# CREATE TEXT SEARCH DICTIONARY cz1(TEMPLATE = ispell,
DictFile= 'csczutf8');
ERROR: invalid text search configuration file name "csczutf8"
postgres=# CREATE TEXT SEARCH DICTIONARY cz1(TEMPLATE = ispell,
DictFile= "csczutf8");
ERROR: invalid text search configuration file name "csczutf8"
postgres=# CREATE TEXT SEARCH DICTIONARY cz1(TEMPLATE = ispell,
DictFile= "cs_czutf");
ERROR: invalid text search configuration file name "cs_czutf"
postgres=# CREATE TEXT SEARCH DICTIONARY cz1(TEMPLATE = ispell,
DictFile= "csczutf");
ERROR: could not open dictionary file
"/usr/local/pgsql/share/tsearch_data/csczutf.dict": No such file or
directory

regards
Pavel Stehule

#2Oleg Bartunov
oleg@sai.msu.su
In reply to: Pavel Stehule (#1)
Re: tsearch filenames unlikes special symbols and numbers

I just tried on CVS HEAD and it seems something is broken

postgres=# CREATE TEXT SEARCH DICTIONARY ru_ispell (
TEMPLATE = ispell,
DictFile = russian-utf8.dict,
AffFile = russian-utf8.aff,
StopWords = russian
);
ERROR: syntax error at or near "-"
LINE 3: DictFile = russian-utf8.dict,

postgres=# CREATE TEXT SEARCH DICTIONARY ru_ispell (
TEMPLATE = ispell,
DictFile = 'russian-utf8.dict',
AffFile = 'russian-utf8.aff',
StopWords = russian
);
ERROR: invalid text search configuration file name "russian-utf8.dict"

Honestly speaking, I have no time to follow the constantly changing syntax,
but the documentation
http://momjian.us/main/writings/pgsql/sgml/sql-createtsdictionary.html
doesn't make clear what's wrong.

Also, I'm wondering whether we really need to show all schemas that have no
text search configurations defined. It looks rather strange.

postgres=# \dF
                   List of text search configurations
       Schema       |    Name    |              Description
--------------------+------------+---------------------------------------
 information_schema |            |
 pg_catalog         | danish     | Configuration for danish language
 pg_catalog         | dutch      | Configuration for dutch language
 pg_catalog         | english    | Configuration for english language
 pg_catalog         | finnish    | Configuration for finnish language
 pg_catalog         | french     | Configuration for french language
 pg_catalog         | german     | Configuration for german language
 pg_catalog         | hungarian  | Configuration for hungarian language
 pg_catalog         | italian    | Configuration for italian language
 pg_catalog         | norwegian  | Configuration for norwegian language
 pg_catalog         | portuguese | Configuration for portuguese language
 pg_catalog         | romanian   | Configuration for romanian language
 pg_catalog         | russian    | Configuration for russian language
 pg_catalog         | simple     | simple configuration
 pg_catalog         | spanish    | Configuration for spanish language
 pg_catalog         | swedish    | Configuration for swedish language
 pg_catalog         | turkish    | Configuration for turkish language
 pg_temp_1          |            |
 pg_toast           |            |
 pg_toast_temp_1    |            |
 public             |            |
(21 rows)

Another problem I see is the broken examples of the dictionary and parser in
the documentation:
http://momjian.us/main/writings/pgsql/sgml/textsearch-rule-dictionary-example.html
http://momjian.us/main/writings/pgsql/sgml/textsearch-parser-example.html

The include files in the dictionary example are now in the tsearch directory:

#include "tsearch/ts_locale.h"
#include "tsearch/ts_public.h"
#include "tsearch/ts_utils.h"

I didn't test parser example.

Oleg

PS. Sorry, I missed the last syntax changes, but I really don't understand
the usage of parentheses and commas in SQL. It's so strange.
I remember Peter raised an objection at the very beginning.

On Sun, 2 Sep 2007, Pavel Stehule wrote:

Hello

I found a small bug

postgres=# CREATE TEXT SEARCH DICTIONARY cz1(TEMPLATE = ispell,
DictFile= 'cs_czutf');
ERROR: invalid text search configuration file name "cs_czutf"
postgres=# CREATE TEXT SEARCH DICTIONARY cz1(TEMPLATE = ispell,
DictFile= 'csczutf8');
ERROR: invalid text search configuration file name "csczutf8"
postgres=# CREATE TEXT SEARCH DICTIONARY cz1(TEMPLATE = ispell,
DictFile= "csczutf8");
ERROR: invalid text search configuration file name "csczutf8"
postgres=# CREATE TEXT SEARCH DICTIONARY cz1(TEMPLATE = ispell,
DictFile= "cs_czutf");
ERROR: invalid text search configuration file name "cs_czutf"
postgres=# CREATE TEXT SEARCH DICTIONARY cz1(TEMPLATE = ispell,
DictFile= "csczutf");
ERROR: could not open dictionary file
"/usr/local/pgsql/share/tsearch_data/csczutf.dict": No such file or
directory

regards
Pavel Stehule

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Oleg Bartunov (#2)
Re: tsearch filenames unlikes special symbols and numbers

Oleg Bartunov <oleg@sai.msu.su> writes:

postgres=# CREATE TEXT SEARCH DICTIONARY ru_ispell (
TEMPLATE = ispell,
DictFile = 'russian-utf8.dict',
AffFile = 'russian-utf8.aff',
StopWords = russian
);
ERROR: invalid text search configuration file name "russian-utf8.dict"

I made it reject all but latin letters, which is the same restriction
that's in place for timezone set filenames. That might be overly
strong, but we definitely have to forbid "." and "/" (and "\" on
Windows). Do we want to restrict it to letters, digits, underscore?
Or does it need to be weaker than that?

Also, I'm wondering do we really need to show all schemas without
text search configurations defined ? Looks rather stranger.

Um ... I don't see that; I get

regression=# \dF
               List of text search configurations
   Schema   |    Name    |              Description
------------+------------+---------------------------------------
 pg_catalog | danish     | Configuration for danish language
 pg_catalog | dutch      | Configuration for dutch language
 pg_catalog | english    | Configuration for english language
 pg_catalog | finnish    | Configuration for finnish language
 pg_catalog | french     | Configuration for french language
 pg_catalog | german     | Configuration for german language
 pg_catalog | hungarian  | Configuration for hungarian language
 pg_catalog | italian    | Configuration for italian language
 pg_catalog | norwegian  | Configuration for norwegian language
 pg_catalog | portuguese | Configuration for portuguese language
 pg_catalog | romanian   | Configuration for romanian language
 pg_catalog | russian    | Configuration for russian language
 pg_catalog | simple     | simple configuration
 pg_catalog | spanish    | Configuration for spanish language
 pg_catalog | swedish    | Configuration for swedish language
 pg_catalog | turkish    | Configuration for turkish language
(16 rows)

Are you sure you're using CVS-head psql?

Another problem I see are broken examples of dictionary and parser in
documentation:
http://momjian.us/main/writings/pgsql/sgml/textsearch-rule-dictionary-example.html
http://momjian.us/main/writings/pgsql/sgml/textsearch-parser-example.html

Yeah, I wanted to discuss that with you. Code examples in sgml docs are
a bad idea: they're impossible to use as actual templates, because of
all the weird markup changes, and there's no easy way to notice if
they're broken. It would be better to remove these from the docs and
set them up as contrib modules.

regards, tom lane

#4Gregory Stark
stark@enterprisedb.com
In reply to: Tom Lane (#3)
Re: tsearch filenames unlikes special symbols and numbers

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

Oleg Bartunov <oleg@sai.msu.su> writes:

postgres=# CREATE TEXT SEARCH DICTIONARY ru_ispell (
TEMPLATE = ispell,
DictFile = 'russian-utf8.dict',
AffFile = 'russian-utf8.aff',
StopWords = russian
);
ERROR: invalid text search configuration file name "russian-utf8.dict"

I made it reject all but latin letters, which is the same restriction
that's in place for timezone set filenames. That might be overly
strong, but we definitely have to forbid "." and "/" (and "\" on
Windows). Do we want to restrict it to letters, digits, underscore?
Or does it need to be weaker than that?

What's the problem with "."?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gregory Stark (#4)
Re: tsearch filenames unlikes special symbols and numbers

Gregory Stark <stark@enterprisedb.com> writes:

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

I made it reject all but latin letters, which is the same restriction
that's in place for timezone set filenames. That might be overly
strong, but we definitely have to forbid "." and "/" (and "\" on
Windows). Do we want to restrict it to letters, digits, underscore?
Or does it need to be weaker than that?

What's the problem with "."?

../../../../etc/passwd

Possibly we could allow '.' as long as we forbade /, but the other
trouble with allowing . is that it encourages people to try to specify
the filetype suffix (as indeed Oleg was doing). I'd prefer to keep the
suffixes out of the SQL object definitions, with an eye to possibly
someday migrating all the configuration data inside the database.
There's a reasonable argument for restricting the names used for these
things in the SQL definitions to be valid SQL identifiers, so that that
will work nicely...

regards, tom lane

#6Gregory Stark
stark@enterprisedb.com
In reply to: Tom Lane (#5)
Re: tsearch filenames unlikes special symbols and numbers

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

Gregory Stark <stark@enterprisedb.com> writes:

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

I made it reject all but latin letters, which is the same restriction
that's in place for timezone set filenames. That might be overly
strong, but we definitely have to forbid "." and "/" (and "\" on
Windows). Do we want to restrict it to letters, digits, underscore?
Or does it need to be weaker than that?

What's the problem with "."?

../../../../etc/passwd

Possibly we could allow '.' as long as we forbade /,

Right, traditionally the only characters forbidden in filenames in Unix are /
and nul. If we want the files to play nice in Gnome etc then we should
restrict them to ascii since we don't know what encoding the gui expects.

Actually I think in Windows \ : and . are problems (not allowed more than one
dot in dos).

There's a reasonable argument for restricting the names used for these
things in the SQL definitions to be valid SQL identifiers, so that that
will work nicely...

Ah

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#7Trevor Talbot
quension@gmail.com
In reply to: Gregory Stark (#6)
Re: tsearch filenames unlikes special symbols and numbers

On 9/2/07, Gregory Stark <stark@enterprisedb.com> wrote:

Right, traditionally the only characters forbidden in filenames in Unix are /
and nul. If we want the files to play nice in Gnome etc then we should
restrict them to ascii since we don't know what encoding the gui expects.

Actually I think in Windows \ : and . are problems (not allowed more than one
dot in dos).

Reserved characters in Windows filenames are < > : " / \ | ? *
DOS limitations aren't relevant on the OS versions Postgres supports.

...but I thought this was about opening existing files, not creating
them, in which case the only relevant limitation is path separators.
Any other reserved characters are going to result in no open file,
rather than a security hole.
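
The reserved-character list above can be tested with a single strpbrk() call. This is only a minimal sketch: the helper name has_windows_reserved_char is hypothetical and is not part of PostgreSQL, and it covers only the nine characters listed in this message.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper (not PostgreSQL code): returns 1 if "name" contains
 * any of the characters listed above as reserved in Windows file names,
 * 0 otherwise.
 */
int
has_windows_reserved_char(const char *name)
{
    assert(name != NULL);
    /* strpbrk() returns a pointer to the first matching character, or NULL */
    return strpbrk(name, "<>:\"/\\|?*") != NULL;
}
```

For read-only access, as noted above, only the path separators in that set matter for security; the other characters would simply make the open fail.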

#8Magnus Hagander
magnus@hagander.net
In reply to: Gregory Stark (#6)
Re: tsearch filenames unlikes special symbols and numbers

On Mon, Sep 03, 2007 at 07:47:14AM +0100, Gregory Stark wrote:

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

Gregory Stark <stark@enterprisedb.com> writes:

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

I made it reject all but latin letters, which is the same restriction
that's in place for timezone set filenames. That might be overly
strong, but we definitely have to forbid "." and "/" (and "\" on
Windows). Do we want to restrict it to letters, digits, underscore?
Or does it need to be weaker than that?

What's the problem with "."?

../../../../etc/passwd

Possibly we could allow '.' as long as we forbade /,

Right, traditionally the only characters forbidden in filenames in Unix are /
and nul. If we want the files to play nice in Gnome etc then we should
restrict them to ascii since we don't know what encoding the gui expects.

Actually I think in Windows \ : and . are problems (not allowed more than one
dot in dos).

\ and : are problems.

. is not a problem. We don't support 16-bit Windows anyway, and multiple
dots work fine on any system we support.

//Magnus

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#8)
Re: tsearch filenames unlikes special symbols and numbers

Magnus Hagander <magnus@hagander.net> writes:

On Mon, Sep 03, 2007 at 07:47:14AM +0100, Gregory Stark wrote:

Actually I think in Windows \ : and . are problems (not allowed more
than one dot in dos).

\ and : are problems.

Is : really a problem, given that the name in question will be appended
to a known directory's path?

. is not a problem. We don't support 16-bit windows anyway, and multiple
dots works fine on any system we support.

I'm not convinced that . is issue-free. On most if not all versions of Unix,
you are allowed to open a directory as a file and read the filenames it
contains. While I don't say it'd be easy to manage that through
tsearch, there's at least a potential for discovering the filenames
present in . and .. --- how much do we care about that?

regards, tom lane

#10Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#9)
Re: tsearch filenames unlikes special symbols and numbers

On Mon, Sep 03, 2007 at 09:27:19AM -0400, Tom Lane wrote:

Magnus Hagander <magnus@hagander.net> writes:

On Mon, Sep 03, 2007 at 07:47:14AM +0100, Gregory Stark wrote:

Actually I think in Windows \ : and . are problems (not allowed more
than one dot in dos).

\ and : are problems.

Is : really a problem, given that the name in question will be appended
to a known directory's path?

Yes. It won't work - the API calls will reject it.

. is not a problem. We don't support 16-bit windows anyway, and multiple
dots works fine on any system we support.

I'm not convinced that . is issue-free. On most if not all versions of Unix,
you are allowed to open a directory as a file and read the filenames it
contains. While I don't say it'd be easy to manage that through
tsearch, there's at least a potential for discovering the filenames
present in . and .. --- how much do we care about that?

I just meant that it's not a problem on Win32 to have a file with multiple
dots in the name. There can certainly be *other* reasons for it. I don't
really see the need to have an extra dot in the filename in this particular
case, so I'd certainly be fine with restricting this one a lot more.

//Magnus

#11Gregory Stark
stark@enterprisedb.com
In reply to: Tom Lane (#9)
Re: tsearch filenames unlikes special symbols and numbers

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

I'm not convinced that . is issue-free. On most if not all versions of Unix,
you are allowed to open a directory as a file and read the filenames it
contains. While I don't say it'd be easy to manage that through
tsearch, there's at least a potential for discovering the filenames
present in . and .. --- how much do we care about that?

Actually I don't think that's true any more, most file systems on most Unixen
do not allow it. However it appears it's still the case for Solaris so it's
still a good point.

I'm sure it's not true for modern versions of Linux and I thought it was false
for other modern OSes -- I'm surprised it's not for Solaris even.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gregory Stark (#11)
Re: tsearch filenames unlikes special symbols and numbers

Gregory Stark <stark@enterprisedb.com> writes:

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

I'm not convinced that . is issue-free. On most if not all versions of Unix,
you are allowed to open a directory as a file and read the filenames it
contains. While I don't say it'd be easy to manage that through
tsearch, there's at least a potential for discovering the filenames
present in . and .. --- how much do we care about that?

Actually I don't think that's true any more, most file systems on most Unixen
do not allow it. However it appears it's still the case for Solaris so it's
still a good point.

Actually, now that I've woken up a bit more, it is not a problem as
long as the tsearch code always appends some kind of file extension
to what the user gives, such as ".dict". It'll be impossible to name
"." or ".." with that addition.

Also, Magnus says that Windows throws an error for ":" in the filename,
which means we needn't.

So the bottom line seems to be that rejecting directory separators
is sufficient to prevent any unwanted file accesses.
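
The two points above (always append an extension, reject directory separators) can be sketched as follows. This is an illustration under stated assumptions, not the actual tsearch code: the function name build_config_path and its signature are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not the actual tsearch code): builds
 * "<dir>/<basename>.<ext>" into buf.  Returns 0 on success, -1 if the
 * base name contains a directory separator or the result does not fit.
 */
int
build_config_path(char *buf, size_t buflen, const char *dir,
                  const char *basename, const char *ext)
{
    int n;

    assert(buf != NULL && dir != NULL && basename != NULL && ext != NULL);

    /* Without a separator the name cannot leave the directory. */
    if (strpbrk(basename, "/\\") != NULL)
        return -1;

    /* The appended ".<ext>" means the result is never exactly "." or "..". */
    n = snprintf(buf, buflen, "%s/%s.%s", dir, basename, ext);
    if (n < 0 || (size_t) n >= buflen)
        return -1;
    return 0;
}
```

With this sketch a base name of ".." yields a path ending in "...dict", which names an ordinary (and most likely nonexistent) file rather than the parent directory.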

It might still be a good idea to restrict the names to be SQL
identifiers (ie, alphanumerics and underscores) for future-proofing,
but it wasn't clear whether anyone but me thought that was a good
argument. I'm willing to make it just be no-dir-separators.

regards, tom lane

#13Gregory Stark
stark@enterprisedb.com
In reply to: Tom Lane (#12)
Re: tsearch filenames unlikes special symbols and numbers

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

It might still be a good idea to restrict the names to be SQL
identifiers (ie, alphanumerics and underscores) for future-proofing,
but it wasn't clear whether anyone but me thought that was a good
argument. I'm willing to make it just be no-dir-separators.

I thought that was a good argument actually.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#14Mark Mielke
mark@mark.mielke.cc
In reply to: Tom Lane (#9)
Re: tsearch filenames unlikes special symbols and numbers

Tom Lane wrote:

Magnus Hagander <magnus@hagander.net> writes:

On Mon, Sep 03, 2007 at 07:47:14AM +0100, Gregory Stark wrote:

Actually I think in Windows \ : and . are problems (not allowed more
than one dot in dos).

\ and : are problems.

Is : really a problem, given that the name in question will be appended
to a known directory's path?

The file name shouldn't have a ':' in it. Accessing a path with multiple
':' in it to open a file for reading should just fail normally. So yes,
there should be no problem.

. is not a problem. We don't support 16-bit windows anyway, and multiple
dots works fine on any system we support.

I'm not convinced that . is issue-free. On most if not all versions of Unix,
you are allowed to open a directory as a file and read the filenames it
contains. While I don't say it'd be easy to manage that through
tsearch, there's at least a potential for discovering the filenames
present in . and .. --- how much do we care about that?

No more than discovering the file names in any other directory without
using '.' or '..'? If it matters, check to ensure it is a regular file
before opening it?

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

#15Mark Mielke
mark@mark.mielke.cc
In reply to: Tom Lane (#12)
Re: tsearch filenames unlikes special symbols and numbers

Tom Lane wrote:

Also, Magnus says that Windows throws an error for ":" in the filename,
which means we needn't.

Windows doesn't fail - but it can do odd things. For example, try:

C:\> echo hi >foo:bar

If one then checks the directory, one finds a "foo".

Depending on *which* API one uses, the rules may change around a bit -
but whatever the situation, as long as you prefix it with a valid path,
the ":" is not going to cause you problems.

It might still be a good idea to restrict the names to be SQL
identifiers (ie, alphanumerics and underscores) for future-proofing,
but it wasn't clear whether anyone but me thought that was a good
argument. I'm willing to make it just be no-dir-separators.

I think it is a good argument.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

#16Trevor Talbot
quension@gmail.com
In reply to: Mark Mielke (#15)
Re: tsearch filenames unlikes special symbols and numbers

On 9/3/07, Mark Mielke <mark@mark.mielke.cc> wrote:

Tom Lane wrote:

Also, Magnus says that Windows throws an error for ":" in the filename,
which means we needn't.

Windows doesn't fail - but it can do odd things. For example, try:

C:\> echo hi >foo:bar

If one then checks the directory, one finds a "foo".

: is used for naming streams and attribute types in NTFS filenames.
It's not very well-known functionality and tends to confuse people,
but I'm not aware of any situation where it'd be a problem for read
access. (Creation is not a security risk in the technical sense, but
as most administrators aren't aware of alternate data streams and the
shell does not expose them, it's effectively hidden data.)

If any of you are familiar with MacOS HFS resource forks, NTFS
basically supports an arbitrary number of named forks. A file is a
collection of one or more data streams, the single unnamed stream
being the default.

#17Decibel!
decibel@decibel.org
In reply to: Tom Lane (#3)
Code examples

Moving to -docs

On Sun, Sep 02, 2007 at 06:46:11PM -0400, Tom Lane wrote:

Another problem I see are broken examples of dictionary and parser in
documentation:
http://momjian.us/main/writings/pgsql/sgml/textsearch-rule-dictionary-example.html
http://momjian.us/main/writings/pgsql/sgml/textsearch-parser-example.html

Yeah, I wanted to discuss that with you. Code examples in sgml docs are
a bad idea: they're impossible to use as actual templates, because of
all the weird markup changes, and there's no easy way to notice if
they're broken. It would be better to remove these from the docs and
set them up as contrib modules.

Couldn't we come up with some method of specifying code examples in the
docs and then having the doc build process actually run those examples
and put that into the doc build?

I wrote some code that does this back when I was thinking about writing
a book, if anyone wants to see it.
--
Decibel!, aka Jim Nasby decibel@decibel.org
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

#18Florian Pflug
fgp.phlo.org@gmail.com
In reply to: Trevor Talbot (#16)
Re: tsearch filenames unlikes special symbols and numbers

Trevor Talbot wrote:

On 9/3/07, Mark Mielke <mark@mark.mielke.cc> wrote:

Tom Lane wrote:

Also, Magnus says that Windows throws an error for ":" in the filename,
which means we needn't.

Windows doesn't fail - but it can do odd things. For example, try:

C:\> echo hi >foo:bar

If one then checks the directory, one finds a "foo".

: is used for naming streams and attribute types in NTFS filenames.
It's not very well-known functionality and tends to confuse people,
but I'm not aware of any situation where it'd be a problem for read
access. (Creation is not a security risk in the technical sense, but
as most administrators aren't aware of alternate data streams and the
shell does not expose them, it's effectively hidden data.)

If any of you are familiar with MacOS HFS resource forks, NTFS
basically supports an arbitrary number of named forks. A file is
collection of one or more data streams, the single unnamed stream
being default.

On MacOS prior to OSX, : was used as a directory separator (paths
looked like "My Harddisk:My Folder:Somefile"). In OSX, "/" is used,
but for backwards compatibility the Finder translates "/" in filenames
to ":". So if you do, for example, "touch 'my:test'" in the shell,
you see "my/test" in the Finder.

That's another argument for staying away from : in filenames.

greetings, Florian Pflug

#19Alvaro Herrera
alvherre@commandprompt.com
In reply to: Tom Lane (#5)
Re: tsearch filenames unlikes special symbols and numbers

Tom Lane wrote:

Possibly we could allow '.' as long as we forbade /, but the other
trouble with allowing . is that it encourages people to try to specify
the filetype suffix (as indeed Oleg was doing). I'd prefer to keep the
suffixes out of the SQL object definitions, with an eye to possibly
someday migrating all the configuration data inside the database.
There's a reasonable argument for restricting the names used for these
things in the SQL definitions to be valid SQL identifiers, so that that
will work nicely...

Well, if we were to use SQL identifiers, we couldn't forbid much,
seeing as almost anything can be used as an identifier, so
long as it is properly quoted.

But it seems to me like we could just pick a convenient subset which
doesn't make any OS too angry about it (say, reject / \ . and :), and
when we get to using actual SQL identifiers, we can enlarge the
supported char set without creating any backwards-compatibility problem.

On the other hand, this means the name has to be quoted if it would be
quoted as an SQL identifier, right?

--
Alvaro Herrera http://www.amazon.com/gp/registry/DXLWNGRJD34J
"I will never trust a traitor. Not even if I created the traitor myself"
(Baron Vladimir Harkonnen)

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#19)
Re: tsearch filenames unlikes special symbols and numbers

Alvaro Herrera <alvherre@commandprompt.com> writes:

On the other hand, this means the name has to be quoted if it would be
quoted as an SQL identifier, right?

Something like that. I wasn't planning on rejecting uppercase letters,
though, which would be necessary if you wanted to be strict about
matching unquoted identifiers.

There seems to be a fairly clear use-case for allowing A-Z a-z 0-9 and
underscore (while CVS head rejects 0-9 and underscore). There also seem
to be good arguments for disallowing / \ : on various platforms, which
leaves us with some other punctuation in question, as well as the whole
matter of non-ASCII characters. I'm not sure whether we want to touch
the idea of non-ASCII; comments?

regards, tom lane

#21Heikki Linnakangas
heikki@enterprisedb.com
In reply to: Tom Lane (#20)
Re: tsearch filenames unlikes special symbols and numbers

Tom Lane wrote:

I'm not sure whether we want to touch
the idea of non-ASCII; comments?

Non-ASCII filenames sound like a recipe for problems to me. We don't know
what encoding the filenames are in on disk.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#22Ben Tilly
btilly@gmail.com
In reply to: Tom Lane (#20)
Re: tsearch filenames unlikes special symbols and numbers

On 9/3/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

On the other hand, this means the name has to be quoted if it would be
quoted as an SQL identifier, right?

Something like that. I wasn't planning on rejecting uppercase letters,
though, which would be necessary if you wanted to be strict about
matching unquoted identifiers.

There seems fairly clear use-case for allowing A-Z a-z 0-9 and
underscore (while CVS head rejects 0-9 and underscore). There also seem
to be good arguments for disallowing / \ : on various platforms, which
leaves us with some other punctuation in question, as well as the whole
matter of non-ASCII characters. I'm not sure whether we want to touch
the idea of non-ASCII; comments?

The problem with allowing uppercase letters is that on some
filesystems foo and Foo are the same file, and on others they are not.
This may lead to obscure portability problems where code worked fine
on Unix and then fails when the database is running on Windows.

The approach that I'd suggest is allow a very restricted subset as an
immediate solution (say a-z and 0-9), and plan to later allow
arbitrary data to be passed in, then be encoded in some way before
hitting disk. (And later need not be much later - such encodings are
not that hard to write.)
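
One possible shape for such an encoding, purely as an illustration: bytes in a-z and 0-9 pass through unchanged and every other byte, including uppercase letters and the underscore itself, becomes an underscore followed by two lowercase hex digits. The scheme and the name encode_name are assumptions made for this sketch; nothing in the thread specifies them.

```c
#include <assert.h>
#include <string.h>

#define SAFE_CHARS "abcdefghijklmnopqrstuvwxyz0123456789"

/*
 * Hypothetical encoder: writes an on-disk-safe form of "in" to "out".
 * Safe characters are copied; any other byte is written as "_xx" (hex).
 * Returns 0 on success, -1 if "out" is too small.
 */
int
encode_name(const char *in, char *out, size_t outlen)
{
    static const char hex[] = "0123456789abcdef";
    size_t o = 0;

    assert(in != NULL && out != NULL);
    for (; *in != '\0'; in++)
    {
        unsigned char c = (unsigned char) *in;

        if (strchr(SAFE_CHARS, c) != NULL)
        {
            if (o + 1 >= outlen)    /* need room for c and the terminator */
                return -1;
            out[o++] = (char) c;
        }
        else
        {
            if (o + 3 >= outlen)    /* need room for "_xx" and terminator */
                return -1;
            out[o++] = '_';
            out[o++] = hex[c >> 4];
            out[o++] = hex[c & 0x0F];
        }
    }
    if (o >= outlen)
        return -1;
    out[o] = '\0';
    return 0;
}
```

Because the output uses only a-z, 0-9 and underscore, "foo" and "Foo" map to different file names even on a case-insensitive filesystem.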

Cheers,
Ben

#23Ben Tilly
btilly@gmail.com
In reply to: Tom Lane (#12)
Re: tsearch filenames unlikes special symbols and numbers

On 9/3/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Gregory Stark <stark@enterprisedb.com> writes:

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

I'm not convinced that . is issue-free. On most if not all versions of Unix,
you are allowed to open a directory as a file and read the filenames it
contains. While I don't say it'd be easy to manage that through
tsearch, there's at least a potential for discovering the filenames
present in . and .. --- how much do we care about that?

Actually I don't think that's true any more, most file systems on most Unixen
do not allow it. However it appears it's still the case for Solaris so it's
still a good point.

Actually, now that I've woken up a bit more, it is not a problem as
long as the tsearch code always appends some kind of file extension
to what the user gives, such as ".dict". It'll be impossible to name
"." or ".." with that addition.

I don't know what you're discussing well enough to know if this is
relevant, but what you just said is not always true. If there is any
way to pass arbitrary binary data into your function call, then
someone can pass in a string with nul in it. When that hits the OS
API, your appended .dict won't be seen as part of the filename.

(This is a common security oversight when calling C APIs from
higher-level languages such as Perl. See
http://artofhacking.com/files/phrack/phrack55/P55-07.TXT for more.)

[...]

Cheers,
Ben

#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ben Tilly (#23)
Re: tsearch filenames unlikes special symbols and numbers

"Ben Tilly" <btilly@gmail.com> writes:

I don't know what you're discussing well enough to know if this is
relevant, but what you just said is not always true. If there is any
way to pass arbitrary binary data into your function call, then
someone can pass in a string with nul in it.

Not a problem here, because the passed-in data is considered
nul-terminated already. (Sometimes, not being 8-bit-clean is
an advantage...)

regards, tom lane

#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ben Tilly (#22)
Re: tsearch filenames unlikes special symbols and numbers

"Ben Tilly" <btilly@gmail.com> writes:

On 9/3/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

There seems fairly clear use-case for allowing A-Z a-z 0-9 and
underscore (while CVS head rejects 0-9 and underscore).

The problem with allowing uppercase letters is that on some
filesystems foo and Foo are the same file, and on others they are not.
This may lead to obscure portability problems where code worked fine
on Unix and then fails when the database is running on Windows.

Yeah, good point. So far it seems that a-z 0-9 and underscore cover the
real use-cases, so what say we just allow those for now? It's a lot
easier to loosen up later than tighten up ...
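
The rule proposed here (a-z, 0-9 and underscore only) can be sketched with strspn(). The function name is_valid_config_basename is hypothetical, and rejecting the empty string is an extra assumption of this sketch rather than something stated in the thread.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical check: returns 1 if "name" is non-empty and consists only
 * of lowercase ASCII letters, digits and underscores, 0 otherwise.
 */
int
is_valid_config_basename(const char *name)
{
    size_t ok_len;

    assert(name != NULL);
    if (name[0] == '\0')
        return 0;

    /* Length of the leading run made up only of allowed characters. */
    ok_len = strspn(name, "abcdefghijklmnopqrstuvwxyz0123456789_");

    /* Valid only if that run extends to the end of the string. */
    return name[ok_len] == '\0';
}
```

Under this rule the names from the first message, cs_czutf and csczutf8, are accepted, while russian-utf8.dict and ../../../../etc/passwd are rejected.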

regards, tom lane

#26Peter Eisentraut
peter_e@gmx.net
In reply to: Decibel! (#17)
Re: Code examples

Decibel! wrote:

Couldn't we come up with some method of specifying code examples in
the docs and then having the doc build process actually run those
examples and put that into the doc build?

While that seems very tempting, I think you need manual review to check
whether the examples make didactic sense. For example, I seem to
recall that we had to change some examples about how operator
precedence or type casting gives unexpected results several times over
the years because the unexpected results had turned into expected
results in response to new features. If you'd just produce the
documentation examples automatically, you'd be left with quite
embarrassing nonsense in there.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#26)
Re: Code examples

Peter Eisentraut <peter_e@gmx.net> writes:

Decibel! wrote:

Couldn't we come up with some method of specifying code examples in
the docs and then having the doc build process actually run those
examples and put that into the doc build?

While that seems very tempting, I think you need manual review to check
whether the examples make didactic sense. For example, I seem to
recall that we had to change some examples about how operator
precedence or type casting gives unexpected results several times over
the years because the unexpected results had turned into expected
results in response to new features. If you'd just produce the
documentation examples automatically, you'd be left with quite
embarrassing nonsense in there.

Well, both my point and Jim's was that trying to compile and run the
code would help to expose such silliness.

In the case at hand, there's another consideration: the examples are
large and are (I believe) intended as working skeletons for real code
development. They'd be a lot more useful for that purpose if they were
available as actual contrib modules. C code that's been hacked until
it passes for SGML isn't compilable.

regards, tom lane

#28Pavel Stehule
pavel.stehule@gmail.com
In reply to: Tom Lane (#25)
Re: tsearch filenames unlikes special symbols and numbers

2007/9/4, Tom Lane <tgl@sss.pgh.pa.us>:

"Ben Tilly" <btilly@gmail.com> writes:

On 9/3/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

There seems to be a fairly clear use-case for allowing A-Z a-z 0-9 and
underscore (while CVS head rejects 0-9 and underscore).

The problem with allowing uppercase letters is that on some
filesystems foo and Foo are the same file, and on others they are not.
This may lead to obscure portability problems where code worked fine
on Unix and then fails when the database is running on Windows.

Yeah, good point. So far it seems that a-z 0-9 and underscore cover the
real use-cases, so what say we just allow those for now? It's a lot
easier to loosen up later than tighten up ...

regards, tom lane

It's system specific. I prefer a-z and A-Z. Classic names for
dictionaries combine lower and upper case characters, e.g. for Czech
cs_CZ_UTF8 etc.

dictfile = cs_CZ_UTF8 ... automatically converted to cs_cz_utf8.dict
dictfile = 'cs_CZ_UTF8' .. check and use cs_CZ_UTF8

Regards
Pavel Stehule

p.s. it's important on UNIX platforms and has no effect on Windows.

#29Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#27)
Re: Code examples

On Tuesday, 4 September 2007 02:39, Tom Lane wrote:

C code that's been hacked until it passes for SGML isn't compilable.

I don't understand this point. Why would SGML care what the C code looks
like?

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Pavel Stehule (#28)
Re: tsearch filenames unlikes special symbols and numbers

"Pavel Stehule" <pavel.stehule@gmail.com> writes:

2007/9/4, Tom Lane <tgl@sss.pgh.pa.us>:

Yeah, good point. So far it seems that a-z 0-9 and underscore cover the
real use-cases, so what say we just allow those for now? It's a lot
easier to loosen up later than tighten up ...

It's system specific. I prefer a-z and A-Z. Classic names for
dictionaries combine lower and upper case characters, e.g. for Czech
cs_CZ_UTF8 etc.

You're going to need to alter that habit anyway, because it's not
appropriate to mention any specific encoding in the dictionary name.

But on further thought it strikes me that insisting on all lower case
doesn't eliminate case-sensitivity portability problems. For instance,
suppose the given parameter is 'foo' and the actual file name is
Foo.dict. This will work fine on Windows and will stop working when
moved to Unix. So I'm not sure we really buy much by rejecting
upper-case letters in the parameter --- all we do is constrain which
side of the fence you have to fix any mismatches on. And we picked the
side that only a DBA, rather than a plain SQL user, can fix.

regards, tom lane

#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#29)
Re: Code examples

Peter Eisentraut <peter_e@gmx.net> writes:

On Tuesday, 4 September 2007 02:39, Tom Lane wrote:

C code that's been hacked until it passes for SGML isn't compilable.

I don't understand this point. Why would SGML care what the C code looks
like?

&, <, and > need to be hacked so that SGML doesn't barf on them.
Unfortunately, all three symbols are a bit commonplace in C code.

Now admittedly this can be fixed with moderately simple
search-and-replaces, but it's still another obstacle in the path of
someone who actually wishes to use the code for its intended purpose,
or even someone who would like to find out if the examples aren't
broken.
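
For illustration, the search-and-replace in question amounts to something
like the following sketch. The function name sgml_escape and its
buffer-based interface are assumptions made up for this example; they do
not correspond to any tool in the doc build.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy src into dst, replacing the three characters
 * SGML treats specially with entity references.  Returns false if dst
 * (of size dstsize, including the terminating NUL) is too small.
 */
bool
sgml_escape(const char *src, char *dst, size_t dstsize)
{
    size_t      used = 0;

    for (; *src != '\0'; src++)
    {
        char        single[2] = {*src, '\0'};
        const char *rep = single;   /* default: copy the character as-is */
        size_t      len;

        if (*src == '&')
            rep = "&amp;";
        else if (*src == '<')
            rep = "&lt;";
        else if (*src == '>')
            rep = "&gt;";

        len = strlen(rep);
        if (used + len + 1 > dstsize)
            return false;           /* no room for this piece plus NUL */
        memcpy(dst + used, rep, len);
        used += len;
    }

    if (used + 1 > dstsize)
        return false;               /* covers empty input with dstsize 0 */
    dst[used] = '\0';
    return true;
}
```

A line such as "if (a < b && p->x > 0)" comes out as
"if (a &lt; b &amp;&amp; p-&gt;x &gt; 0)", which is exactly the form that
no longer compiles when copied out of the SGML source.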

regards, tom lane

#32Alvaro Herrera
alvherre@commandprompt.com
In reply to: Tom Lane (#31)
Re: Code examples

Tom Lane wrote:

Peter Eisentraut <peter_e@gmx.net> writes:

On Tuesday, 4 September 2007 02:39, Tom Lane wrote:

C code that's been hacked until it passes for SGML isn't compilable.

I don't understand this point. Why would SGML care what the C code looks
like?

&, <, and > need to be hacked so that SGML doesn't barf on them.
Unfortunately, all three symbols are a bit commonplace in C code.

Maybe we could set things up so that there are actual files which are
programmatically preprocessed to SGML to be included in the docs? That
way, the docs always reflect the actual file, which by itself is
compilable. The SGML source would only contain something like
<include file="examples/foo.c" /> or something like that.

Is that feasible?

--
Alvaro Herrera http://www.amazon.com/gp/registry/CTMLCN8V17R4
"No necesitamos banderas
No reconocemos fronteras" (Jorge González)

#33Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#31)
Re: Code examples

On Tuesday, 4 September 2007 16:11, Tom Lane wrote:

&, <, and > need to be hacked so that SGML doesn't barf on them.
Unfortunately, all three symbols are a bit commonplace in C code.

I assume that someone who wants to try out the code would copy it from the
HTML, not out of the SGML source.

But in any case you can avoid the escaping like so:

<![CDATA[
... code ...
]]>

Grep for existing uses.

The idea of including the C files directly could also work.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

#34Pavel Stehule
pavel.stehule@gmail.com
In reply to: Tom Lane (#30)
Re: tsearch filenames unlikes special symbols and numbers

2007/9/4, Tom Lane <tgl@sss.pgh.pa.us>:

"Pavel Stehule" <pavel.stehule@gmail.com> writes:

2007/9/4, Tom Lane <tgl@sss.pgh.pa.us>:

Yeah, good point. So far it seems that a-z 0-9 and underscore cover the
real use-cases, so what say we just allow those for now? It's a lot
easier to loosen up later than tighten up ...

It's system specific. I prefer a-z and A-Z. Classic names for
dictionaries combine lower and upper case characters, e.g. for Czech
cs_CZ_UTF8 etc.

You're going to need to alter that habit anyway, because it's not
appropriate to mention any specific encoding in the dictionary name.

But on further thought it strikes me that insisting on all lower case
doesn't eliminate case-sensitivity portability problems. For instance,
suppose the given parameter is 'foo' and the actual file name is
Foo.dict. This will work fine on Windows and will stop working when
moved to Unix. So I'm not sure we really buy much by rejecting
upper-case letters in the parameter --- all we do is constrain which
side of the fence you have to fix any mismatches on. And we picked the
side that only a DBA, rather than a plain SQL user, can fix.

OK, I can understand that. But I don't see the sense in quoting the params.

Regards
Pavel Stehule

#35Ben Tilly
btilly@gmail.com
In reply to: Tom Lane (#30)
Re: tsearch filenames unlikes special symbols and numbers

On 9/4/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
[...]

But on further thought it strikes me that insisting on all lower case
doesn't eliminate case-sensitivity portability problems. For instance,
suppose the given parameter is 'foo' and the actual file name is
Foo.dict. This will work fine on Windows and will stop working when
moved to Unix. So I'm not sure we really buy much by rejecting
upper-case letters in the parameter --- all we do is constrain which
side of the fence you have to fix any mismatches on. And we picked the
side that only a DBA, rather than a plain SQL user, can fix.

True, only a DBA can fix it. But only a DBA can screw it up. That
seems reasonable to me. Furthermore fixing this mistake at the plain
SQL user level in reality means auditing a code base for the
construct, which is never fun.

However if you wish to be paranoid, I believe that all filesystems of
interest to PostgreSQL are at least case preserving. In which case on
case sensitive filesystems you could check that the case of the stored
filename matches what you want it to be. Now the problem of the
filename having the case wrong can be detected on both Windows and
Unix.
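
A rough sketch of that check, for illustration only. It assumes the
directory listing has already been read into an array (in real code the
names would come from readdir()); the names NameMatch and
classify_filename are hypothetical.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>            /* strcasecmp (POSIX) */

typedef enum
{
    NAME_EXACT,                 /* an entry matches byte for byte */
    NAME_CASE_MISMATCH,         /* an entry matches only when case is ignored */
    NAME_MISSING                /* no entry matches at all */
} NameMatch;

/*
 * Hypothetical helper: compare the wanted file name against the names
 * actually stored in the directory.  Because the filesystem preserves
 * case, a case-only mismatch can be reported on any platform instead of
 * silently working on one and failing on another.
 */
NameMatch
classify_filename(const char *wanted, const char *const *entries, size_t n)
{
    NameMatch   result = NAME_MISSING;
    size_t      i;

    for (i = 0; i < n; i++)
    {
        if (strcmp(entries[i], wanted) == 0)
            return NAME_EXACT;
        if (strcasecmp(entries[i], wanted) == 0)
            result = NAME_CASE_MISMATCH;
    }
    return result;
}
```

With a directory containing Foo.dict, asking for foo.dict yields
NAME_CASE_MISMATCH, which could be turned into an error on Windows as
well as on Unix.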

Of course that check is a complication and slows things down. If all
dictionary files have to be in a fixed directory, then you can easily
add a cron job that scans that directory and fixes the case of any
dictionary files that have upper case letters in their names.
(Beware, there was once a bug in Windows where renaming Foo to foo
accidentally deleted the file. It is therefore safer to rename Foo to
bar then bar to foo. However this is a moot point since I doubt that
anyone would actually run a brand new PostgreSQL database on an early
version of NT 4.0...)

Cheers,
Ben

#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#32)
Re: Code examples

Alvaro Herrera <alvherre@commandprompt.com> writes:

Maybe we could set things up so that there are actual files which are
programmatically preprocessed to SGML to be included in the docs? That
way, the docs always reflect the actual file, which by itself is
compilable. The SGML source would only contain something like
<include file="examples/foo.c" /> or something like that.

Well, if we have actual contrib modules (which is still a good idea
so that they get tested on a regular basis), I don't see any need to
copy the code into the docs at all. The docs should just say "a working
example can be found in contrib/whatever".

regards, tom lane

#37Oleg Bartunov
oleg@sai.msu.su
In reply to: Tom Lane (#36)
Re: Code examples

On Tue, 4 Sep 2007, Tom Lane wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

Maybe we could set things up so that there are actual files which are
programmatically preprocessed to SGML to be included in the docs? That
way, the docs always reflect the actual file, which by itself is
compilable. The SGML source would only contain something like
<include file="examples/foo.c" /> or something like that.

Well, if we have actual contrib modules (which is still a good idea
so that they get tested on a regular basis), I don't see any need to
copy the code into the docs at all. The docs should just say "a working
example can be found in contrib/whatever".

I think Tom is right. We already have many user dictionaries that
would be worth distributing.

regards, tom lane

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

#38Oleg Bartunov
oleg@sai.msu.su
In reply to: Tom Lane (#5)
Re: tsearch filenames unlikes special symbols and numbers

On Sun, 2 Sep 2007, Tom Lane wrote:

Gregory Stark <stark@enterprisedb.com> writes:

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

I made it reject all but latin letters, which is the same restriction
that's in place for timezone set filenames. That might be overly
strong, but we definitely have to forbid "." and "/" (and "\" on
Windows). Do we want to restrict it to letters, digits, underscore?
Or does it need to be weaker than that?

What's the problem with "."?

../../../../etc/passwd

Possibly we could allow '.' as long as we forbade /, but the other
trouble with allowing . is that it encourages people to try to specify
the filetype suffix (as indeed Oleg was doing). I'd prefer to keep the
suffixes out of the SQL object definitions, with an eye to possibly
someday migrating all the configuration data inside the database.
There's a reasonable argument for restricting the names used for these
things in the SQL definitions to be valid SQL identifiers, so that that
will work nicely...

So, what's the current policy? Still a-z, A-Z? I think we should allow
'.' and prevent '/'. Look how ugly our current ispell setup is, which
depends on 3 files: the stop word list, .dict and .aff.

Right now, I can use something like

CREATE TEXT SEARCH DICTIONARY en_ispell (
TEMPLATE = ispell,
DictFile = englishDict,
AffFile = englishAff,
StopWords = english
);

I'd rather use english.dict, english.aff, english.stop, which is what
users are accustomed to, without dictating to the user here. We have
already imposed a lot of restrictions.

I hope we won't require a special extension like .dict or .aff, since we
can't know in advance what files other dictionaries will use.
If we allow '.' without '/', then we'd be happy.
I'd remove the requirement for an extension on the stop word list, which
looks rather artificial to me.

Oh, my god, I see we dictate extensions!

STATEMENT: CREATE TEXT SEARCH DICTIONARY en_ispell (
TEMPLATE = ispell,
DictFile = englishDict,
AffFile = englishAff,
StopWords = englishStop
);
ERROR: could not open dictionary file "/usr/local/pgsql-dev/share/tsearch_data/englishdict.dict": No such file or directory

Folks, this is too much! Now we dictate the extensions '.dict, .affix, .stop',
what else?

Is this defined by the ispell template only, or is it a global requirement?

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

#39Oleg Bartunov
oleg@sai.msu.su
In reply to: Oleg Bartunov (#38)
Re: tsearch filenames unlikes special symbols and numbers

On Sun, 9 Sep 2007, Oleg Bartunov wrote:

Oh, my god, I see we dictate extensions!

STATEMENT: CREATE TEXT SEARCH DICTIONARY en_ispell (
TEMPLATE = ispell,
DictFile = englishDict,
AffFile = englishAff,
StopWords = englishStop
);
ERROR: could not open dictionary file
"/usr/local/pgsql-dev/share/tsearch_data/englishdict.dict": No such file or
directory

Folks, this is too much! Now we dictate the extensions '.dict, .affix, .stop',
what else?

I notice that the documentation doesn't mention this:
http://momjian.us/main/writings/pgsql/sgml/textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: Oleg Bartunov (#38)
Re: tsearch filenames unlikes special symbols and numbers

Oleg Bartunov <oleg@sai.msu.su> writes:

Oh, my god, I see we dictate extensions!
Folks, this is too much! Now we dictate the extensions '.dict, .affix, .stop',
what else?

Is this defined by the ispell template only, or is it a global requirement?

It's the callers of get_tsearch_config_filename() that specify the
extension, so AFAICS each dictionary can do what it wants. I don't see
the problem with enforcing an extension: it keeps the namespaces for
different kinds of files separate, and it gets us out of the potential
security risk of allowing access to "." or "..".
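
As a sketch of what that looks like from the caller's side: the helper
below is hypothetical (it is not get_tsearch_config_filename() and does
not claim to have its signature), and the sharedir argument is an
assumption standing in for the installation's share directory. It only
illustrates that the SQL-level parameter supplies a bare name while the
caller supplies the extension, giving paths of the form seen in the
error messages upthread.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper: build "<sharedir>/tsearch_data/<name>.<extension>"
 * into buf.  Returns false if the result does not fit in buflen bytes.
 * The name is assumed to have been validated already, so it cannot
 * contain '.' or '/'.
 */
bool
build_config_path(char *buf, size_t buflen, const char *sharedir,
                  const char *name, const char *extension)
{
    int         n = snprintf(buf, buflen, "%s/tsearch_data/%s.%s",
                             sharedir, name, extension);

    /* snprintf reports the length it wanted; reject truncated results */
    return n >= 0 && (size_t) n < buflen;
}
```

An ispell-style caller would ask for the same name with "dict" and
"affix", so the namespaces stay separate without the user ever typing
an extension.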

I remain of the opinion that we don't really want the SQL-command
definitions of dictionaries to expose the fact that these are files at
all. We should be thinking of the command parameters as identifiers.

regards, tom lane

#41Erik Rijkers
er@xs4all.nl
In reply to: Tom Lane (#31)
extract('dow', ...) mention

Hi,

It seems to me this mention of extract()
in the manual (chapter 9.8):
extract('dow', ...)

is better replaced with
extract(dow from ...)

( 8.3, 8.4, HEAD )

hth

Erik Rijkers

--- doc/src/sgml/func.sgml.orig 2009-11-15 21:58:02.000000000 +0100
+++ doc/src/sgml/func.sgml      2009-11-15 21:59:16.000000000 +0100
@@ -5274,9 +5274,9 @@
      <listitem>
       <para>
         <function>to_char(..., 'ID')</function>'s day of the week numbering
-        matches the <function>extract('isodow', ...)</function> function, but
+        matches the <function>extract(isodow from ...)</function> function, but
         <function>to_char(..., 'D')</function>'s does not match
-        <function>extract('dow', ...)</function>'s day numbering.
+        <function>extract(dow from  ...)</function>'s day numbering.
       </para>
      </listitem>

#42Peter Eisentraut
peter_e@gmx.net
In reply to: Erik Rijkers (#41)
Re: extract('dow', ...) mention

On Sun, 2009-11-15 at 22:50 +0100, Erik Rijkers wrote:

It seems to me this mention of extract()
in the manual (chapter 9.8):
extract('dow', ...)

is better replaced with
extract(dow from ...)

( 8.3, 8.4, HEAD )

Fix applied to those versions. Thanks.