pgsql: Temporarily modify tsearch regression tests to suppress notice

Started by Nonameover 18 years ago11 messages
#1Noname
tgl@postgresql.org

Log Message:
-----------
Temporarily modify tsearch regression tests to suppress notice that comes
out at erratic times, because it is creating a totally unacceptable level
of noise in our buildfarm results. This patch can be reverted when and if
the code is fixed to not issue notices during cache reload events.

Modified Files:
--------------
pgsql/src/backend/tsearch:
thesaurus_sample.ths (r1.1 -> r1.2)
(http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/tsearch/thesaurus_sample.ths?r1=1.1&r2=1.2)
pgsql/src/test/regress/expected:
tsdicts.out (r1.1 -> r1.2)
(http://developer.postgresql.org/cvsweb.cgi/pgsql/src/test/regress/expected/tsdicts.out?r1=1.1&r2=1.2)

#2Bruce Momjian
bruce@momjian.us
In reply to: Noname (#1)
Re: [COMMITTERS] pgsql: Temporarily modify tsearch regression tests to suppress notice

I just talked to Teodor and we discussed this problem. My idea is to
have a special marker in the synonym table, perhaps "*" to indicate the
presence of _any_ stop word at that location. This will not produce any
warnings because it is clearly intentional. The original warning for a
literal stop word will remain but will happen only when the user users
an actual unintentional use of a stop word. The synonym examples will
need to be updated to use "*" as stop word markers.

---------------------------------------------------------------------------

Tom Lane wrote:

Log Message:
-----------
Temporarily modify tsearch regression tests to suppress notice that comes
out at erratic times, because it is creating a totally unacceptable level
of noise in our buildfarm results. This patch can be reverted when and if
the code is fixed to not issue notices during cache reload events.

Modified Files:
--------------
pgsql/src/backend/tsearch:
thesaurus_sample.ths (r1.1 -> r1.2)
(http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/tsearch/thesaurus_sample.ths?r1=1.1&r2=1.2)
pgsql/src/test/regress/expected:
tsdicts.out (r1.1 -> r1.2)
(http://developer.postgresql.org/cvsweb.cgi/pgsql/src/test/regress/expected/tsdicts.out?r1=1.1&r2=1.2)

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#2)
Re: [COMMITTERS] pgsql: Temporarily modify tsearch regression tests to suppress notice

Bruce Momjian <bruce@momjian.us> writes:

I just talked to Teodor and we discussed this problem. My idea is to
have a special marker in the synonym table, perhaps "*" to indicate the
presence of _any_ stop word at that location. This will not produce any
warnings because it is clearly intentional.

That's not fixing the problem, unless your proposal includes never
issuing any warnings at all, for anything.

regards, tom lane

#4Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#3)
Re: [COMMITTERS] pgsql: Temporarily modify tsearch regression tests to suppress notice

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

I just talked to Teodor and we discussed this problem. My idea is to
have a special marker in the synonym table, perhaps "*" to indicate the
presence of _any_ stop word at that location. This will not produce any
warnings because it is clearly intentional.

That's not fixing the problem, unless your proposal includes never
issuing any warnings at all, for anything.

No warning for "*" because it is intentional, but warning for actual
stop words.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#4)
Re: [COMMITTERS] pgsql: Temporarily modify tsearch regression tests to suppress notice

Bruce Momjian <bruce@momjian.us> writes:

Tom Lane wrote:

That's not fixing the problem, unless your proposal includes never
issuing any warnings at all, for anything.

No warning for "*" because it is intentional, but warning for actual
stop words.

No, you're focusing on one symptom not the problem. The problem is
that we've got user-visible behavior going on during what's effectively
a chance event, ie, a cache reload.

One possible real solution would be to tweak the dictionary APIs so
that the dictionaries can find out whether this is the first load during
a session, or a reload, and emit notices only in the first case.

regards, tom lane

#6Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#5)
Re: [COMMITTERS] pgsql: Temporarily modify tsearch regression tests to suppress notice

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Tom Lane wrote:

That's not fixing the problem, unless your proposal includes never
issuing any warnings at all, for anything.

No warning for "*" because it is intentional, but warning for actual
stop words.

No, you're focusing on one symptom not the problem. The problem is
that we've got user-visible behavior going on during what's effectively
a chance event, ie, a cache reload.

One possible real solution would be to tweak the dictionary APIs so
that the dictionaries can find out whether this is the first load during
a session, or a reload, and emit notices only in the first case.

Yea, that would work too. Or just throw an error for a stop word in the
file and then you never get a reload (use "*" instead).

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#6)
Re: [COMMITTERS] pgsql: Temporarily modify tsearch regression tests to suppress notice

Bruce Momjian <bruce@momjian.us> writes:

Tom Lane wrote:

One possible real solution would be to tweak the dictionary APIs so
that the dictionaries can find out whether this is the first load during
a session, or a reload, and emit notices only in the first case.

Yea, that would work too. Or just throw an error for a stop word in the
file and then you never get a reload (use "*" instead).

Hm, that's a thought --- it'd be a way to solve the problem without an
API change for dictionaries, which is something to avoid at this late
stage of the 8.3 cycle. Come to think of it, does the ts_cache stuff
work properly when an error is thrown in dictionary load (ie, is the
cache entry left in a sane state)?

regards, tom lane

#8Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#7)
1 attachment(s)
Fix for stop words in thesaurus file

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Tom Lane wrote:

One possible real solution would be to tweak the dictionary APIs so
that the dictionaries can find out whether this is the first load during
a session, or a reload, and emit notices only in the first case.

Yea, that would work too. Or just throw an error for a stop word in the
file and then you never get a reload (use "*" instead).

Hm, that's a thought --- it'd be a way to solve the problem without an
API change for dictionaries, which is something to avoid at this late
stage of the 8.3 cycle. Come to think of it, does the ts_cache stuff
work properly when an error is thrown in dictionary load (ie, is the
cache entry left in a sane state)?

I have developed the attached patch which uses "?" to mark stop words in
the thesaurus file. ("*" was already in use in the file.) I updated
the docs to use "?", which makes the documentation clearer too.

The patch also reenables testing of stop words in the thesuarus file.

FYI, there is no longer a NOTICE for stop words in the thesaurus file;
it throws an error now, and says to use "?" instead.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Attachments:

/pgpatches/ts_thesaurustext/x-diffDownload
Index: doc/src/sgml/textsearch.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/textsearch.sgml,v
retrieving revision 1.30
diff -c -c -r1.30 textsearch.sgml
*** doc/src/sgml/textsearch.sgml	5 Nov 2007 15:55:53 -0000	1.30
--- doc/src/sgml/textsearch.sgml	9 Nov 2007 02:26:17 -0000
***************
*** 2258,2277 ****
     </para>
  
     <para>
!     Stop words recognized by the subdictionary are replaced by a <quote>stop
!     word placeholder</quote> to record their position. To illustrate this,
!     consider these phrases:
  
  <programlisting>
! a one the two : swsw
! the one a two : swsw2
  </programlisting>
  
!     Assuming that <literal>a</> and <literal>the</> are stop words according
!     to the subdictionary, these two phrases are identical to the thesaurus:
!     they both look like <replaceable>stopword</> <literal>one</>
!     <replaceable>stopword</> <literal>two</>.  Input matching this pattern
!     will be replaced by <literal>swsw2</>, according to the tie-breaking rule.
     </para>
  
     <para>
--- 2258,2274 ----
     </para>
  
     <para>
!     Specific stop words recognized by the subdictionary cannot be
!     specified;  instead use <literal>?</> to mark the location where any
!     stop word can appear.  For example, assuming that <literal>a</> and
!     <literal>the</> are stop words according to the subdictionary:
  
  <programlisting>
! ? one ? two : swsw
  </programlisting>
  
!     matches <literal>a one the two</> and <literal>the one a two</>;
!     both would be replaced by <literal>swsw</>.
     </para>
  
     <para>
Index: src/backend/tsearch/dict_thesaurus.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/tsearch/dict_thesaurus.c,v
retrieving revision 1.5
diff -c -c -r1.5 dict_thesaurus.c
*** src/backend/tsearch/dict_thesaurus.c	9 Nov 2007 01:32:22 -0000	1.5
--- src/backend/tsearch/dict_thesaurus.c	9 Nov 2007 02:26:17 -0000
***************
*** 412,458 ****
  	{
  		TSLexeme   *ptr;
  
! 		ptr = (TSLexeme *) DatumGetPointer(FunctionCall4(&(d->subdict->lexize),
! 									   PointerGetDatum(d->subdict->dictData),
! 										  PointerGetDatum(d->wrds[i].lexeme),
! 									Int32GetDatum(strlen(d->wrds[i].lexeme)),
! 													 PointerGetDatum(NULL)));
! 
! 		if (!ptr)
! 			elog(ERROR, "thesaurus word-sample \"%s\" isn't recognized by subdictionary (rule %d)",
! 				 d->wrds[i].lexeme, d->wrds[i].entries->idsubst + 1);
! 		else if (!(ptr->lexeme))
! 		{
! 			elog(NOTICE, "thesaurus word-sample \"%s\" is recognized as stop-word, assign any stop-word (rule %d)",
! 				 d->wrds[i].lexeme, d->wrds[i].entries->idsubst + 1);
! 
  			newwrds = addCompiledLexeme(newwrds, &nnw, &tnm, NULL, d->wrds[i].entries, 0);
- 		}
  		else
  		{
! 			while (ptr->lexeme)
  			{
! 				TSLexeme   *remptr = ptr + 1;
! 				int			tnvar = 1;
! 				int			curvar = ptr->nvariant;
! 
! 				/* compute n words in one variant */
! 				while (remptr->lexeme)
  				{
! 					if (remptr->nvariant != (remptr - 1)->nvariant)
! 						break;
! 					tnvar++;
! 					remptr++;
! 				}
! 
! 				remptr = ptr;
! 				while (remptr->lexeme && remptr->nvariant == curvar)
! 				{
! 					newwrds = addCompiledLexeme(newwrds, &nnw, &tnm, remptr, d->wrds[i].entries, tnvar);
! 					remptr++;
  				}
- 
- 				ptr = remptr;
  			}
  		}
  
--- 412,459 ----
  	{
  		TSLexeme   *ptr;
  
! 		if (strcmp(d->wrds[i].lexeme, "?") == 0)	/* Is stop word marker? */
  			newwrds = addCompiledLexeme(newwrds, &nnw, &tnm, NULL, d->wrds[i].entries, 0);
  		else
  		{
! 			ptr = (TSLexeme *) DatumGetPointer(FunctionCall4(&(d->subdict->lexize),
! 										   PointerGetDatum(d->subdict->dictData),
! 											  PointerGetDatum(d->wrds[i].lexeme),
! 										Int32GetDatum(strlen(d->wrds[i].lexeme)),
! 														 PointerGetDatum(NULL)));
! 	
! 			if (!ptr)
! 				elog(ERROR, "thesaurus word-sample \"%s\" isn't recognized by subdictionary (rule %d)",
! 					 d->wrds[i].lexeme, d->wrds[i].entries->idsubst + 1);
! 			else if (!(ptr->lexeme))
! 				elog(ERROR, "thesaurus word-sample \"%s\" is recognized as stop-word, use \"?\" for stop words instead (rule %d)",
! 					 d->wrds[i].lexeme, d->wrds[i].entries->idsubst + 1);
! 			else
  			{
! 				while (ptr->lexeme)
  				{
! 					TSLexeme   *remptr = ptr + 1;
! 					int			tnvar = 1;
! 					int			curvar = ptr->nvariant;
! 	
! 					/* compute n words in one variant */
! 					while (remptr->lexeme)
! 					{
! 						if (remptr->nvariant != (remptr - 1)->nvariant)
! 							break;
! 						tnvar++;
! 						remptr++;
! 					}
! 	
! 					remptr = ptr;
! 					while (remptr->lexeme && remptr->nvariant == curvar)
! 					{
! 						newwrds = addCompiledLexeme(newwrds, &nnw, &tnm, remptr, d->wrds[i].entries, tnvar);
! 						remptr++;
! 					}
! 	
! 					ptr = remptr;
  				}
  			}
  		}
  
Index: src/backend/tsearch/thesaurus_sample.ths
===================================================================
RCS file: /cvsroot/pgsql/src/backend/tsearch/thesaurus_sample.ths,v
retrieving revision 1.2
diff -c -c -r1.2 thesaurus_sample.ths
*** src/backend/tsearch/thesaurus_sample.ths	23 Sep 2007 15:58:58 -0000	1.2
--- src/backend/tsearch/thesaurus_sample.ths	9 Nov 2007 02:26:17 -0000
***************
*** 14,17 ****
  supernovae stars : *sn
  supernovae : *sn
  booking tickets : order invitation cards
! # booking the tickets : order invitation Cards
--- 14,18 ----
  supernovae stars : *sn
  supernovae : *sn
  booking tickets : order invitation cards
! booking ? tickets : order invitation Cards
! 
Index: src/test/regress/expected/tsdicts.out
===================================================================
RCS file: /cvsroot/pgsql/src/test/regress/expected/tsdicts.out,v
retrieving revision 1.3
diff -c -c -r1.3 tsdicts.out
*** src/test/regress/expected/tsdicts.out	23 Oct 2007 20:46:12 -0000	1.3
--- src/test/regress/expected/tsdicts.out	9 Nov 2007 02:26:20 -0000
***************
*** 311,318 ****
  (1 row)
  
  SELECT to_tsvector('thesaurus_tst', 'Booking tickets is looking like a booking a tickets');
!                              to_tsvector                             
! ---------------------------------------------------------------------
!  'book':8 'card':3 'like':6 'look':5 'invit':2 'order':1 'ticket':10
  (1 row)
  
--- 311,318 ----
  (1 row)
  
  SELECT to_tsvector('thesaurus_tst', 'Booking tickets is looking like a booking a tickets');
!                       to_tsvector                      
! -------------------------------------------------------
!  'card':3,10 'like':6 'look':5 'invit':2,9 'order':1,8
  (1 row)
  
#9Simon Riggs
simon@2ndquadrant.com
In reply to: Bruce Momjian (#8)
Re: Fix for stop words in thesaurus file

On Thu, 2007-11-08 at 21:32 -0500, Bruce Momjian wrote:

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Tom Lane wrote:

One possible real solution would be to tweak the dictionary APIs so
that the dictionaries can find out whether this is the first load during
a session, or a reload, and emit notices only in the first case.

Yea, that would work too. Or just throw an error for a stop word in the
file and then you never get a reload (use "*" instead).

Hm, that's a thought --- it'd be a way to solve the problem without an
API change for dictionaries, which is something to avoid at this late
stage of the 8.3 cycle. Come to think of it, does the ts_cache stuff
work properly when an error is thrown in dictionary load (ie, is the
cache entry left in a sane state)?

I have developed the attached patch which uses "?" to mark stop words in
the thesaurus file. ("*" was already in use in the file.) I updated
the docs to use "?", which makes the documentation clearer too.

The patch also reenables testing of stop words in the thesuarus file.

FYI, there is no longer a NOTICE for stop words in the thesaurus file;
it throws an error now, and says to use "?" instead.

So this fix requires people to have a different dictionary file from 8.2
to 8.3, and to manually edit the file? That makes upgrade even harder
and more error prone.

Seems easier to do it the way Tom suggested and only emit notices in the
first case.

I notice there's still a placeholder in the docs for how to upgrade.
Perhaps if we wrote that now it would make it clearer where the
difficulties lie?

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#10Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#9)
Re: Fix for stop words in thesaurus file

Simon Riggs wrote:

I have developed the attached patch which uses "?" to mark stop words in
the thesaurus file. ("*" was already in use in the file.) I updated
the docs to use "?", which makes the documentation clearer too.

The patch also reenables testing of stop words in the thesuarus file.

FYI, there is no longer a NOTICE for stop words in the thesaurus file;
it throws an error now, and says to use "?" instead.

So this fix requires people to have a different dictionary file from 8.2
to 8.3, and to manually edit the file? That makes upgrade even harder
and more error prone.

Seems easier to do it the way Tom suggested and only emit notices in the
first case.

There is no easy way to do that so I have not heard that proposed as an
actual solution.

I notice there's still a placeholder in the docs for how to upgrade.
Perhaps if we wrote that now it would make it clearer where the
difficulties lie?

Yes, this would be one of the many changes in the API. Of course, if
you did use a stop word you would now get an error and it would tell you
explicitly to use "?" for stop words, so I don't see a problem here.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#11Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#8)
Re: Fix for stop words in thesaurus file

Bruce Momjian wrote:

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Tom Lane wrote:

One possible real solution would be to tweak the dictionary APIs so
that the dictionaries can find out whether this is the first load during
a session, or a reload, and emit notices only in the first case.

Yea, that would work too. Or just throw an error for a stop word in the
file and then you never get a reload (use "*" instead).

Hm, that's a thought --- it'd be a way to solve the problem without an
API change for dictionaries, which is something to avoid at this late
stage of the 8.3 cycle. Come to think of it, does the ts_cache stuff
work properly when an error is thrown in dictionary load (ie, is the
cache entry left in a sane state)?

I have developed the attached patch which uses "?" to mark stop words in
the thesaurus file. ("*" was already in use in the file.) I updated
the docs to use "?", which makes the documentation clearer too.

The patch also reenables testing of stop words in the thesuarus file.

FYI, there is no longer a NOTICE for stop words in the thesaurus file;
it throws an error now, and says to use "?" instead.

Patch applied. I added documentation on this API change too.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +