[PROPOSAL] Improvements of Hunspell dictionaries support
Hi.
Introduction
============
The PostgreSQL full-text search extension uses dictionaries from various
open source spell checkers to perform word normalization.
Currently, Ispell, MySpell and Hunspell dictionaries are supported.
Each dictionary requires two files: a dictionary file and an affix file.
The dictionary file contains a list of words. Each word may be followed by
one or more affix flags. The affix file contains parameters and the
definitions of the prefix and suffix classes used in the dictionary file.
The most complete and actively developed are the Hunspell dictionaries
(http://hunspell.sourceforge.net/). The OpenOffice and LibreOffice projects
recently switched from MySpell to Hunspell dictionaries.
But PostgreSQL is unable to load recent versions of the Hunspell
dictionaries for several languages, because the affix files of these
dictionaries have grown too big.
Traditionally, each affix rule is named by one extended ASCII (8-bit)
symbol. If there are more than 192 rules, some syntax extension is
needed.
To handle these dictionaries, Hunspell has the FLAG parameter with the
following values:
* FLAG long - sets the double extended ASCII character flag type
* FLAG num - sets the decimal number flag type (from 1 to 65000)
These flag types are used in the affix files of such dictionaries as ar,
br_fr, ca, ca_valencia, da_dk, en_ca, en_gb, en_us, fr, gl_es, is,
ne_np, nl_nl, si_lk (from
http://cgit.freedesktop.org/libreoffice/dictionaries/tree/). But
PostgreSQL does not support the FLAG parameter and cannot load these
dictionaries.
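To make the three flag encodings concrete, here is a minimal sketch of how one flag could be decoded in each mode. The names flag_mode and decode_flag are made up for this illustration and are not taken from PostgreSQL or Hunspell; the sketch also assumes that numeric flags are separated by commas and that the input string is not empty.

```c
#include <stdlib.h>

/* Hypothetical names for this sketch; not part of PostgreSQL or Hunspell. */
typedef enum { FLAG_CHAR, FLAG_LONG, FLAG_NUM } flag_mode;

/*
 * Decode the flag that starts at s (which must not be empty) and return
 * its numeric value. If next is not NULL, *next is set to the position
 * where the following flag starts.
 */
static unsigned
decode_flag(flag_mode mode, const char *s, const char **next)
{
    const unsigned char *u = (const unsigned char *) s;
    unsigned value;
    char *end;

    switch (mode)
    {
        case FLAG_LONG:
            /* two 8-bit characters are combined into one 16-bit value */
            value = ((unsigned) u[0] << 8) | u[1];
            s += (u[1] != '\0') ? 2 : 1;
            break;
        case FLAG_NUM:
            /* decimal number, assumed to be followed by ',' or end of string */
            value = (unsigned) strtoul(s, &end, 10);
            s = (end == s) ? s + 1 : end;   /* always make progress */
            if (*s == ',')
                s++;
            break;
        default:
            /* FLAG_CHAR: one 8-bit character per flag */
            value = u[0];
            s += 1;
            break;
    }

    if (next)
        *next = s;
    return value;
}
```

Reading the bytes as unsigned char keeps characters above 0x7F from being sign-extended when they are shifted into the 16-bit value.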
There is also the AF parameter, which allows affix flag sets to be
replaced with ordinal numbers in affix and dictionary files.
Neither FLAG nor AF is supported by PostgreSQL. Supporting these
parameters would allow the dictionaries listed above to be loaded into a
PostgreSQL database and used in full-text search.
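The AF idea can be sketched as a simple table lookup. The table contents and the names af_table and resolve_alias below are invented for illustration; the sketch assumes aliases are numbered from 1 in the order of the AF lines, with index 0 kept for "no flags".

```c
#include <stdlib.h>

/*
 * Hypothetical alias table: entry i holds the flag set declared by the
 * i-th "AF" line of an affix file. Index 0 is reserved for "no flags".
 */
static const char *const af_table[] = {
    "",     /* index 0: empty flag set */
    "AB",   /* first AF line  -> alias 1 */
    "ACD",  /* second AF line -> alias 2 */
};
static const long af_count = sizeof(af_table) / sizeof(af_table[0]);

/*
 * Resolve the alias number found in a dictionary or affix file to the
 * flag set it stands for. Unknown or malformed aliases map to the empty
 * flag set.
 */
static const char *
resolve_alias(const char *field)
{
    long idx = strtol(field, NULL, 10);

    if (idx > 0 && idx < af_count)
        return af_table[idx];
    return af_table[0];
}
```

With this table, a dictionary entry carrying alias 2 would be treated as if it carried the flags "ACD".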
Proposed Changes
================
The internal representation of a dictionary in PostgreSQL doesn't
impose overly strict limits on the number of affix rules. There is a
flagval array, whose size must be increased from 256 to 65000.
All other changes are in the affix file parsing code, to
properly parse long and numeric flags.
I've already implemented support for FLAG long; it requires a relatively
small patch (60 lines). Support for FLAG num would require a
comparable amount of code.
These changes would allow using recent versions of the Hunspell
dictionaries for the following languages:
br_fr, ca, ca_valencia, da_dk, gl_es, is, ne_np, nl_nl, si_lk.
Implementing the AF parameter would also allow supporting the following
dictionaries:
ar, en_ca, en_gb, en_us, fr, hu_hu.
Expected Results
================
These changes would allow using more recent and complete spelling
dictionaries to perform word stemming during full-text indexing.
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 10/20/15 9:00 AM, Artur Zakirov wrote:
> The internal representation of a dictionary in PostgreSQL doesn't
> impose overly strict limits on the number of affix rules. There is a
> flagval array, whose size must be increased from 256 to 65000.
Is that per dictionary entry, fixed at 64k? That seems pretty excessive,
if that's the case...
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
21.10.2015 01:37, Jim Nasby wrote:
> On 10/20/15 9:00 AM, Artur Zakirov wrote:
>> The internal representation of a dictionary in PostgreSQL doesn't
>> impose overly strict limits on the number of affix rules. There is a
>> flagval array, whose size must be increased from 256 to 65000.
> Is that per dictionary entry, fixed at 64k? That seems pretty excessive,
> if that's the case...
It is per dictionary only. The flagval array is shared by the whole
dictionary; it is not allocated for every dictionary word.
There is also the flag field of the AFFIX structure, whose size must be
increased from 8 bits to 16 bits. This structure is used for every affix
in the affix file.
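The cost can be illustrated with simplified stand-ins for the two structures. dict_sketch and affix_sketch below are hypothetical and contain only the fields relevant here; the real IspellDict and AFFIX structs have more members.

```c
#include <stddef.h>

#define FLAG_TABLE_SIZE 65000   /* size proposed in this thread */

/* Hypothetical stand-in for the dictionary struct: one table per dictionary. */
typedef struct
{
    unsigned char flagval[FLAG_TABLE_SIZE];
} dict_sketch;

/* Hypothetical stand-in for the per-affix struct with the widened field. */
typedef struct
{
    unsigned flag:16,       /* was 8 bits wide */
             type:1,
             flagflags:7;
} affix_sketch;

/* Fixed per-dictionary cost of the flag table, independent of word count. */
static size_t
flag_table_bytes(void)
{
    return sizeof(((dict_sketch *) 0)->flagval);
}

/* Largest flag value the 16-bit field can hold. */
static unsigned
max_affix_flag(void)
{
    affix_sketch a = {0};

    a.flag = 0xFFFF;
    return a.flag;
}
```

So the table costs about 64 kB once per loaded dictionary, and a 16-bit field is wide enough for every value that FLAG num can produce (up to 65000).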
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
20.10.2015 17:00, Artur Zakirov wrote:
> These flag types are used in affix files of such dictionaries as ar,
> br_fr, ca, ca_valencia, da_dk, en_ca, en_gb, en_us, fr, gl_es, is,
> ne_np, nl_nl, si_lk (from
> http://cgit.freedesktop.org/libreoffice/dictionaries/tree/).
Now almost all of these dictionaries can be loaded into PostgreSQL. But the
da_dk dictionary does not load. I see the following error:
ERROR: invalid regular expression: quantifier operand invalid
CONTEXT: line 439 of configuration file
"/home/artur/progs/pgsql/share/tsearch_data/da_dk.affix": "SFX 55 0 s
+GENITIV
If you open the affix file in an editor, you can see that affix 55 on
line 439 has an incorrect format (screen1.png):
SFX 55 0 s +GENITIV
An SFX line should have 5 fields. There is no field between the "0"
digit and the "s" symbol. "+GENITIV" is an optional morphological field
and is ignored by PostgreSQL.
I think that this is an error in the affix file. I wrote an e-mail about
this error to the dictionary authors at info@stavekontrolden.dk.
What is the right decision in this case? Should PostgreSQL ignore this
error and not report it?
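A small sketch of positional scanning shows one plausible reading of why a missing field ends in a regular expression error. The struct affix_line and the function scan_affix_line are hypothetical, and the field widths are chosen only for this example; the point is that whitespace-delimited scanning cannot notice a missing field as long as five tokens are present.

```c
#include <stdio.h>

/* Hypothetical holder for the whitespace-separated tokens of one rule line. */
typedef struct
{
    char type[8];
    char flag[32];
    char strip[64];
    char add[64];
    char cond[64];
} affix_line;

/*
 * Scan one SFX/PFX rule line positionally. Returns the number of tokens
 * that were filled in (0..5), or EOF for an empty line.
 */
static int
scan_affix_line(const char *line, affix_line *out)
{
    out->type[0] = out->flag[0] = out->strip[0] = '\0';
    out->add[0] = out->cond[0] = '\0';

    return sscanf(line, "%7s %31s %63s %63s %63s",
                  out->type, out->flag, out->strip, out->add, out->cond);
}
```

For the line "SFX 55 0 s +GENITIV" this still yields five tokens, with "+GENITIV" in the slot where a condition pattern is expected. A pattern that begins with "+" has nothing to quantify, which would fit the "quantifier operand invalid" message.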
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachments:
Hello again!
Patches
=======
I have implemented support for the FLAG long, FLAG num and AF parameters.
The patch is attached to this e-mail (hunspell-dict.patch).
This patch allows using the Hunspell dictionaries listed in the previous
e-mail: ar, br_fr, ca, ca_valencia, en_ca, en_gb, en_us, fr, gl_es,
hu_hu, is, ne_np, nl_nl, si_lk.
Most of the changes are in spell.c, in the affix file parsing code.
The dictionary structures changed as follows:
- the useFlagAliases and flagMode fields have been added to the IspellDict
struct;
- the flagval array size has been increased from 256 to 65000;
- the flag field of the AFFIX struct has also been widened.
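One consequence of multi-character flags is that a flag set can no longer be searched with a plain character search such as strchr; it has to be walked flag by flag. The following is a minimal sketch for the numeric mode only, assuming a comma-separated list. The function name num_flag_in_set is hypothetical and is not the one used in the patch.

```c
#include <stdbool.h>
#include <stdlib.h>

/*
 * Return true if the comma-separated list of decimal flags in "set"
 * contains the flag "wanted". Each number is compared as a whole, so
 * looking for 34 does not match inside "345".
 */
static bool
num_flag_in_set(const char *set, unsigned long wanted)
{
    const char *p = set;

    while (*p != '\0')
    {
        char *end;
        unsigned long value = strtoul(p, &end, 10);

        if (end == p)
        {
            /* not a digit (separator or stray character): skip it */
            p++;
            continue;
        }
        if (value == wanted)
            return true;
        p = end;
    }

    return false;
}
```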
I have also implemented a patch that fixes the error from the e-mail
/messages/by-id/562E1073.8030805@postgrespro.ru
This patch just ignores that error.
Tests
=====
A test extension with dictionaries for loading into PostgreSQL and for
normalizing with the ts_lexize function can be downloaded from
https://dl.dropboxusercontent.com/u/15423817/HunspellDictTest.tar.gz
It would be nice if somebody could do additional testing with dictionaries
of well-known languages, because I do not know many of them.
Other Improvements
==================
There are also some parameters for compound words, but I am not sure
that we want to use these parameters.
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachments:
hunspell-dict.patch (text/x-patch)
*** a/src/backend/tsearch/spell.c
--- b/src/backend/tsearch/spell.c
***************
*** 237,242 **** cmpaffix(const void *s1, const void *s2)
--- 237,309 ----
(const unsigned char *) a2->repl);
}
+ static unsigned short
+ decodeFlag(IspellDict *Conf, char *sflag, char **sflagnext)
+ {
+ unsigned short s;
+ char *next;
+
+ switch (Conf->flagMode)
+ {
+ case FM_LONG:
+ s = (int)sflag[0] << 8 | (int)sflag[1];
+ if (sflagnext)
+ *sflagnext = sflag + 2;
+ break;
+ case FM_NUM:
+ s = (unsigned short) strtol(sflag, &next, 10);
+ if (sflagnext)
+ {
+ if (next)
+ {
+ *sflagnext = next;
+ while (**sflagnext)
+ {
+ if (**sflagnext == ',')
+ {
+ *sflagnext = *sflagnext + 1;
+ break;
+ }
+ *sflagnext = *sflagnext + 1;
+ }
+ }
+ else
+ *sflagnext = 0;
+ }
+ break;
+ default:
+ s = (unsigned short) *((unsigned char *)sflag);
+ if (sflagnext)
+ *sflagnext = sflag + 1;
+ }
+
+ return s;
+ }
+
+ static bool
+ isAffixFlagInUse(IspellDict *Conf, int affix, unsigned short affixflag)
+ {
+ char *flagcur;
+ char *flagnext = 0;
+
+ if (affixflag == 0)
+ return true;
+
+ flagcur = Conf->AffixData[affix];
+
+ while (*flagcur)
+ {
+ if (decodeFlag(Conf, flagcur, &flagnext) == affixflag)
+ return true;
+ if (flagnext)
+ flagcur = flagnext;
+ else
+ break;
+ }
+
+ return false;
+ }
+
static void
NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
{
***************
*** 355,361 **** FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
else if ((flag & StopMiddle->compoundflag) == 0)
return 0;
! if ((affixflag == 0) || (strchr(Conf->AffixData[StopMiddle->affix], affixflag) != NULL))
return 1;
}
node = StopMiddle->node;
--- 422,428 ----
else if ((flag & StopMiddle->compoundflag) == 0)
return 0;
! if (isAffixFlagInUse(Conf, StopMiddle->affix, affixflag))
return 1;
}
node = StopMiddle->node;
***************
*** 394,400 **** NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
Affix = Conf->Affix + Conf->naffixes;
! if (strcmp(mask, ".") == 0)
{
Affix->issimple = 1;
Affix->isregis = 0;
--- 461,467 ----
Affix = Conf->Affix + Conf->naffixes;
! if (strcmp(mask, ".") == 0 || *mask == '\0')
{
Affix->issimple = 1;
Affix->isregis = 0;
***************
*** 595,604 **** addFlagValue(IspellDict *Conf, char *s, uint32 val)
(errcode(ERRCODE_CONFIG_FILE_ERROR),
errmsg("multibyte flag character is not allowed")));
! Conf->flagval[*(unsigned char *) s] = (unsigned char) val;
Conf->usecompound = true;
}
/*
* Import an affix file that follows MySpell or Hunspell format
*/
--- 662,719 ----
(errcode(ERRCODE_CONFIG_FILE_ERROR),
errmsg("multibyte flag character is not allowed")));
! Conf->flagval[decodeFlag(Conf, s, (char **)NULL)] = (unsigned char) val;
Conf->usecompound = true;
}
+ static int
+ getFlagValues(IspellDict *Conf, char *s)
+ {
+ uint32 flag = 0;
+ char *flagcur;
+ char *flagnext = 0;
+
+ flagcur = s;
+ while (*flagcur)
+ {
+ flag |= Conf->flagval[decodeFlag(Conf, flagcur, &flagnext)];
+ if (flagnext)
+ flagcur = flagnext;
+ else
+ break;
+ }
+
+ return flag;
+ }
+
+ /*
+ * Get flag set from "s".
+ *
+ * Returns flag set from AffixData array if AF parameter used (useFlagAliases is true).
+ * In this case "s" is alias for flag set.
+ *
+ * Otherwise returns "s".
+ */
+ static char *
+ getFlags(IspellDict *Conf, char *s)
+ {
+ int curaffix;
+ if (Conf->useFlagAliases)
+ {
+ curaffix = strtol(s, (char **)NULL, 10);
+ if (curaffix && curaffix <= Conf->nAffixData)
+ /*
+ * Do not substract 1 from curaffix
+ * because empty string was added in NIImportOOAffixes
+ */
+ return Conf->AffixData[curaffix];
+ else
+ return VoidString;
+ }
+ else
+ return s;
+ }
+
/*
* Import an affix file that follows MySpell or Hunspell format
*/
***************
*** 615,621 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
char repl[BUFSIZ],
*prepl;
bool isSuffix = false;
! int flag = 0;
char flagflags = 0;
tsearch_readline_state trst;
int scanread = 0;
--- 730,740 ----
char repl[BUFSIZ],
*prepl;
bool isSuffix = false;
! int naffix = 0,
! curaffix = 0;
! int flag = 0,
! flagprev = 0,
! sflaglen = 0;
char flagflags = 0;
tsearch_readline_state trst;
int scanread = 0;
***************
*** 625,630 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
--- 744,751 ----
/* read file to find any flag */
memset(Conf->flagval, 0, sizeof(Conf->flagval));
Conf->usecompound = false;
+ Conf->useFlagAliases = false;
+ Conf->flagMode = FM_CHAR;
if (!tsearch_readline_begin(&trst, filename))
ereport(ERROR,
***************
*** 672,681 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
while (*s && t_isspace(s))
s += pg_mblen(s);
! if (*s && STRNCMP(s, "default") != 0)
! ereport(ERROR,
(errcode(ERRCODE_CONFIG_FILE_ERROR),
! errmsg("Ispell dictionary supports only default flag value")));
}
pfree(recoded);
--- 793,809 ----
while (*s && t_isspace(s))
s += pg_mblen(s);
! if (*s)
! {
! if (STRNCMP(s, "long") == 0)
! Conf->flagMode = FM_LONG;
! else if (STRNCMP(s, "num") == 0)
! Conf->flagMode = FM_NUM;
! else if (STRNCMP(s, "default") != 0)
! ereport(ERROR,
(errcode(ERRCODE_CONFIG_FILE_ERROR),
! errmsg("Ispell dictionary supports only default, long and num flag value")));
! }
}
pfree(recoded);
***************
*** 695,725 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
if (*recoded == '\0' || t_isspace(recoded) || t_iseq(recoded, '#'))
goto nextline;
scanread = sscanf(recoded, scanbuf, type, sflag, find, repl, mask);
if (ptype)
pfree(ptype);
ptype = lowerstr_ctx(Conf, type);
if (scanread < 4 || (STRNCMP(ptype, "sfx") && STRNCMP(ptype, "pfx")))
goto nextline;
! if (scanread == 4)
{
! if (strlen(sflag) != 1)
! goto nextline;
! flag = *sflag;
isSuffix = (STRNCMP(ptype, "sfx") == 0) ? true : false;
if (t_iseq(find, 'y') || t_iseq(find, 'Y'))
flagflags = FF_CROSSPRODUCT;
else
flagflags = 0;
}
else
{
char *ptr;
int aflg = 0;
! if (strlen(sflag) != 1 || flag != *sflag || flag == 0)
goto nextline;
prepl = lowerstr_ctx(Conf, repl);
/* affix flag */
--- 823,897 ----
if (*recoded == '\0' || t_isspace(recoded) || t_iseq(recoded, '#'))
goto nextline;
+ *find = *repl = *mask = '\0';
scanread = sscanf(recoded, scanbuf, type, sflag, find, repl, mask);
if (ptype)
pfree(ptype);
ptype = lowerstr_ctx(Conf, type);
+
+ /* First try to parse AF parameter (alias compression) */
+ if (STRNCMP(ptype, "af") == 0)
+ {
+ /* First line is the number of aliases */
+ if (!Conf->useFlagAliases)
+ {
+ Conf->useFlagAliases = true;
+ naffix = atoi(sflag);
+ if (naffix == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONFIG_FILE_ERROR),
+ errmsg("invalid number of flag vector aliases")));
+
+ /* Also reserve place for empty flag set */
+ naffix++;
+
+ Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
+ Conf->lenAffixData = Conf->nAffixData = naffix;
+
+ /* Add empty flag set into AffixData */
+ Conf->AffixData[curaffix] = VoidString;
+ curaffix++;
+ }
+ /* Other lines is aliases */
+ else
+ {
+ if (curaffix < naffix)
+ {
+ Conf->AffixData[curaffix] = cpstrdup(Conf, sflag);
+ curaffix++;
+ }
+ }
+ goto nextline;
+ }
+ /* Else try to parse prefixes and suffixes */
if (scanread < 4 || (STRNCMP(ptype, "sfx") && STRNCMP(ptype, "pfx")))
goto nextline;
! sflaglen = strlen(sflag);
! if (sflaglen == 0
! || (sflaglen > 1 && Conf->flagMode == FM_CHAR)
! || (sflaglen > 2 && Conf->flagMode == FM_LONG))
! goto nextline;
! flag = decodeFlag(Conf, sflag, (char **)NULL);
!
! /* Affix header */
! if (flag != flagprev)
{
! flagprev = flag;
isSuffix = (STRNCMP(ptype, "sfx") == 0) ? true : false;
if (t_iseq(find, 'y') || t_iseq(find, 'Y'))
flagflags = FF_CROSSPRODUCT;
else
flagflags = 0;
}
+ /* Affix fields */
else
{
char *ptr;
int aflg = 0;
! if (flag == 0)
goto nextline;
prepl = lowerstr_ctx(Conf, repl);
/* affix flag */
***************
*** 727,737 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
{
*ptr = '\0';
ptr = repl + (ptr - prepl) + 1;
! while (*ptr)
! {
! aflg |= Conf->flagval[*(unsigned char *) ptr];
! ptr++;
! }
}
pfind = lowerstr_ctx(Conf, find);
pmask = lowerstr_ctx(Conf, mask);
--- 899,905 ----
{
*ptr = '\0';
ptr = repl + (ptr - prepl) + 1;
! aflg |= getFlagValues(Conf, getFlags(Conf, ptr));
}
pfind = lowerstr_ctx(Conf, find);
pmask = lowerstr_ctx(Conf, mask);
***************
*** 789,794 **** NIImportAffixes(IspellDict *Conf, const char *filename)
--- 957,964 ----
memset(Conf->flagval, 0, sizeof(Conf->flagval));
Conf->usecompound = false;
+ Conf->useFlagAliases = false;
+ Conf->flagMode = FM_CHAR;
while ((recoded = tsearch_readline(&trst)) != NULL)
{
***************
*** 931,946 **** MergeAffix(IspellDict *Conf, int a1, int a2)
static uint32
makeCompoundFlags(IspellDict *Conf, int affix)
{
! uint32 flag = 0;
! char *str = Conf->AffixData[affix];
!
! while (str && *str)
! {
! flag |= Conf->flagval[*(unsigned char *) str];
! str++;
! }
!
! return (flag & FF_DICTFLAGMASK);
}
static SPNode *
--- 1101,1108 ----
static uint32
makeCompoundFlags(IspellDict *Conf, int affix)
{
! char *str = Conf->AffixData[affix];
! return (getFlagValues(Conf, str) & FF_DICTFLAGMASK);
}
static SPNode *
***************
*** 1032,1071 **** NISortDictionary(IspellDict *Conf)
/* compress affixes */
! /* Count the number of different flags used in the dictionary */
!
! qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspellaffix);
!
! naffix = 0;
! for (i = 0; i < Conf->nspell; i++)
{
! if (i == 0 || strncmp(Conf->Spell[i]->p.flag, Conf->Spell[i - 1]->p.flag, MAXFLAGLEN))
! naffix++;
}
! /*
! * Fill in Conf->AffixData with the affixes that were used in the
! * dictionary. Replace textual flag-field of Conf->Spell entries with
! * indexes into Conf->AffixData array.
! */
! Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
! curaffix = -1;
! for (i = 0; i < Conf->nspell; i++)
! {
! if (i == 0 || strncmp(Conf->Spell[i]->p.flag, Conf->AffixData[curaffix], MAXFLAGLEN))
{
! curaffix++;
! Assert(curaffix < naffix);
! Conf->AffixData[curaffix] = cpstrdup(Conf, Conf->Spell[i]->p.flag);
}
! Conf->Spell[i]->p.d.affix = curaffix;
! Conf->Spell[i]->p.d.len = strlen(Conf->Spell[i]->word);
}
- Conf->lenAffixData = Conf->nAffixData = naffix;
-
qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspell);
Conf->Dictionary = mkSPNode(Conf, 0, Conf->nspell, 0);
}
--- 1194,1249 ----
/* compress affixes */
! /* If we use flag aliases then we need to use Conf->AffixData filled in NIImportOOAffixes */
! if (Conf->useFlagAliases)
{
! for (i = 0; i < Conf->nspell; i++)
! {
! curaffix = strtol(Conf->Spell[i]->p.flag, (char **)NULL, 10);
! if (curaffix && curaffix <= Conf->nAffixData)
! Conf->Spell[i]->p.d.affix = curaffix;
! else
! Conf->Spell[i]->p.d.affix = 0;
! Conf->Spell[i]->p.d.len = strlen(Conf->Spell[i]->word);
! }
}
+ /* Otherwise fill Conf->AffixData here */
+ else
+ {
+ /* Count the number of different flags used in the dictionary */
+ qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspellaffix);
! naffix = 0;
! for (i = 0; i < Conf->nspell; i++)
! {
! if (i == 0 || strncmp(Conf->Spell[i]->p.flag, Conf->Spell[i - 1]->p.flag, MAXFLAGLEN))
! naffix++;
! }
! /*
! * Fill in Conf->AffixData with the affixes that were used in the
! * dictionary. Replace textual flag-field of Conf->Spell entries with
! * indexes into Conf->AffixData array.
! */
! Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
!
! curaffix = -1;
! for (i = 0; i < Conf->nspell; i++)
{
! if (i == 0 || strncmp(Conf->Spell[i]->p.flag, Conf->AffixData[curaffix], MAXFLAGLEN))
! {
! curaffix++;
! Assert(curaffix < naffix);
! Conf->AffixData[curaffix] = cpstrdup(Conf, Conf->Spell[i]->p.flag);
! }
!
! Conf->Spell[i]->p.d.affix = curaffix;
! Conf->Spell[i]->p.d.len = strlen(Conf->Spell[i]->word);
}
! Conf->lenAffixData = Conf->nAffixData = naffix;
}
qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspell);
Conf->Dictionary = mkSPNode(Conf, 0, Conf->nspell, 0);
}
***************
*** 1185,1196 **** mkVoidAffix(IspellDict *Conf, bool issuffix, int startsuffix)
}
static bool
! isAffixInUse(IspellDict *Conf, char flag)
{
int i;
for (i = 0; i < Conf->nAffixData; i++)
! if (strchr(Conf->AffixData[i], flag) != NULL)
return true;
return false;
--- 1363,1374 ----
}
static bool
! isAffixInUse(IspellDict *Conf, int flag)
{
int i;
for (i = 0; i < Conf->nAffixData; i++)
! if (isAffixFlagInUse(Conf, i, flag))
return true;
return false;
***************
*** 1219,1225 **** NISortAffixes(IspellDict *Conf)
firstsuffix = i;
if ((Affix->flagflags & FF_COMPOUNDFLAG) && Affix->replen > 0 &&
! isAffixInUse(Conf, (char) Affix->flag))
{
if (ptr == Conf->CompoundAffix ||
ptr->issuffix != (ptr - 1)->issuffix ||
--- 1397,1403 ----
firstsuffix = i;
if ((Affix->flagflags & FF_COMPOUNDFLAG) && Affix->replen > 0 &&
! isAffixInUse(Conf, Affix->flag))
{
if (ptr == Conf->CompoundAffix ||
ptr->issuffix != (ptr - 1)->issuffix ||
***************
*** 1685,1691 **** SplitToVariants(IspellDict *Conf, SPNode *snode, SplitVar *orig, char *word, int
if (StopLow < StopHigh)
{
! if (level == FF_COMPOUNDBEGIN)
compoundflag = FF_COMPOUNDBEGIN;
else if (level == wordlen - 1)
compoundflag = FF_COMPOUNDLAST;
--- 1863,1869 ----
if (StopLow < StopHigh)
{
! if (startpos == 0)
compoundflag = FF_COMPOUNDBEGIN;
else if (level == wordlen - 1)
compoundflag = FF_COMPOUNDLAST;
*** a/src/include/tsearch/dicts/spell.h
--- b/src/include/tsearch/dicts/spell.h
***************
*** 77,83 **** typedef struct spell_struct
typedef struct aff_struct
{
! uint32 flag:8,
type:1,
flagflags:7,
issimple:1,
--- 77,83 ----
typedef struct aff_struct
{
! uint32 flag:16,
type:1,
flagflags:7,
issimple:1,
***************
*** 132,137 **** typedef struct
--- 132,144 ----
bool issuffix;
} CMPDAffix;
+ typedef enum
+ {
+ FM_CHAR,
+ FM_LONG,
+ FM_NUM
+ } FlagMode;
+
typedef struct
{
int maffixes;
***************
*** 145,155 **** typedef struct
char **AffixData;
int lenAffixData;
int nAffixData;
CMPDAffix *CompoundAffix;
! unsigned char flagval[256];
bool usecompound;
/*
* Remaining fields are only used during dictionary construction; they are
--- 152,164 ----
char **AffixData;
int lenAffixData;
int nAffixData;
+ bool useFlagAliases;
CMPDAffix *CompoundAffix;
! unsigned char flagval[65000];
bool usecompound;
+ FlagMode flagMode;
/*
* Remaining fields are only used during dictionary construction; they are
hunspell-dict-da_dk.patch (text/x-patch)
*** a/src/backend/tsearch/spell.c
--- b/src/backend/tsearch/spell.c
***************
*** 429,443 **** NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen,
REG_ADVANCED | REG_NOSUB,
DEFAULT_COLLATION_OID);
if (err)
! {
! char errstr[100];
!
! pg_regerror(err, &(Affix->reg.regex), errstr, sizeof(errstr));
! ereport(ERROR,
! (errcode(ERRCODE_INVALID_REGULAR_EXPRESSION),
! errmsg("invalid regular expression: %s", errstr)));
! }
}
Affix->flagflags = flagflags;
--- 429,437 ----
err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen,
REG_ADVANCED | REG_NOSUB,
DEFAULT_COLLATION_OID);
+ /* Ignore regular expression error and do not add wrong affix */
if (err)
! return;
}
Affix->flagflags = flagflags;
06.11.2015 12:33, Artur Zakirov wrote:
> Hello again!
> Patches
> =======
Link to commitfest:
https://commitfest.postgresql.org/8/420/
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Thank you for working on this.
I tried the patch with a Turkish dictionary [1] I could find on the
Internet. It worked for some words, but not others:
hasegeli=# create text search dictionary hunspell_tr (template = ispell, dictfile = tr, afffile = tr);
CREATE TEXT SEARCH DICTIONARY
hasegeli=# select ts_lexize('hunspell_tr', 'tilki'); -- The root "fox"
 ts_lexize
-----------
 {tilki}
(1 row)
hasegeli=# select ts_lexize('hunspell_tr', 'tilkinin'); -- Genitive form, affix 3290
 ts_lexize
-----------
 {tilki}
(1 row)
hasegeli=# select ts_lexize('hunspell_tr', 'tilkiler'); -- Plural form, affix 4371
 ts_lexize
-----------
 {tilki}
(1 row)
hasegeli=# select ts_lexize('hunspell_tr', 'tilkiyi'); -- Accusative form, affix 2646
 ts_lexize
-----------

(1 row)
It seems to have something to do with the order of the affixes. It
works, if I move affix 2646 to the beginning of the list.
[1]: https://tr-spell.googlecode.com/files/dict_aff_5000_suffix_1130000_words.zip
On 07.11.2015 17:20, Emre Hasegeli wrote:
> It seems to have something to do with the order of the affixes. It
> works, if I move affix 2646 to the beginning of the list.
> [1] https://tr-spell.googlecode.com/files/dict_aff_5000_suffix_1130000_words.zip
Thank you for the reply.
This was caused by the size of the flag field in the SPELL struct: long
flags from the .dict file were being truncated.
I have attached a new patch. It is a temporary patch, not the final one.
It can be done better.
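The truncation can be reproduced in isolation. The sketch below compares a fixed-size copy with a heap copy of a flag string; the buffer size SKETCH_MAXFLAGLEN, the sample flag string and both function names are assumptions made for this illustration, not values taken from the PostgreSQL source.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_MAXFLAGLEN 16    /* assumed buffer size, for illustration only */

/*
 * Fixed-size copy: everything beyond SKETCH_MAXFLAGLEN - 1 bytes is
 * silently dropped.
 */
static void
copy_flags_fixed(char dst[SKETCH_MAXFLAGLEN], const char *flags)
{
    snprintf(dst, SKETCH_MAXFLAGLEN, "%s", flags);
}

/*
 * Heap copy: keeps the whole flag string. The caller frees the result.
 * Returns NULL if memory cannot be allocated.
 */
static char *
copy_flags_dynamic(const char *flags)
{
    size_t len = strlen(flags) + 1;
    char *copy = malloc(len);

    if (copy != NULL)
        memcpy(copy, flags, len);
    return copy;
}
```

With a flag string such as "3290,4371,1001,2646", the fixed-size copy keeps only the first 15 bytes and loses 2646, while a flag near the start of the list survives. That would fit the observation that moving affix 2646 to the beginning of the list makes the lookup work.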
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachments:
hunspell-dict-tr.patch (text/x-patch)
*** a/src/backend/tsearch/spell.c
--- b/src/backend/tsearch/spell.c
***************
*** 153,159 **** cmpspell(const void *s1, const void *s2)
static int
cmpspellaffix(const void *s1, const void *s2)
{
! return (strncmp((*(SPELL *const *) s1)->p.flag, (*(SPELL *const *) s2)->p.flag, MAXFLAGLEN));
}
static char *
--- 153,159 ----
static int
cmpspellaffix(const void *s1, const void *s2)
{
! return (strcmp((*(SPELL *const *) s1)->flag, (*(SPELL *const *) s2)->flag));
}
static char *
***************
*** 237,242 **** cmpaffix(const void *s1, const void *s2)
--- 237,309 ----
(const unsigned char *) a2->repl);
}
+ static unsigned short
+ decodeFlag(IspellDict *Conf, char *sflag, char **sflagnext)
+ {
+ unsigned short s;
+ char *next;
+
+ switch (Conf->flagMode)
+ {
+ case FM_LONG:
+ s = (int)sflag[0] << 8 | (int)sflag[1];
+ if (sflagnext)
+ *sflagnext = sflag + 2;
+ break;
+ case FM_NUM:
+ s = (unsigned short) strtol(sflag, &next, 10);
+ if (sflagnext)
+ {
+ if (next)
+ {
+ *sflagnext = next;
+ while (**sflagnext)
+ {
+ if (**sflagnext == ',')
+ {
+ *sflagnext = *sflagnext + 1;
+ break;
+ }
+ *sflagnext = *sflagnext + 1;
+ }
+ }
+ else
+ *sflagnext = 0;
+ }
+ break;
+ default:
+ s = (unsigned short) *((unsigned char *)sflag);
+ if (sflagnext)
+ *sflagnext = sflag + 1;
+ }
+
+ return s;
+ }
+
+ static bool
+ isAffixFlagInUse(IspellDict *Conf, int affix, unsigned short affixflag)
+ {
+ char *flagcur;
+ char *flagnext = 0;
+
+ if (affixflag == 0)
+ return true;
+
+ flagcur = Conf->AffixData[affix];
+
+ while (*flagcur)
+ {
+ if (decodeFlag(Conf, flagcur, &flagnext) == affixflag)
+ return true;
+ if (flagnext)
+ flagcur = flagnext;
+ else
+ break;
+ }
+
+ return false;
+ }
+
static void
NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
{
***************
*** 255,261 **** NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
}
Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
strcpy(Conf->Spell[Conf->nspell]->word, word);
! strlcpy(Conf->Spell[Conf->nspell]->p.flag, flag, MAXFLAGLEN);
Conf->nspell++;
}
--- 322,328 ----
}
Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
strcpy(Conf->Spell[Conf->nspell]->word, word);
! Conf->Spell[Conf->nspell]->flag = cpstrdup(Conf, flag);
Conf->nspell++;
}
***************
*** 355,361 **** FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
else if ((flag & StopMiddle->compoundflag) == 0)
return 0;
! if ((affixflag == 0) || (strchr(Conf->AffixData[StopMiddle->affix], affixflag) != NULL))
return 1;
}
node = StopMiddle->node;
--- 422,428 ----
else if ((flag & StopMiddle->compoundflag) == 0)
return 0;
! if (isAffixFlagInUse(Conf, StopMiddle->affix, affixflag))
return 1;
}
node = StopMiddle->node;
***************
*** 394,400 **** NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
Affix = Conf->Affix + Conf->naffixes;
! if (strcmp(mask, ".") == 0)
{
Affix->issimple = 1;
Affix->isregis = 0;
--- 461,467 ----
Affix = Conf->Affix + Conf->naffixes;
! if (strcmp(mask, ".") == 0 || *mask == '\0')
{
Affix->issimple = 1;
Affix->isregis = 0;
***************
*** 429,443 **** NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen,
REG_ADVANCED | REG_NOSUB,
DEFAULT_COLLATION_OID);
if (err)
! {
! char errstr[100];
!
! pg_regerror(err, &(Affix->reg.regex), errstr, sizeof(errstr));
! ereport(ERROR,
! (errcode(ERRCODE_INVALID_REGULAR_EXPRESSION),
! errmsg("invalid regular expression: %s", errstr)));
! }
}
Affix->flagflags = flagflags;
--- 496,504 ----
err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen,
REG_ADVANCED | REG_NOSUB,
DEFAULT_COLLATION_OID);
+ /* Ignore regular expression error and do not add wrong affix */
if (err)
! return;
}
Affix->flagflags = flagflags;
***************
*** 595,604 **** addFlagValue(IspellDict *Conf, char *s, uint32 val)
(errcode(ERRCODE_CONFIG_FILE_ERROR),
errmsg("multibyte flag character is not allowed")));
! Conf->flagval[*(unsigned char *) s] = (unsigned char) val;
Conf->usecompound = true;
}
/*
* Import an affix file that follows MySpell or Hunspell format
*/
--- 656,713 ----
(errcode(ERRCODE_CONFIG_FILE_ERROR),
errmsg("multibyte flag character is not allowed")));
! Conf->flagval[decodeFlag(Conf, s, (char **)NULL)] = (unsigned char) val;
Conf->usecompound = true;
}
+ static int
+ getFlagValues(IspellDict *Conf, char *s)
+ {
+ uint32 flag = 0;
+ char *flagcur;
+ char *flagnext = 0;
+
+ flagcur = s;
+ while (*flagcur)
+ {
+ flag |= Conf->flagval[decodeFlag(Conf, flagcur, &flagnext)];
+ if (flagnext)
+ flagcur = flagnext;
+ else
+ break;
+ }
+
+ return flag;
+ }
+
+ /*
+ * Get flag set from "s".
+ *
+ * Returns flag set from AffixData array if AF parameter used (useFlagAliases is true).
+ * In this case "s" is alias for flag set.
+ *
+ * Otherwise returns "s".
+ */
+ static char *
+ getFlags(IspellDict *Conf, char *s)
+ {
+ int curaffix;
+ if (Conf->useFlagAliases)
+ {
+ curaffix = strtol(s, (char **)NULL, 10);
+ if (curaffix && curaffix <= Conf->nAffixData)
+ /*
+ * Do not substract 1 from curaffix
+ * because empty string was added in NIImportOOAffixes
+ */
+ return Conf->AffixData[curaffix];
+ else
+ return VoidString;
+ }
+ else
+ return s;
+ }
+
/*
* Import an affix file that follows MySpell or Hunspell format
*/
***************
*** 615,621 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
char repl[BUFSIZ],
*prepl;
bool isSuffix = false;
! int flag = 0;
char flagflags = 0;
tsearch_readline_state trst;
int scanread = 0;
--- 724,734 ----
char repl[BUFSIZ],
*prepl;
bool isSuffix = false;
! int naffix = 0,
! curaffix = 0;
! int flag = 0,
! flagprev = 0,
! sflaglen = 0;
char flagflags = 0;
tsearch_readline_state trst;
int scanread = 0;
***************
*** 625,630 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
--- 738,745 ----
/* read file to find any flag */
memset(Conf->flagval, 0, sizeof(Conf->flagval));
Conf->usecompound = false;
+ Conf->useFlagAliases = false;
+ Conf->flagMode = FM_CHAR;
if (!tsearch_readline_begin(&trst, filename))
ereport(ERROR,
***************
*** 672,681 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
while (*s && t_isspace(s))
s += pg_mblen(s);
! if (*s && STRNCMP(s, "default") != 0)
! ereport(ERROR,
(errcode(ERRCODE_CONFIG_FILE_ERROR),
! errmsg("Ispell dictionary supports only default flag value")));
}
pfree(recoded);
--- 787,803 ----
while (*s && t_isspace(s))
s += pg_mblen(s);
! if (*s)
! {
! if (STRNCMP(s, "long") == 0)
! Conf->flagMode = FM_LONG;
! else if (STRNCMP(s, "num") == 0)
! Conf->flagMode = FM_NUM;
! else if (STRNCMP(s, "default") != 0)
! ereport(ERROR,
(errcode(ERRCODE_CONFIG_FILE_ERROR),
! errmsg("Ispell dictionary supports only default, long and num flag value")));
! }
}
pfree(recoded);
***************
*** 695,725 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
if (*recoded == '\0' || t_isspace(recoded) || t_iseq(recoded, '#'))
goto nextline;
scanread = sscanf(recoded, scanbuf, type, sflag, find, repl, mask);
if (ptype)
pfree(ptype);
ptype = lowerstr_ctx(Conf, type);
if (scanread < 4 || (STRNCMP(ptype, "sfx") && STRNCMP(ptype, "pfx")))
goto nextline;
! if (scanread == 4)
{
! if (strlen(sflag) != 1)
! goto nextline;
! flag = *sflag;
isSuffix = (STRNCMP(ptype, "sfx") == 0) ? true : false;
if (t_iseq(find, 'y') || t_iseq(find, 'Y'))
flagflags = FF_CROSSPRODUCT;
else
flagflags = 0;
}
else
{
char *ptr;
int aflg = 0;
! if (strlen(sflag) != 1 || flag != *sflag || flag == 0)
goto nextline;
prepl = lowerstr_ctx(Conf, repl);
/* affix flag */
--- 817,891 ----
if (*recoded == '\0' || t_isspace(recoded) || t_iseq(recoded, '#'))
goto nextline;
+ *find = *repl = *mask = '\0';
scanread = sscanf(recoded, scanbuf, type, sflag, find, repl, mask);
if (ptype)
pfree(ptype);
ptype = lowerstr_ctx(Conf, type);
+
+ /* First try to parse AF parameter (alias compression) */
+ if (STRNCMP(ptype, "af") == 0)
+ {
+ /* First line is the number of aliases */
+ if (!Conf->useFlagAliases)
+ {
+ Conf->useFlagAliases = true;
+ naffix = atoi(sflag);
+ if (naffix == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONFIG_FILE_ERROR),
+ errmsg("invalid number of flag vector aliases")));
+
+ /* Also reserve place for empty flag set */
+ naffix++;
+
+ Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
+ Conf->lenAffixData = Conf->nAffixData = naffix;
+
+ /* Add empty flag set into AffixData */
+ Conf->AffixData[curaffix] = VoidString;
+ curaffix++;
+ }
+ /* Other lines is aliases */
+ else
+ {
+ if (curaffix < naffix)
+ {
+ Conf->AffixData[curaffix] = cpstrdup(Conf, sflag);
+ curaffix++;
+ }
+ }
+ goto nextline;
+ }
+ /* Else try to parse prefixes and suffixes */
if (scanread < 4 || (STRNCMP(ptype, "sfx") && STRNCMP(ptype, "pfx")))
goto nextline;
! sflaglen = strlen(sflag);
! if (sflaglen == 0
! || (sflaglen > 1 && Conf->flagMode == FM_CHAR)
! || (sflaglen > 2 && Conf->flagMode == FM_LONG))
! goto nextline;
! flag = decodeFlag(Conf, sflag, (char **)NULL);
!
! /* Affix header */
! if (flag != flagprev)
{
! flagprev = flag;
isSuffix = (STRNCMP(ptype, "sfx") == 0) ? true : false;
if (t_iseq(find, 'y') || t_iseq(find, 'Y'))
flagflags = FF_CROSSPRODUCT;
else
flagflags = 0;
}
+ /* Affix fields */
else
{
char *ptr;
int aflg = 0;
! if (flag == 0)
goto nextline;
prepl = lowerstr_ctx(Conf, repl);
/* affix flag */
***************
*** 727,737 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
{
*ptr = '\0';
ptr = repl + (ptr - prepl) + 1;
! while (*ptr)
! {
! aflg |= Conf->flagval[*(unsigned char *) ptr];
! ptr++;
! }
}
pfind = lowerstr_ctx(Conf, find);
pmask = lowerstr_ctx(Conf, mask);
--- 893,899 ----
{
*ptr = '\0';
ptr = repl + (ptr - prepl) + 1;
! aflg |= getFlagValues(Conf, getFlags(Conf, ptr));
}
pfind = lowerstr_ctx(Conf, find);
pmask = lowerstr_ctx(Conf, mask);
***************
*** 789,794 **** NIImportAffixes(IspellDict *Conf, const char *filename)
--- 951,958 ----
memset(Conf->flagval, 0, sizeof(Conf->flagval));
Conf->usecompound = false;
+ Conf->useFlagAliases = false;
+ Conf->flagMode = FM_CHAR;
while ((recoded = tsearch_readline(&trst)) != NULL)
{
***************
*** 931,946 **** MergeAffix(IspellDict *Conf, int a1, int a2)
static uint32
makeCompoundFlags(IspellDict *Conf, int affix)
{
! uint32 flag = 0;
! char *str = Conf->AffixData[affix];
!
! while (str && *str)
! {
! flag |= Conf->flagval[*(unsigned char *) str];
! str++;
! }
!
! return (flag & FF_DICTFLAGMASK);
}
static SPNode *
--- 1095,1102 ----
static uint32
makeCompoundFlags(IspellDict *Conf, int affix)
{
! char *str = Conf->AffixData[affix];
! return (getFlagValues(Conf, str) & FF_DICTFLAGMASK);
}
static SPNode *
***************
*** 1032,1071 **** NISortDictionary(IspellDict *Conf)
/* compress affixes */
! /* Count the number of different flags used in the dictionary */
!
! qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspellaffix);
!
! naffix = 0;
! for (i = 0; i < Conf->nspell; i++)
{
! if (i == 0 || strncmp(Conf->Spell[i]->p.flag, Conf->Spell[i - 1]->p.flag, MAXFLAGLEN))
! naffix++;
}
! /*
! * Fill in Conf->AffixData with the affixes that were used in the
! * dictionary. Replace textual flag-field of Conf->Spell entries with
! * indexes into Conf->AffixData array.
! */
! Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
! curaffix = -1;
! for (i = 0; i < Conf->nspell; i++)
! {
! if (i == 0 || strncmp(Conf->Spell[i]->p.flag, Conf->AffixData[curaffix], MAXFLAGLEN))
{
! curaffix++;
! Assert(curaffix < naffix);
! Conf->AffixData[curaffix] = cpstrdup(Conf, Conf->Spell[i]->p.flag);
}
! Conf->Spell[i]->p.d.affix = curaffix;
! Conf->Spell[i]->p.d.len = strlen(Conf->Spell[i]->word);
}
- Conf->lenAffixData = Conf->nAffixData = naffix;
-
qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspell);
Conf->Dictionary = mkSPNode(Conf, 0, Conf->nspell, 0);
}
--- 1188,1243 ----
/* compress affixes */
! /* If we use flag aliases then we need to use Conf->AffixData filled in NIImportOOAffixes */
! if (Conf->useFlagAliases)
{
! for (i = 0; i < Conf->nspell; i++)
! {
! curaffix = strtol(Conf->Spell[i]->flag, (char **)NULL, 10);
! if (curaffix && curaffix <= Conf->nAffixData)
! Conf->Spell[i]->p.d.affix = curaffix;
! else
! Conf->Spell[i]->p.d.affix = 0;
! Conf->Spell[i]->p.d.len = strlen(Conf->Spell[i]->word);
! }
}
+ /* Otherwise fill Conf->AffixData here */
+ else
+ {
+ /* Count the number of different flags used in the dictionary */
+ qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspellaffix);
! naffix = 0;
! for (i = 0; i < Conf->nspell; i++)
! {
! if (i == 0 || strcmp(Conf->Spell[i]->flag, Conf->Spell[i - 1]->flag))
! naffix++;
! }
! /*
! * Fill in Conf->AffixData with the affixes that were used in the
! * dictionary. Replace textual flag-field of Conf->Spell entries with
! * indexes into Conf->AffixData array.
! */
! Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
!
! curaffix = -1;
! for (i = 0; i < Conf->nspell; i++)
{
! if (i == 0 || strcmp(Conf->Spell[i]->flag, Conf->AffixData[curaffix]))
! {
! curaffix++;
! Assert(curaffix < naffix);
! Conf->AffixData[curaffix] = cpstrdup(Conf, Conf->Spell[i]->flag);
! }
!
! Conf->Spell[i]->p.d.affix = curaffix;
! Conf->Spell[i]->p.d.len = strlen(Conf->Spell[i]->word);
}
! Conf->lenAffixData = Conf->nAffixData = naffix;
}
qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspell);
Conf->Dictionary = mkSPNode(Conf, 0, Conf->nspell, 0);
}
***************
*** 1185,1196 **** mkVoidAffix(IspellDict *Conf, bool issuffix, int startsuffix)
}
static bool
! isAffixInUse(IspellDict *Conf, char flag)
{
int i;
for (i = 0; i < Conf->nAffixData; i++)
! if (strchr(Conf->AffixData[i], flag) != NULL)
return true;
return false;
--- 1357,1368 ----
}
static bool
! isAffixInUse(IspellDict *Conf, int flag)
{
int i;
for (i = 0; i < Conf->nAffixData; i++)
! if (isAffixFlagInUse(Conf, i, flag))
return true;
return false;
***************
*** 1219,1225 **** NISortAffixes(IspellDict *Conf)
firstsuffix = i;
if ((Affix->flagflags & FF_COMPOUNDFLAG) && Affix->replen > 0 &&
! isAffixInUse(Conf, (char) Affix->flag))
{
if (ptr == Conf->CompoundAffix ||
ptr->issuffix != (ptr - 1)->issuffix ||
--- 1391,1397 ----
firstsuffix = i;
if ((Affix->flagflags & FF_COMPOUNDFLAG) && Affix->replen > 0 &&
! isAffixInUse(Conf, Affix->flag))
{
if (ptr == Conf->CompoundAffix ||
ptr->issuffix != (ptr - 1)->issuffix ||
***************
*** 1685,1691 **** SplitToVariants(IspellDict *Conf, SPNode *snode, SplitVar *orig, char *word, int
if (StopLow < StopHigh)
{
! if (level == FF_COMPOUNDBEGIN)
compoundflag = FF_COMPOUNDBEGIN;
else if (level == wordlen - 1)
compoundflag = FF_COMPOUNDLAST;
--- 1857,1863 ----
if (StopLow < StopHigh)
{
! if (startpos == 0)
compoundflag = FF_COMPOUNDBEGIN;
else if (level == wordlen - 1)
compoundflag = FF_COMPOUNDLAST;
*** a/src/include/tsearch/dicts/spell.h
--- b/src/include/tsearch/dicts/spell.h
***************
*** 57,69 **** typedef struct SPNode
typedef struct spell_struct
{
union
{
- /*
- * flag is filled in by NIImportDictionary. After NISortDictionary, d
- * is valid and flag is invalid.
- */
- char flag[MAXFLAGLEN];
struct
{
int affix;
--- 57,69 ----
typedef struct spell_struct
{
+ /*
+ * flag is filled in by NIImportDictionary. After NISortDictionary, d
+ * is valid and flag is invalid.
+ */
+ char *flag;
union
{
struct
{
int affix;
***************
*** 77,83 **** typedef struct spell_struct
typedef struct aff_struct
{
! uint32 flag:8,
type:1,
flagflags:7,
issimple:1,
--- 77,83 ----
typedef struct aff_struct
{
! uint32 flag:16,
type:1,
flagflags:7,
issimple:1,
***************
*** 132,137 **** typedef struct
--- 132,144 ----
bool issuffix;
} CMPDAffix;
+ typedef enum
+ {
+ FM_CHAR,
+ FM_LONG,
+ FM_NUM
+ } FlagMode;
+
typedef struct
{
int maffixes;
***************
*** 145,155 **** typedef struct
char **AffixData;
int lenAffixData;
int nAffixData;
CMPDAffix *CompoundAffix;
! unsigned char flagval[256];
bool usecompound;
/*
* Remaining fields are only used during dictionary construction; they are
--- 152,164 ----
char **AffixData;
int lenAffixData;
int nAffixData;
+ bool useFlagAliases;
CMPDAffix *CompoundAffix;
! unsigned char flagval[65000];
bool usecompound;
+ FlagMode flagMode;
/*
* Remaining fields are only used during dictionary construction; they are
On 08.11.2015 14:23, Artur Zakirov wrote:
Thank you for the reply.
This was because of the size of the flag field in the SPELL struct: long
flags read from the .dict file were being truncated.
I attached a new patch. It is a temporary patch, not the final one. It can be done
better.
I have updated the patch and attached it. The flag field of the SPELL
struct is now allocated dynamically.
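
To illustrate the truncation problem outside of PostgreSQL, here is a minimal standalone sketch. It is not code from the patch: FIXED_FLAG_LEN, store_fixed and store_dynamic are hypothetical names, and plain malloc stands in for the memory-context allocation the patch uses.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical small limit, standing in for a fixed-size flag field. */
#define FIXED_FLAG_LEN 16

/* Fixed-size storage: anything beyond FIXED_FLAG_LEN - 1 bytes is cut off. */
void
store_fixed(char dst[FIXED_FLAG_LEN], const char *flags)
{
	assert(dst != NULL && flags != NULL);
	snprintf(dst, FIXED_FLAG_LEN, "%s", flags);
}

/*
 * Dynamic storage: allocate exactly as much as the flag string needs.
 * Returns NULL if the allocation fails; the caller frees the result.
 */
char *
store_dynamic(const char *flags)
{
	size_t		len;
	char	   *copy;

	assert(flags != NULL);
	len = strlen(flags) + 1;
	copy = malloc(len);
	if (copy != NULL)
		memcpy(copy, flags, len);
	return copy;
}
```

With a flag list such as "101,202,303,404,505,606", store_fixed keeps only the first 15 characters, while store_dynamic keeps the whole list, which is the behavior a dictionary with many numeric flags per word needs.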
I have measured the dictionary loading time and the memory used by a
dictionary with the new patch. A dictionary is loaded at the first reference
to it, for example when we execute the ts_lexize function. This is why the first
ts_lexize call takes more time than the second.
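
The load-on-first-reference behavior can be sketched as follows. This is only an illustration under my own assumptions, not the PostgreSQL implementation: DemoDict, demo_load and demo_lexize are hypothetical names, and the "loading" is a stand-in for parsing the dictionary and affix files.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical dictionary handle; real dictionaries hold far more state. */
typedef struct DemoDict
{
	bool		loaded;			/* has the expensive load happened yet? */
	int			load_count;		/* how many times the load was performed */
	int			nwords;			/* placeholder for the loaded contents */
} DemoDict;

/* Stand-in for the expensive step of reading the .dict and .affix files. */
void
demo_load(DemoDict *d)
{
	d->nwords = 3;
	d->load_count++;
	d->loaded = true;
}

/* Every lookup goes through here; only the first one pays for loading. */
int
demo_lexize(DemoDict *d, const char *word)
{
	assert(d != NULL && word != NULL);
	if (!d->loaded)
		demo_load(d);
	return d->nwords;
}
```

Calling demo_lexize twice on the same handle performs the load only once, so timing the first call measures loading plus lookup, and timing the second measures lookup alone.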
The following table shows the performance of some dictionaries before and
after the patch on my computer.
-------------------------------------------------
| | loading time, ms | memory, MB |
| | before | after | before | after |
-------------------------------------------------
|ar | 700 | 300 | 23,7 | 15,7 |
|br_fr | 410 | 450 | 27,4 | 27,5 |
|ca | 248 | 245 | 14,7 | 15,4 |
|en_us | 100 | 100 | 5,4 | 6,2 |
|fr | 160 | 178 | 13,7 | 14,1 |
|gl_es | 160 | 150 | 9 | 9,4 |
|is | 260 | 202 | 16,1 | 16,3 |
-------------------------------------------------
As you can see, loading time and memory usage are substantially the same
before and after the patch.
Link to patch in commitfest:
https://commitfest.postgresql.org/8/420/
Link to regression tests:
https://dl.dropboxusercontent.com/u/15423817/HunspellDictTest.tar.gz
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachments:
hunspell_dict.patch (text/x-patch)
*** a/src/backend/tsearch/spell.c
--- b/src/backend/tsearch/spell.c
***************
*** 153,159 **** cmpspell(const void *s1, const void *s2)
static int
cmpspellaffix(const void *s1, const void *s2)
{
! return (strncmp((*(SPELL *const *) s1)->p.flag, (*(SPELL *const *) s2)->p.flag, MAXFLAGLEN));
}
static char *
--- 153,159 ----
static int
cmpspellaffix(const void *s1, const void *s2)
{
! return (strncmp((*(SPELL *const *) s1)->flag, (*(SPELL *const *) s2)->flag, MAXFLAGLEN));
}
static char *
***************
*** 237,242 **** cmpaffix(const void *s1, const void *s2)
--- 237,309 ----
(const unsigned char *) a2->repl);
}
+ static unsigned short
+ decodeFlag(IspellDict *Conf, char *sflag, char **sflagnext)
+ {
+ unsigned short s;
+ char *next;
+
+ switch (Conf->flagMode)
+ {
+ case FM_LONG:
+ s = (int)sflag[0] << 8 | (int)sflag[1];
+ if (sflagnext)
+ *sflagnext = sflag + 2;
+ break;
+ case FM_NUM:
+ s = (unsigned short) strtol(sflag, &next, 10);
+ if (sflagnext)
+ {
+ if (next)
+ {
+ *sflagnext = next;
+ while (**sflagnext)
+ {
+ if (**sflagnext == ',')
+ {
+ *sflagnext = *sflagnext + 1;
+ break;
+ }
+ *sflagnext = *sflagnext + 1;
+ }
+ }
+ else
+ *sflagnext = 0;
+ }
+ break;
+ default:
+ s = (unsigned short) *((unsigned char *)sflag);
+ if (sflagnext)
+ *sflagnext = sflag + 1;
+ }
+
+ return s;
+ }
+
+ static bool
+ isAffixFlagInUse(IspellDict *Conf, int affix, unsigned short affixflag)
+ {
+ char *flagcur;
+ char *flagnext = 0;
+
+ if (affixflag == 0)
+ return true;
+
+ flagcur = Conf->AffixData[affix];
+
+ while (*flagcur)
+ {
+ if (decodeFlag(Conf, flagcur, &flagnext) == affixflag)
+ return true;
+ if (flagnext)
+ flagcur = flagnext;
+ else
+ break;
+ }
+
+ return false;
+ }
+
static void
NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
{
***************
*** 255,261 **** NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
}
Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
strcpy(Conf->Spell[Conf->nspell]->word, word);
! strlcpy(Conf->Spell[Conf->nspell]->p.flag, flag, MAXFLAGLEN);
Conf->nspell++;
}
--- 322,328 ----
}
Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
strcpy(Conf->Spell[Conf->nspell]->word, word);
! Conf->Spell[Conf->nspell]->flag = (*flag != '\0') ? cpstrdup(Conf, flag) : VoidString;
Conf->nspell++;
}
***************
*** 355,361 **** FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
else if ((flag & StopMiddle->compoundflag) == 0)
return 0;
! if ((affixflag == 0) || (strchr(Conf->AffixData[StopMiddle->affix], affixflag) != NULL))
return 1;
}
node = StopMiddle->node;
--- 422,428 ----
else if ((flag & StopMiddle->compoundflag) == 0)
return 0;
! if (isAffixFlagInUse(Conf, StopMiddle->affix, affixflag))
return 1;
}
node = StopMiddle->node;
***************
*** 394,400 **** NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
Affix = Conf->Affix + Conf->naffixes;
! if (strcmp(mask, ".") == 0)
{
Affix->issimple = 1;
Affix->isregis = 0;
--- 461,467 ----
Affix = Conf->Affix + Conf->naffixes;
! if (strcmp(mask, ".") == 0 || *mask == '\0')
{
Affix->issimple = 1;
Affix->isregis = 0;
***************
*** 429,443 **** NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen,
REG_ADVANCED | REG_NOSUB,
DEFAULT_COLLATION_OID);
if (err)
! {
! char errstr[100];
!
! pg_regerror(err, &(Affix->reg.regex), errstr, sizeof(errstr));
! ereport(ERROR,
! (errcode(ERRCODE_INVALID_REGULAR_EXPRESSION),
! errmsg("invalid regular expression: %s", errstr)));
! }
}
Affix->flagflags = flagflags;
--- 496,504 ----
err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen,
REG_ADVANCED | REG_NOSUB,
DEFAULT_COLLATION_OID);
+ /* Ignore regular expression error and do not add wrong affix */
if (err)
! return;
}
Affix->flagflags = flagflags;
***************
*** 595,604 **** addFlagValue(IspellDict *Conf, char *s, uint32 val)
(errcode(ERRCODE_CONFIG_FILE_ERROR),
errmsg("multibyte flag character is not allowed")));
! Conf->flagval[*(unsigned char *) s] = (unsigned char) val;
Conf->usecompound = true;
}
/*
* Import an affix file that follows MySpell or Hunspell format
*/
--- 656,713 ----
(errcode(ERRCODE_CONFIG_FILE_ERROR),
errmsg("multibyte flag character is not allowed")));
! Conf->flagval[decodeFlag(Conf, s, (char **)NULL)] = (unsigned char) val;
Conf->usecompound = true;
}
+ static int
+ getFlagValues(IspellDict *Conf, char *s)
+ {
+ uint32 flag = 0;
+ char *flagcur;
+ char *flagnext = 0;
+
+ flagcur = s;
+ while (*flagcur)
+ {
+ flag |= Conf->flagval[decodeFlag(Conf, flagcur, &flagnext)];
+ if (flagnext)
+ flagcur = flagnext;
+ else
+ break;
+ }
+
+ return flag;
+ }
+
+ /*
+ * Get flag set from "s".
+ *
+ * Returns flag set from AffixData array if AF parameter used (useFlagAliases is true).
+ * In this case "s" is alias for flag set.
+ *
+ * Otherwise returns "s".
+ */
+ static char *
+ getFlags(IspellDict *Conf, char *s)
+ {
+ int curaffix;
+ if (Conf->useFlagAliases)
+ {
+ curaffix = strtol(s, (char **)NULL, 10);
+ if (curaffix && curaffix <= Conf->nAffixData)
+ /*
+ * Do not substract 1 from curaffix
+ * because empty string was added in NIImportOOAffixes
+ */
+ return Conf->AffixData[curaffix];
+ else
+ return VoidString;
+ }
+ else
+ return s;
+ }
+
/*
* Import an affix file that follows MySpell or Hunspell format
*/
***************
*** 615,621 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
char repl[BUFSIZ],
*prepl;
bool isSuffix = false;
! int flag = 0;
char flagflags = 0;
tsearch_readline_state trst;
int scanread = 0;
--- 724,734 ----
char repl[BUFSIZ],
*prepl;
bool isSuffix = false;
! int naffix = 0,
! curaffix = 0;
! int flag = 0,
! flagprev = 0,
! sflaglen = 0;
char flagflags = 0;
tsearch_readline_state trst;
int scanread = 0;
***************
*** 625,630 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
--- 738,745 ----
/* read file to find any flag */
memset(Conf->flagval, 0, sizeof(Conf->flagval));
Conf->usecompound = false;
+ Conf->useFlagAliases = false;
+ Conf->flagMode = FM_CHAR;
if (!tsearch_readline_begin(&trst, filename))
ereport(ERROR,
***************
*** 672,681 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
while (*s && t_isspace(s))
s += pg_mblen(s);
! if (*s && STRNCMP(s, "default") != 0)
! ereport(ERROR,
(errcode(ERRCODE_CONFIG_FILE_ERROR),
! errmsg("Ispell dictionary supports only default flag value")));
}
pfree(recoded);
--- 787,803 ----
while (*s && t_isspace(s))
s += pg_mblen(s);
! if (*s)
! {
! if (STRNCMP(s, "long") == 0)
! Conf->flagMode = FM_LONG;
! else if (STRNCMP(s, "num") == 0)
! Conf->flagMode = FM_NUM;
! else if (STRNCMP(s, "default") != 0)
! ereport(ERROR,
(errcode(ERRCODE_CONFIG_FILE_ERROR),
! errmsg("Ispell dictionary supports only default, long and num flag value")));
! }
}
pfree(recoded);
***************
*** 695,725 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
if (*recoded == '\0' || t_isspace(recoded) || t_iseq(recoded, '#'))
goto nextline;
scanread = sscanf(recoded, scanbuf, type, sflag, find, repl, mask);
if (ptype)
pfree(ptype);
ptype = lowerstr_ctx(Conf, type);
if (scanread < 4 || (STRNCMP(ptype, "sfx") && STRNCMP(ptype, "pfx")))
goto nextline;
! if (scanread == 4)
{
! if (strlen(sflag) != 1)
! goto nextline;
! flag = *sflag;
isSuffix = (STRNCMP(ptype, "sfx") == 0) ? true : false;
if (t_iseq(find, 'y') || t_iseq(find, 'Y'))
flagflags = FF_CROSSPRODUCT;
else
flagflags = 0;
}
else
{
char *ptr;
int aflg = 0;
! if (strlen(sflag) != 1 || flag != *sflag || flag == 0)
goto nextline;
prepl = lowerstr_ctx(Conf, repl);
/* affix flag */
--- 817,891 ----
if (*recoded == '\0' || t_isspace(recoded) || t_iseq(recoded, '#'))
goto nextline;
+ *find = *repl = *mask = '\0';
scanread = sscanf(recoded, scanbuf, type, sflag, find, repl, mask);
if (ptype)
pfree(ptype);
ptype = lowerstr_ctx(Conf, type);
+
+ /* First try to parse AF parameter (alias compression) */
+ if (STRNCMP(ptype, "af") == 0)
+ {
+ /* First line is the number of aliases */
+ if (!Conf->useFlagAliases)
+ {
+ Conf->useFlagAliases = true;
+ naffix = atoi(sflag);
+ if (naffix == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONFIG_FILE_ERROR),
+ errmsg("invalid number of flag vector aliases")));
+
+ /* Also reserve place for empty flag set */
+ naffix++;
+
+ Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
+ Conf->lenAffixData = Conf->nAffixData = naffix;
+
+ /* Add empty flag set into AffixData */
+ Conf->AffixData[curaffix] = VoidString;
+ curaffix++;
+ }
+ /* Other lines is aliases */
+ else
+ {
+ if (curaffix < naffix)
+ {
+ Conf->AffixData[curaffix] = cpstrdup(Conf, sflag);
+ curaffix++;
+ }
+ }
+ goto nextline;
+ }
+ /* Else try to parse prefixes and suffixes */
if (scanread < 4 || (STRNCMP(ptype, "sfx") && STRNCMP(ptype, "pfx")))
goto nextline;
! sflaglen = strlen(sflag);
! if (sflaglen == 0
! || (sflaglen > 1 && Conf->flagMode == FM_CHAR)
! || (sflaglen > 2 && Conf->flagMode == FM_LONG))
! goto nextline;
! flag = decodeFlag(Conf, sflag, (char **)NULL);
!
! /* Affix header */
! if (flag != flagprev)
{
! flagprev = flag;
isSuffix = (STRNCMP(ptype, "sfx") == 0) ? true : false;
if (t_iseq(find, 'y') || t_iseq(find, 'Y'))
flagflags = FF_CROSSPRODUCT;
else
flagflags = 0;
}
+ /* Affix fields */
else
{
char *ptr;
int aflg = 0;
! if (flag == 0)
goto nextline;
prepl = lowerstr_ctx(Conf, repl);
/* affix flag */
***************
*** 727,737 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
{
*ptr = '\0';
ptr = repl + (ptr - prepl) + 1;
! while (*ptr)
! {
! aflg |= Conf->flagval[*(unsigned char *) ptr];
! ptr++;
! }
}
pfind = lowerstr_ctx(Conf, find);
pmask = lowerstr_ctx(Conf, mask);
--- 893,899 ----
{
*ptr = '\0';
ptr = repl + (ptr - prepl) + 1;
! aflg |= getFlagValues(Conf, getFlags(Conf, ptr));
}
pfind = lowerstr_ctx(Conf, find);
pmask = lowerstr_ctx(Conf, mask);
***************
*** 789,794 **** NIImportAffixes(IspellDict *Conf, const char *filename)
--- 951,958 ----
memset(Conf->flagval, 0, sizeof(Conf->flagval));
Conf->usecompound = false;
+ Conf->useFlagAliases = false;
+ Conf->flagMode = FM_CHAR;
while ((recoded = tsearch_readline(&trst)) != NULL)
{
***************
*** 931,946 **** MergeAffix(IspellDict *Conf, int a1, int a2)
static uint32
makeCompoundFlags(IspellDict *Conf, int affix)
{
! uint32 flag = 0;
! char *str = Conf->AffixData[affix];
!
! while (str && *str)
! {
! flag |= Conf->flagval[*(unsigned char *) str];
! str++;
! }
!
! return (flag & FF_DICTFLAGMASK);
}
static SPNode *
--- 1095,1102 ----
static uint32
makeCompoundFlags(IspellDict *Conf, int affix)
{
! char *str = Conf->AffixData[affix];
! return (getFlagValues(Conf, str) & FF_DICTFLAGMASK);
}
static SPNode *
***************
*** 954,960 **** mkSPNode(IspellDict *Conf, int low, int high, int level)
int lownew = low;
for (i = low; i < high; i++)
! if (Conf->Spell[i]->p.d.len > level && lastchar != Conf->Spell[i]->word[level])
{
nchar++;
lastchar = Conf->Spell[i]->word[level];
--- 1110,1116 ----
int lownew = low;
for (i = low; i < high; i++)
! if (Conf->Spell[i]->d.len > level && lastchar != Conf->Spell[i]->word[level])
{
nchar++;
lastchar = Conf->Spell[i]->word[level];
***************
*** 969,975 **** mkSPNode(IspellDict *Conf, int low, int high, int level)
lastchar = '\0';
for (i = low; i < high; i++)
! if (Conf->Spell[i]->p.d.len > level)
{
if (lastchar != Conf->Spell[i]->word[level])
{
--- 1125,1131 ----
lastchar = '\0';
for (i = low; i < high; i++)
! if (Conf->Spell[i]->d.len > level)
{
if (lastchar != Conf->Spell[i]->word[level])
{
***************
*** 982,992 **** mkSPNode(IspellDict *Conf, int low, int high, int level)
lastchar = Conf->Spell[i]->word[level];
}
data->val = ((uint8 *) (Conf->Spell[i]->word))[level];
! if (Conf->Spell[i]->p.d.len == level + 1)
{
bool clearCompoundOnly = false;
! if (data->isword && data->affix != Conf->Spell[i]->p.d.affix)
{
/*
* MergeAffix called a few times. If one of word is
--- 1138,1148 ----
lastchar = Conf->Spell[i]->word[level];
}
data->val = ((uint8 *) (Conf->Spell[i]->word))[level];
! if (Conf->Spell[i]->d.len == level + 1)
{
bool clearCompoundOnly = false;
! if (data->isword && data->affix != Conf->Spell[i]->d.affix)
{
/*
* MergeAffix called a few times. If one of word is
***************
*** 995,1006 **** mkSPNode(IspellDict *Conf, int low, int high, int level)
*/
clearCompoundOnly = (FF_COMPOUNDONLY & data->compoundflag
! & makeCompoundFlags(Conf, Conf->Spell[i]->p.d.affix))
? false : true;
! data->affix = MergeAffix(Conf, data->affix, Conf->Spell[i]->p.d.affix);
}
else
! data->affix = Conf->Spell[i]->p.d.affix;
data->isword = 1;
data->compoundflag = makeCompoundFlags(Conf, data->affix);
--- 1151,1162 ----
*/
clearCompoundOnly = (FF_COMPOUNDONLY & data->compoundflag
! & makeCompoundFlags(Conf, Conf->Spell[i]->d.affix))
? false : true;
! data->affix = MergeAffix(Conf, data->affix, Conf->Spell[i]->d.affix);
}
else
! data->affix = Conf->Spell[i]->d.affix;
data->isword = 1;
data->compoundflag = makeCompoundFlags(Conf, data->affix);
***************
*** 1032,1070 **** NISortDictionary(IspellDict *Conf)
/* compress affixes */
! /* Count the number of different flags used in the dictionary */
!
! qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspellaffix);
!
! naffix = 0;
! for (i = 0; i < Conf->nspell; i++)
! {
! if (i == 0 || strncmp(Conf->Spell[i]->p.flag, Conf->Spell[i - 1]->p.flag, MAXFLAGLEN))
! naffix++;
! }
!
! /*
! * Fill in Conf->AffixData with the affixes that were used in the
! * dictionary. Replace textual flag-field of Conf->Spell entries with
! * indexes into Conf->AffixData array.
*/
! Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
!
! curaffix = -1;
! for (i = 0; i < Conf->nspell; i++)
{
! if (i == 0 || strncmp(Conf->Spell[i]->p.flag, Conf->AffixData[curaffix], MAXFLAGLEN))
{
! curaffix++;
! Assert(curaffix < naffix);
! Conf->AffixData[curaffix] = cpstrdup(Conf, Conf->Spell[i]->p.flag);
}
-
- Conf->Spell[i]->p.d.affix = curaffix;
- Conf->Spell[i]->p.d.len = strlen(Conf->Spell[i]->word);
}
! Conf->lenAffixData = Conf->nAffixData = naffix;
qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspell);
Conf->Dictionary = mkSPNode(Conf, 0, Conf->nspell, 0);
--- 1188,1244 ----
/* compress affixes */
! /* If we use flag aliases then we need to use Conf->AffixData filled in NIImportOOAffixes.
! * If Conf->Spell[i]->flag is empty, then get empty value of Conf->AffixData (0 index)
*/
! if (Conf->useFlagAliases)
{
! for (i = 0; i < Conf->nspell; i++)
{
! curaffix = strtol(Conf->Spell[i]->flag, (char **)NULL, 10);
! if (curaffix && curaffix <= Conf->nAffixData)
! Conf->Spell[i]->d.affix = curaffix;
! else
! Conf->Spell[i]->d.affix = 0;
! Conf->Spell[i]->d.len = strlen(Conf->Spell[i]->word);
}
}
+ /* Otherwise fill Conf->AffixData here */
+ else
+ {
+ /* Count the number of different flags used in the dictionary */
+ qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspellaffix);
+
+ naffix = 0;
+ for (i = 0; i < Conf->nspell; i++)
+ {
+ if (i == 0 || strcmp(Conf->Spell[i]->flag, Conf->Spell[i - 1]->flag))
+ naffix++;
+ }
! /*
! * Fill in Conf->AffixData with the affixes that were used in the
! * dictionary. Replace textual flag-field of Conf->Spell entries with
! * indexes into Conf->AffixData array.
! */
! Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
!
! curaffix = -1;
! for (i = 0; i < Conf->nspell; i++)
! {
! if (i == 0 || strcmp(Conf->Spell[i]->flag, Conf->AffixData[curaffix]))
! {
! curaffix++;
! Assert(curaffix < naffix);
! Conf->AffixData[curaffix] = cpstrdup(Conf, Conf->Spell[i]->flag);
! }
!
! Conf->Spell[i]->d.affix = curaffix;
! Conf->Spell[i]->d.len = strlen(Conf->Spell[i]->word);
! }
!
! Conf->lenAffixData = Conf->nAffixData = naffix;
! }
qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspell);
Conf->Dictionary = mkSPNode(Conf, 0, Conf->nspell, 0);
***************
*** 1185,1196 **** mkVoidAffix(IspellDict *Conf, bool issuffix, int startsuffix)
}
static bool
! isAffixInUse(IspellDict *Conf, char flag)
{
int i;
for (i = 0; i < Conf->nAffixData; i++)
! if (strchr(Conf->AffixData[i], flag) != NULL)
return true;
return false;
--- 1359,1370 ----
}
static bool
! isAffixInUse(IspellDict *Conf, int flag)
{
int i;
for (i = 0; i < Conf->nAffixData; i++)
! if (isAffixFlagInUse(Conf, i, flag))
return true;
return false;
***************
*** 1219,1225 **** NISortAffixes(IspellDict *Conf)
firstsuffix = i;
if ((Affix->flagflags & FF_COMPOUNDFLAG) && Affix->replen > 0 &&
! isAffixInUse(Conf, (char) Affix->flag))
{
if (ptr == Conf->CompoundAffix ||
ptr->issuffix != (ptr - 1)->issuffix ||
--- 1393,1399 ----
firstsuffix = i;
if ((Affix->flagflags & FF_COMPOUNDFLAG) && Affix->replen > 0 &&
! isAffixInUse(Conf, Affix->flag))
{
if (ptr == Conf->CompoundAffix ||
ptr->issuffix != (ptr - 1)->issuffix ||
***************
*** 1685,1691 **** SplitToVariants(IspellDict *Conf, SPNode *snode, SplitVar *orig, char *word, int
if (StopLow < StopHigh)
{
! if (level == FF_COMPOUNDBEGIN)
compoundflag = FF_COMPOUNDBEGIN;
else if (level == wordlen - 1)
compoundflag = FF_COMPOUNDLAST;
--- 1859,1865 ----
if (StopLow < StopHigh)
{
! if (startpos == 0)
compoundflag = FF_COMPOUNDBEGIN;
else if (level == wordlen - 1)
compoundflag = FF_COMPOUNDLAST;
*** a/src/include/tsearch/dicts/spell.h
--- b/src/include/tsearch/dicts/spell.h
***************
*** 57,75 **** typedef struct SPNode
typedef struct spell_struct
{
! union
{
! /*
! * flag is filled in by NIImportDictionary. After NISortDictionary, d
! * is valid and flag is invalid.
! */
! char flag[MAXFLAGLEN];
! struct
! {
! int affix;
! int len;
! } d;
! } p;
char word[FLEXIBLE_ARRAY_MEMBER];
} SPELL;
--- 57,72 ----
typedef struct spell_struct
{
! struct
{
! int affix;
! int len;
! } d;
! /*
! * flag is filled in by NIImportDictionary. After NISortDictionary, d
! * is valid and flag is invalid.
! */
! char *flag;
char word[FLEXIBLE_ARRAY_MEMBER];
} SPELL;
***************
*** 77,83 **** typedef struct spell_struct
typedef struct aff_struct
{
! uint32 flag:8,
type:1,
flagflags:7,
issimple:1,
--- 74,80 ----
typedef struct aff_struct
{
! uint32 flag:16,
type:1,
flagflags:7,
issimple:1,
***************
*** 132,137 **** typedef struct
--- 129,141 ----
bool issuffix;
} CMPDAffix;
+ typedef enum
+ {
+ FM_CHAR,
+ FM_LONG,
+ FM_NUM
+ } FlagMode;
+
typedef struct
{
int maffixes;
***************
*** 145,155 **** typedef struct
char **AffixData;
int lenAffixData;
int nAffixData;
CMPDAffix *CompoundAffix;
! unsigned char flagval[256];
bool usecompound;
/*
* Remaining fields are only used during dictionary construction; they are
--- 149,161 ----
char **AffixData;
int lenAffixData;
int nAffixData;
+ bool useFlagAliases;
CMPDAffix *CompoundAffix;
! unsigned char flagval[65000];
bool usecompound;
+ FlagMode flagMode;
/*
* Remaining fields are only used during dictionary construction; they are
On 10.11.2015 13:23, Artur Zakirov wrote:
Link to patch in commitfest:
https://commitfest.postgresql.org/8/420/
Link to regression tests:
https://dl.dropboxusercontent.com/u/15423817/HunspellDictTest.tar.gz
Hello!
Do you have any remarks or comments about my patch?
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 16.11.2015 15:51, Artur Zakirov wrote:
On 10.11.2015 13:23, Artur Zakirov wrote:
Link to patch in commitfest:
https://commitfest.postgresql.org/8/420/
Link to regression tests:
https://dl.dropboxusercontent.com/u/15423817/HunspellDictTest.tar.gz
I have made some changes to the documentation in section "12.6.
Dictionaries". I have added a description of how to load Ispell and
Hunspell dictionaries and a description of the Ispell and Hunspell formats.
Patches for the documentation and for the code are attached separately.
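
As a rough illustration of the three Hunspell flag formats (default single character, FLAG long, FLAG num), here is a standalone sketch of decoding one flag from a flag string. It is a simplified example written for this message, not the decodeFlag function from the patch; DemoFlagMode and demo_decode_flag are hypothetical names, and the comma separator for numeric flags is assumed from the Hunspell format.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum
{
	DEMO_FM_CHAR,				/* one 8-bit character per flag */
	DEMO_FM_LONG,				/* two 8-bit characters per flag */
	DEMO_FM_NUM					/* decimal numbers separated by commas */
} DemoFlagMode;

/*
 * Decode the flag that starts at "s" and return its numeric value.
 * "*next" is set to the start of the following flag, or to the
 * terminating '\0' when no flags are left.
 */
unsigned
demo_decode_flag(DemoFlagMode mode, const char *s, const char **next)
{
	const unsigned char *u = (const unsigned char *) s;
	unsigned	value;
	char	   *end;

	assert(s != NULL && next != NULL && *s != '\0');

	switch (mode)
	{
		case DEMO_FM_LONG:
			value = (unsigned) u[0] << 8;
			if (u[1] != '\0')
			{
				value |= u[1];
				*next = s + 2;
			}
			else
				*next = s + 1;	/* odd-length string: do not run past the end */
			break;
		case DEMO_FM_NUM:
			value = (unsigned) strtoul(s, &end, 10);
			if (end == s)
				end++;			/* not a digit: skip it so callers always advance */
			else if (*end == ',')
				end++;			/* step over the separator */
			*next = end;
			break;
		default:
			value = u[0];
			*next = s + 1;
			break;
	}
	return value;
}
```

A caller can walk a whole flag set by calling demo_decode_flag repeatedly until *next points at the terminating '\0'. For example, the numeric string "101,65000" yields 101 and then 65000.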
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachments:
hunspell_dict.patch (text/x-patch)
*** a/src/backend/tsearch/spell.c
--- b/src/backend/tsearch/spell.c
***************
*** 153,159 **** cmpspell(const void *s1, const void *s2)
static int
cmpspellaffix(const void *s1, const void *s2)
{
! return (strncmp((*(SPELL *const *) s1)->p.flag, (*(SPELL *const *) s2)->p.flag, MAXFLAGLEN));
}
static char *
--- 153,159 ----
static int
cmpspellaffix(const void *s1, const void *s2)
{
! return (strncmp((*(SPELL *const *) s1)->flag, (*(SPELL *const *) s2)->flag, MAXFLAGLEN));
}
static char *
***************
*** 237,242 **** cmpaffix(const void *s1, const void *s2)
--- 237,309 ----
(const unsigned char *) a2->repl);
}
+ static unsigned short
+ decodeFlag(IspellDict *Conf, char *sflag, char **sflagnext)
+ {
+ unsigned short s;
+ char *next;
+
+ switch (Conf->flagMode)
+ {
+ case FM_LONG:
+ s = (int)sflag[0] << 8 | (int)sflag[1];
+ if (sflagnext)
+ *sflagnext = sflag + 2;
+ break;
+ case FM_NUM:
+ s = (unsigned short) strtol(sflag, &next, 10);
+ if (sflagnext)
+ {
+ if (next)
+ {
+ *sflagnext = next;
+ while (**sflagnext)
+ {
+ if (**sflagnext == ',')
+ {
+ *sflagnext = *sflagnext + 1;
+ break;
+ }
+ *sflagnext = *sflagnext + 1;
+ }
+ }
+ else
+ *sflagnext = 0;
+ }
+ break;
+ default:
+ s = (unsigned short) *((unsigned char *)sflag);
+ if (sflagnext)
+ *sflagnext = sflag + 1;
+ }
+
+ return s;
+ }
+
+ static bool
+ isAffixFlagInUse(IspellDict *Conf, int affix, unsigned short affixflag)
+ {
+ char *flagcur;
+ char *flagnext = 0;
+
+ if (affixflag == 0)
+ return true;
+
+ flagcur = Conf->AffixData[affix];
+
+ while (*flagcur)
+ {
+ if (decodeFlag(Conf, flagcur, &flagnext) == affixflag)
+ return true;
+ if (flagnext)
+ flagcur = flagnext;
+ else
+ break;
+ }
+
+ return false;
+ }
+
static void
NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
{
***************
*** 255,261 **** NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
}
Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
strcpy(Conf->Spell[Conf->nspell]->word, word);
! strlcpy(Conf->Spell[Conf->nspell]->p.flag, flag, MAXFLAGLEN);
Conf->nspell++;
}
--- 322,328 ----
}
Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
strcpy(Conf->Spell[Conf->nspell]->word, word);
! Conf->Spell[Conf->nspell]->flag = (*flag != '\0') ? cpstrdup(Conf, flag) : VoidString;
Conf->nspell++;
}
***************
*** 355,361 **** FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
else if ((flag & StopMiddle->compoundflag) == 0)
return 0;
! if ((affixflag == 0) || (strchr(Conf->AffixData[StopMiddle->affix], affixflag) != NULL))
return 1;
}
node = StopMiddle->node;
--- 422,428 ----
else if ((flag & StopMiddle->compoundflag) == 0)
return 0;
! if (isAffixFlagInUse(Conf, StopMiddle->affix, affixflag))
return 1;
}
node = StopMiddle->node;
***************
*** 394,400 **** NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
Affix = Conf->Affix + Conf->naffixes;
! if (strcmp(mask, ".") == 0)
{
Affix->issimple = 1;
Affix->isregis = 0;
--- 461,467 ----
Affix = Conf->Affix + Conf->naffixes;
! if (strcmp(mask, ".") == 0 || *mask == '\0')
{
Affix->issimple = 1;
Affix->isregis = 0;
***************
*** 429,443 **** NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen,
REG_ADVANCED | REG_NOSUB,
DEFAULT_COLLATION_OID);
if (err)
! {
! char errstr[100];
!
! pg_regerror(err, &(Affix->reg.regex), errstr, sizeof(errstr));
! ereport(ERROR,
! (errcode(ERRCODE_INVALID_REGULAR_EXPRESSION),
! errmsg("invalid regular expression: %s", errstr)));
! }
}
Affix->flagflags = flagflags;
--- 496,504 ----
err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen,
REG_ADVANCED | REG_NOSUB,
DEFAULT_COLLATION_OID);
+ /* Ignore regular expression error and do not add wrong affix */
if (err)
! return;
}
Affix->flagflags = flagflags;
***************
*** 595,604 **** addFlagValue(IspellDict *Conf, char *s, uint32 val)
(errcode(ERRCODE_CONFIG_FILE_ERROR),
errmsg("multibyte flag character is not allowed")));
! Conf->flagval[*(unsigned char *) s] = (unsigned char) val;
Conf->usecompound = true;
}
/*
* Import an affix file that follows MySpell or Hunspell format
*/
--- 656,713 ----
(errcode(ERRCODE_CONFIG_FILE_ERROR),
errmsg("multibyte flag character is not allowed")));
! Conf->flagval[decodeFlag(Conf, s, (char **)NULL)] = (unsigned char) val;
Conf->usecompound = true;
}
+ static int
+ getFlagValues(IspellDict *Conf, char *s)
+ {
+ uint32 flag = 0;
+ char *flagcur;
+ char *flagnext = 0;
+
+ flagcur = s;
+ while (*flagcur)
+ {
+ flag |= Conf->flagval[decodeFlag(Conf, flagcur, &flagnext)];
+ if (flagnext)
+ flagcur = flagnext;
+ else
+ break;
+ }
+
+ return flag;
+ }
+
+ /*
+ * Get flag set from "s".
+ *
+ * Returns flag set from AffixData array if AF parameter used (useFlagAliases is true).
+ * In this case "s" is alias for flag set.
+ *
+ * Otherwise returns "s".
+ */
+ static char *
+ getFlags(IspellDict *Conf, char *s)
+ {
+ int curaffix;
+ if (Conf->useFlagAliases)
+ {
+ curaffix = strtol(s, (char **)NULL, 10);
+ if (curaffix && curaffix <= Conf->nAffixData)
+ /*
+ * Do not subtract 1 from curaffix
+ * because empty string was added in NIImportOOAffixes
+ */
+ return Conf->AffixData[curaffix];
+ else
+ return VoidString;
+ }
+ else
+ return s;
+ }
+
/*
* Import an affix file that follows MySpell or Hunspell format
*/
***************
*** 615,621 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
char repl[BUFSIZ],
*prepl;
bool isSuffix = false;
! int flag = 0;
char flagflags = 0;
tsearch_readline_state trst;
int scanread = 0;
--- 724,734 ----
char repl[BUFSIZ],
*prepl;
bool isSuffix = false;
! int naffix = 0,
! curaffix = 0;
! int flag = 0,
! flagprev = 0,
! sflaglen = 0;
char flagflags = 0;
tsearch_readline_state trst;
int scanread = 0;
***************
*** 625,630 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
--- 738,745 ----
/* read file to find any flag */
memset(Conf->flagval, 0, sizeof(Conf->flagval));
Conf->usecompound = false;
+ Conf->useFlagAliases = false;
+ Conf->flagMode = FM_CHAR;
if (!tsearch_readline_begin(&trst, filename))
ereport(ERROR,
***************
*** 672,681 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
while (*s && t_isspace(s))
s += pg_mblen(s);
! if (*s && STRNCMP(s, "default") != 0)
! ereport(ERROR,
(errcode(ERRCODE_CONFIG_FILE_ERROR),
! errmsg("Ispell dictionary supports only default flag value")));
}
pfree(recoded);
--- 787,803 ----
while (*s && t_isspace(s))
s += pg_mblen(s);
! if (*s)
! {
! if (STRNCMP(s, "long") == 0)
! Conf->flagMode = FM_LONG;
! else if (STRNCMP(s, "num") == 0)
! Conf->flagMode = FM_NUM;
! else if (STRNCMP(s, "default") != 0)
! ereport(ERROR,
(errcode(ERRCODE_CONFIG_FILE_ERROR),
! errmsg("Ispell dictionary supports only default, long and num flag value")));
! }
}
pfree(recoded);
***************
*** 695,725 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
if (*recoded == '\0' || t_isspace(recoded) || t_iseq(recoded, '#'))
goto nextline;
scanread = sscanf(recoded, scanbuf, type, sflag, find, repl, mask);
if (ptype)
pfree(ptype);
ptype = lowerstr_ctx(Conf, type);
if (scanread < 4 || (STRNCMP(ptype, "sfx") && STRNCMP(ptype, "pfx")))
goto nextline;
! if (scanread == 4)
{
! if (strlen(sflag) != 1)
! goto nextline;
! flag = *sflag;
isSuffix = (STRNCMP(ptype, "sfx") == 0) ? true : false;
if (t_iseq(find, 'y') || t_iseq(find, 'Y'))
flagflags = FF_CROSSPRODUCT;
else
flagflags = 0;
}
else
{
char *ptr;
int aflg = 0;
! if (strlen(sflag) != 1 || flag != *sflag || flag == 0)
goto nextline;
prepl = lowerstr_ctx(Conf, repl);
/* affix flag */
--- 817,891 ----
if (*recoded == '\0' || t_isspace(recoded) || t_iseq(recoded, '#'))
goto nextline;
+ *find = *repl = *mask = '\0';
scanread = sscanf(recoded, scanbuf, type, sflag, find, repl, mask);
if (ptype)
pfree(ptype);
ptype = lowerstr_ctx(Conf, type);
+
+ /* First try to parse AF parameter (alias compression) */
+ if (STRNCMP(ptype, "af") == 0)
+ {
+ /* First line is the number of aliases */
+ if (!Conf->useFlagAliases)
+ {
+ Conf->useFlagAliases = true;
+ naffix = atoi(sflag);
+ if (naffix == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONFIG_FILE_ERROR),
+ errmsg("invalid number of flag vector aliases")));
+
+ /* Also reserve place for empty flag set */
+ naffix++;
+
+ Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
+ Conf->lenAffixData = Conf->nAffixData = naffix;
+
+ /* Add empty flag set into AffixData */
+ Conf->AffixData[curaffix] = VoidString;
+ curaffix++;
+ }
+ /* Other lines is aliases */
+ else
+ {
+ if (curaffix < naffix)
+ {
+ Conf->AffixData[curaffix] = cpstrdup(Conf, sflag);
+ curaffix++;
+ }
+ }
+ goto nextline;
+ }
+ /* Else try to parse prefixes and suffixes */
if (scanread < 4 || (STRNCMP(ptype, "sfx") && STRNCMP(ptype, "pfx")))
goto nextline;
! sflaglen = strlen(sflag);
! if (sflaglen == 0
! || (sflaglen > 1 && Conf->flagMode == FM_CHAR)
! || (sflaglen > 2 && Conf->flagMode == FM_LONG))
! goto nextline;
! flag = decodeFlag(Conf, sflag, (char **)NULL);
!
! /* Affix header */
! if (flag != flagprev)
{
! flagprev = flag;
isSuffix = (STRNCMP(ptype, "sfx") == 0) ? true : false;
if (t_iseq(find, 'y') || t_iseq(find, 'Y'))
flagflags = FF_CROSSPRODUCT;
else
flagflags = 0;
}
+ /* Affix fields */
else
{
char *ptr;
int aflg = 0;
! if (flag == 0)
goto nextline;
prepl = lowerstr_ctx(Conf, repl);
/* affix flag */
***************
*** 727,737 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
{
*ptr = '\0';
ptr = repl + (ptr - prepl) + 1;
! while (*ptr)
! {
! aflg |= Conf->flagval[*(unsigned char *) ptr];
! ptr++;
! }
}
pfind = lowerstr_ctx(Conf, find);
pmask = lowerstr_ctx(Conf, mask);
--- 893,899 ----
{
*ptr = '\0';
ptr = repl + (ptr - prepl) + 1;
! aflg |= getFlagValues(Conf, getFlags(Conf, ptr));
}
pfind = lowerstr_ctx(Conf, find);
pmask = lowerstr_ctx(Conf, mask);
***************
*** 789,794 **** NIImportAffixes(IspellDict *Conf, const char *filename)
--- 951,958 ----
memset(Conf->flagval, 0, sizeof(Conf->flagval));
Conf->usecompound = false;
+ Conf->useFlagAliases = false;
+ Conf->flagMode = FM_CHAR;
while ((recoded = tsearch_readline(&trst)) != NULL)
{
***************
*** 931,946 **** MergeAffix(IspellDict *Conf, int a1, int a2)
static uint32
makeCompoundFlags(IspellDict *Conf, int affix)
{
! uint32 flag = 0;
! char *str = Conf->AffixData[affix];
!
! while (str && *str)
! {
! flag |= Conf->flagval[*(unsigned char *) str];
! str++;
! }
!
! return (flag & FF_DICTFLAGMASK);
}
static SPNode *
--- 1095,1102 ----
static uint32
makeCompoundFlags(IspellDict *Conf, int affix)
{
! char *str = Conf->AffixData[affix];
! return (getFlagValues(Conf, str) & FF_DICTFLAGMASK);
}
static SPNode *
***************
*** 954,960 **** mkSPNode(IspellDict *Conf, int low, int high, int level)
int lownew = low;
for (i = low; i < high; i++)
! if (Conf->Spell[i]->p.d.len > level && lastchar != Conf->Spell[i]->word[level])
{
nchar++;
lastchar = Conf->Spell[i]->word[level];
--- 1110,1116 ----
int lownew = low;
for (i = low; i < high; i++)
! if (Conf->Spell[i]->d.len > level && lastchar != Conf->Spell[i]->word[level])
{
nchar++;
lastchar = Conf->Spell[i]->word[level];
***************
*** 969,975 **** mkSPNode(IspellDict *Conf, int low, int high, int level)
lastchar = '\0';
for (i = low; i < high; i++)
! if (Conf->Spell[i]->p.d.len > level)
{
if (lastchar != Conf->Spell[i]->word[level])
{
--- 1125,1131 ----
lastchar = '\0';
for (i = low; i < high; i++)
! if (Conf->Spell[i]->d.len > level)
{
if (lastchar != Conf->Spell[i]->word[level])
{
***************
*** 982,992 **** mkSPNode(IspellDict *Conf, int low, int high, int level)
lastchar = Conf->Spell[i]->word[level];
}
data->val = ((uint8 *) (Conf->Spell[i]->word))[level];
! if (Conf->Spell[i]->p.d.len == level + 1)
{
bool clearCompoundOnly = false;
! if (data->isword && data->affix != Conf->Spell[i]->p.d.affix)
{
/*
* MergeAffix called a few times. If one of word is
--- 1138,1148 ----
lastchar = Conf->Spell[i]->word[level];
}
data->val = ((uint8 *) (Conf->Spell[i]->word))[level];
! if (Conf->Spell[i]->d.len == level + 1)
{
bool clearCompoundOnly = false;
! if (data->isword && data->affix != Conf->Spell[i]->d.affix)
{
/*
* MergeAffix called a few times. If one of word is
***************
*** 995,1006 **** mkSPNode(IspellDict *Conf, int low, int high, int level)
*/
clearCompoundOnly = (FF_COMPOUNDONLY & data->compoundflag
! & makeCompoundFlags(Conf, Conf->Spell[i]->p.d.affix))
? false : true;
! data->affix = MergeAffix(Conf, data->affix, Conf->Spell[i]->p.d.affix);
}
else
! data->affix = Conf->Spell[i]->p.d.affix;
data->isword = 1;
data->compoundflag = makeCompoundFlags(Conf, data->affix);
--- 1151,1162 ----
*/
clearCompoundOnly = (FF_COMPOUNDONLY & data->compoundflag
! & makeCompoundFlags(Conf, Conf->Spell[i]->d.affix))
? false : true;
! data->affix = MergeAffix(Conf, data->affix, Conf->Spell[i]->d.affix);
}
else
! data->affix = Conf->Spell[i]->d.affix;
data->isword = 1;
data->compoundflag = makeCompoundFlags(Conf, data->affix);
***************
*** 1032,1070 **** NISortDictionary(IspellDict *Conf)
/* compress affixes */
! /* Count the number of different flags used in the dictionary */
!
! qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspellaffix);
!
! naffix = 0;
! for (i = 0; i < Conf->nspell; i++)
! {
! if (i == 0 || strncmp(Conf->Spell[i]->p.flag, Conf->Spell[i - 1]->p.flag, MAXFLAGLEN))
! naffix++;
! }
!
! /*
! * Fill in Conf->AffixData with the affixes that were used in the
! * dictionary. Replace textual flag-field of Conf->Spell entries with
! * indexes into Conf->AffixData array.
*/
! Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
!
! curaffix = -1;
! for (i = 0; i < Conf->nspell; i++)
{
! if (i == 0 || strncmp(Conf->Spell[i]->p.flag, Conf->AffixData[curaffix], MAXFLAGLEN))
{
! curaffix++;
! Assert(curaffix < naffix);
! Conf->AffixData[curaffix] = cpstrdup(Conf, Conf->Spell[i]->p.flag);
}
-
- Conf->Spell[i]->p.d.affix = curaffix;
- Conf->Spell[i]->p.d.len = strlen(Conf->Spell[i]->word);
}
! Conf->lenAffixData = Conf->nAffixData = naffix;
qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspell);
Conf->Dictionary = mkSPNode(Conf, 0, Conf->nspell, 0);
--- 1188,1244 ----
/* compress affixes */
! /* If we use flag aliases then we need to use Conf->AffixData filled in NIImportOOAffixes.
! * If Conf->Spell[i]->flag is empty, then get empty value of Conf->AffixData (0 index)
*/
! if (Conf->useFlagAliases)
{
! for (i = 0; i < Conf->nspell; i++)
{
! curaffix = strtol(Conf->Spell[i]->flag, (char **)NULL, 10);
! if (curaffix && curaffix <= Conf->nAffixData)
! Conf->Spell[i]->d.affix = curaffix;
! else
! Conf->Spell[i]->d.affix = 0;
! Conf->Spell[i]->d.len = strlen(Conf->Spell[i]->word);
}
}
+ /* Otherwise fill Conf->AffixData here */
+ else
+ {
+ /* Count the number of different flags used in the dictionary */
+ qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspellaffix);
+
+ naffix = 0;
+ for (i = 0; i < Conf->nspell; i++)
+ {
+ if (i == 0 || strcmp(Conf->Spell[i]->flag, Conf->Spell[i - 1]->flag))
+ naffix++;
+ }
! /*
! * Fill in Conf->AffixData with the affixes that were used in the
! * dictionary. Replace textual flag-field of Conf->Spell entries with
! * indexes into Conf->AffixData array.
! */
! Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
!
! curaffix = -1;
! for (i = 0; i < Conf->nspell; i++)
! {
! if (i == 0 || strcmp(Conf->Spell[i]->flag, Conf->AffixData[curaffix]))
! {
! curaffix++;
! Assert(curaffix < naffix);
! Conf->AffixData[curaffix] = cpstrdup(Conf, Conf->Spell[i]->flag);
! }
!
! Conf->Spell[i]->d.affix = curaffix;
! Conf->Spell[i]->d.len = strlen(Conf->Spell[i]->word);
! }
!
! Conf->lenAffixData = Conf->nAffixData = naffix;
! }
qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspell);
Conf->Dictionary = mkSPNode(Conf, 0, Conf->nspell, 0);
***************
*** 1185,1196 **** mkVoidAffix(IspellDict *Conf, bool issuffix, int startsuffix)
}
static bool
! isAffixInUse(IspellDict *Conf, char flag)
{
int i;
for (i = 0; i < Conf->nAffixData; i++)
! if (strchr(Conf->AffixData[i], flag) != NULL)
return true;
return false;
--- 1359,1370 ----
}
static bool
! isAffixInUse(IspellDict *Conf, int flag)
{
int i;
for (i = 0; i < Conf->nAffixData; i++)
! if (isAffixFlagInUse(Conf, i, flag))
return true;
return false;
***************
*** 1219,1225 **** NISortAffixes(IspellDict *Conf)
firstsuffix = i;
if ((Affix->flagflags & FF_COMPOUNDFLAG) && Affix->replen > 0 &&
! isAffixInUse(Conf, (char) Affix->flag))
{
if (ptr == Conf->CompoundAffix ||
ptr->issuffix != (ptr - 1)->issuffix ||
--- 1393,1399 ----
firstsuffix = i;
if ((Affix->flagflags & FF_COMPOUNDFLAG) && Affix->replen > 0 &&
! isAffixInUse(Conf, Affix->flag))
{
if (ptr == Conf->CompoundAffix ||
ptr->issuffix != (ptr - 1)->issuffix ||
***************
*** 1685,1691 **** SplitToVariants(IspellDict *Conf, SPNode *snode, SplitVar *orig, char *word, int
if (StopLow < StopHigh)
{
! if (level == FF_COMPOUNDBEGIN)
compoundflag = FF_COMPOUNDBEGIN;
else if (level == wordlen - 1)
compoundflag = FF_COMPOUNDLAST;
--- 1859,1865 ----
if (StopLow < StopHigh)
{
! if (startpos == 0)
compoundflag = FF_COMPOUNDBEGIN;
else if (level == wordlen - 1)
compoundflag = FF_COMPOUNDLAST;
*** a/src/include/tsearch/dicts/spell.h
--- b/src/include/tsearch/dicts/spell.h
***************
*** 57,75 **** typedef struct SPNode
typedef struct spell_struct
{
! union
{
! /*
! * flag is filled in by NIImportDictionary. After NISortDictionary, d
! * is valid and flag is invalid.
! */
! char flag[MAXFLAGLEN];
! struct
! {
! int affix;
! int len;
! } d;
! } p;
char word[FLEXIBLE_ARRAY_MEMBER];
} SPELL;
--- 57,72 ----
typedef struct spell_struct
{
! struct
{
! int affix;
! int len;
! } d;
! /*
! * flag is filled in by NIImportDictionary. After NISortDictionary, d
! * is valid and flag is invalid.
! */
! char *flag;
char word[FLEXIBLE_ARRAY_MEMBER];
} SPELL;
***************
*** 77,83 **** typedef struct spell_struct
typedef struct aff_struct
{
! uint32 flag:8,
type:1,
flagflags:7,
issimple:1,
--- 74,80 ----
typedef struct aff_struct
{
! uint32 flag:16,
type:1,
flagflags:7,
issimple:1,
***************
*** 132,137 **** typedef struct
--- 129,141 ----
bool issuffix;
} CMPDAffix;
+ typedef enum
+ {
+ FM_CHAR,
+ FM_LONG,
+ FM_NUM
+ } FlagMode;
+
typedef struct
{
int maffixes;
***************
*** 145,155 **** typedef struct
char **AffixData;
int lenAffixData;
int nAffixData;
CMPDAffix *CompoundAffix;
! unsigned char flagval[256];
bool usecompound;
/*
* Remaining fields are only used during dictionary construction; they are
--- 149,161 ----
char **AffixData;
int lenAffixData;
int nAffixData;
+ bool useFlagAliases;
CMPDAffix *CompoundAffix;
! unsigned char flagval[65000];
bool usecompound;
+ FlagMode flagMode;
/*
* Remaining fields are only used during dictionary construction; they are
hunspell_dict_doc.patch (text/x-patch)
*** a/doc/src/sgml/textsearch.sgml
--- b/doc/src/sgml/textsearch.sgml
***************
*** 2615,2632 **** SELECT plainto_tsquery('supernova star');
</para>
<para>
! To create an <application>Ispell</> dictionary, use the built-in
! <literal>ispell</literal> template and specify several parameters:
</para>
!
<programlisting>
! CREATE TEXT SEARCH DICTIONARY english_ispell (
TEMPLATE = ispell,
! DictFile = english,
! AffFile = english,
! StopWords = english
! );
</programlisting>
<para>
Here, <literal>DictFile</>, <literal>AffFile</>, and <literal>StopWords</>
--- 2615,2655 ----
</para>
<para>
! To create an <application>Ispell</> dictionary perform these steps:
</para>
! <itemizedlist spacing="compact" mark="bullet">
! <listitem>
! <para>
! download dictionary configuration files. <productname>OpenOffice</>
! extension files have the <filename>.oxt</> extension. It is necessary
! to extract <filename>.aff</> and <filename>.dic</> files, change extensions
! to <filename>.affix</> and <filename>.dict</>. For some dictionary
! files it is also needed to convert characters to the UTF-8 encoding
! with commands (for example, for norwegian language dictionary):
<programlisting>
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
! </programlisting>
! </para>
! </listitem>
! <listitem>
! <para>
! copy files to the <filename>$SHAREDIR/tsearch_data</> directory
! </para>
! </listitem>
! <listitem>
! <para>
! load files into PostgreSQL with the following command:
! <programlisting>
! CREATE TEXT SEARCH DICTIONARY english_hunspell (
TEMPLATE = ispell,
! DictFile = en_us,
! AffFile = en_us,
! Stopwords = english);
</programlisting>
+ </para>
+ </listitem>
+ </itemizedlist>
<para>
Here, <literal>DictFile</>, <literal>AffFile</>, and <literal>StopWords</>
***************
*** 2643,2648 **** CREATE TEXT SEARCH DICTIONARY english_ispell (
--- 2666,2720 ----
</para>
<para>
+ The <filename>.affix</> file of <application>Ispell</> has the following structure:
+ <programlisting>
+ prefixes
+ flag *A:
+ . > RE # As in enter > reenter
+ suffixes
+ flag T:
+ E > ST # As in late > latest
+ [^AEIOU]Y > -Y,IEST # As in dirty > dirtiest
+ [AEIOU]Y > EST # As in gray > grayest
+ [^EY] > EST # As in small > smallest
+ </programlisting>
+ </para>
+ <para>
+ And the <filename>.dict</> file has the following structure:
+ <programlisting>
+ lapse/ADGRS
+ lard/DGRS
+ large/PRTY
+ lark/MRS
+ </programlisting>
+ </para>
+
+ <para>
+ Format of the <filename>.dict</> file is:
+ <programlisting>
+ basic_form/affix_class_name
+ </programlisting>
+ </para>
+
+ <para>
+ In the <filename>.affix</> file every affix flag is described in the
+ following format:
+ <programlisting>
+ condition > [-stripping_letters,] adding_affix
+ </programlisting>
+ </para>
+
+ <para>
+ Here, condition has a format similar to the format of regular expressions.
+ It can use groupings <literal>[...]</> and <literal>[^...]</>.
+ For example, <literal>[AEIOU]Y</> means that the last letter of the word
+ is <literal>"y"</> and the penultimate letter is <literal>"a"</>,
+ <literal>"e"</>, <literal>"i"</>, <literal>"o"</> or <literal>"u"</>.
+ <literal>[^EY]</> means that the last letter is neither <literal>"e"</>
+ nor <literal>"y"</>.
+ </para>
+
+ <para>
Ispell dictionaries support splitting compound words;
a useful feature.
Notice that the affix file should specify a special flag using the
***************
*** 2663,2668 **** SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
--- 2735,2796 ----
</programlisting>
</para>
+ <para>
+ <application>MySpell</> is very similar to <application>Hunspell</>.
+ The <filename>.affix</> file of <application>Hunspell</> has the following structure:
+ <programlisting>
+ PFX A Y 1
+ PFX A 0 re .
+ SFX T N 4
+ SFX T 0 st e
+ SFX T y iest [^aeiou]y
+ SFX T 0 est [aeiou]y
+ SFX T 0 est [^ey]
+ </programlisting>
+ </para>
+
+ <para>
+ The first line of an affix class is the header. Fields of an affix rules are listed after the header:
+ </para>
+ <itemizedlist spacing="compact" mark="bullet">
+ <listitem>
+ <para>
+ parameter name (PFX or SFX)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ flag (name of the affix class)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ stripping characters from beginning (at prefix) or end (at suffix) of the word
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ adding affix
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ condition that has a format similar to the format of regular expressions.
+ </para>
+ </listitem>
+ </itemizedlist>
+
+ <para>
+ The <filename>.dict</> file looks like the <filename>.dict</> file of
+ <application>Ispell</>:
+ <programlisting>
+ larder/M
+ lardy/RT
+ large/RSPMYT
+ largehearted
+ </programlisting>
+ </para>
+
<note>
<para>
<application>MySpell</> does not support compound words.
Artur Zakirov wrote:
*** 77,83 **** typedef struct spell_struct
typedef struct aff_struct
{
! uint32 flag:8,
type:1,
flagflags:7,
issimple:1,
--- 74,80 ----
typedef struct aff_struct
{
! uint32 flag:16,
type:1,
flagflags:7,
issimple:1,
By doing this, you're using 40 bits of a 32-bits-wide field. What does
this mean? Are the final 8 bits lost? Does the compiler allocate a
second uint32 member for those additional bits? I don't know, but I
don't think this is a very clean idea.
typedef struct spell_struct
{
! union
{
! /*
! * flag is filled in by NIImportDictionary. After NISortDictionary, d
! * is valid and flag is invalid.
! */
! char flag[MAXFLAGLEN];
! struct
! {
! int affix;
! int len;
! } d;
! } p;
char word[FLEXIBLE_ARRAY_MEMBER];
} SPELL;
--- 57,72 ----
typedef struct spell_struct
{
! struct
{
! int affix;
! int len;
! } d;
! /*
! * flag is filled in by NIImportDictionary. After NISortDictionary, d
! * is valid and flag is invalid.
! */
! char *flag;
char word[FLEXIBLE_ARRAY_MEMBER];
} SPELL;
Here you removed the union, with no rationale for doing so. Why did you
do it? Can it be avoided? Because of the comment, I'd imagine that d
and flag are valid at different times, so at any time we care about only
one of them; but you haven't updated the comment stating the reason for
that no longer to be the case. I suspect you need to keep flag valid
after NISortDictionary has been called, but if so why? If "flag" is
invalid as the comment says, what's the reason for keeping it?
The routines decodeFlag and isAffixFlagInUse could do with more
comments. Your patch adds zero. Actually the whole file has not nearly
enough comments; adding some more would be very good.
Actually, after some more reading, I think this code is pretty terrible.
I have a hard time figuring out how the original works, which makes it
even more difficult to figure out whether your changes make sense. I
would have to take your patch on faith, which doesn't sound so great an
idea.
palloc / cpalloc / tmpalloc make the whole mess even more confusing.
Why does this file have three ways to allocate memory?
Not sure what's a good way to go about this. I am certainly not going
to commit this as is, because if I do whatever bugs you have are going
to become my problem; and with the severe lack of documentation and
given how fiddly this stuff is, I bet there are going to be a bunch of
bugs. I suspect most committers are going to be in the same position.
I think you should start by adding a few comments here and there on top
of the original to explain how it works, then your patch on top. I
suppose it's going to be a lot of work for you but I don't see any other
way.
A top-level overview about it would be good, too. The current comment
at top of file states:
* spell.c
* Normalizing word with ISpell
which is, err, somewhat laconic.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Artur Zakirov wrote:
Now almost all dictionaries are loaded into PostgreSQL. But the da_dk
dictionary does not load. I see the following error:
ERROR: invalid regular expression: quantifier operand invalid
CONTEXT: line 439 of configuration file
"/home/artur/progs/pgsql/share/tsearch_data/da_dk.affix": "SFX 55 0 s
+GENITIV
If you open the affix file in editor you can see that there is incorrect
format of the affix 55 in 439 line (screen1.png):
[ another email ]
I also had implemented a patch that fixes an error from the e-mail
/messages/by-id/562E1073.8030805@postgrespro.ru
This patch just ignore that error.
I think it's a bad idea to just ignore these syntax errors. This affix
file is effectively corrupt, after all, so it seems a bad idea that we
need to cope with it. I think it would be better to raise the error
normally and instruct the user to fix the file; obviously it's better if
the upstream provider of the file fixes it.
Now, if there is proof somewhere that the file is correct, then the code
must cope in some reasonable way. But in any case I don't think this
change is acceptable ... it can only cause pain, in the long run.
*** 429,443 **** NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen,
REG_ADVANCED | REG_NOSUB,
DEFAULT_COLLATION_OID);
if (err)
! {
! char errstr[100];
!
! pg_regerror(err, &(Affix->reg.regex), errstr, sizeof(errstr));
! ereport(ERROR,
! (errcode(ERRCODE_INVALID_REGULAR_EXPRESSION),
! errmsg("invalid regular expression: %s", errstr)));
! }
}
Affix->flagflags = flagflags;
--- 429,437 ----
err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen,
REG_ADVANCED | REG_NOSUB,
DEFAULT_COLLATION_OID);
+ /* Ignore regular expression error and do not add wrong affix */
if (err)
! return;
}
Affix->flagflags = flagflags;
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thanks for review.
On 09.01.2016 02:04, Alvaro Herrera wrote:
Artur Zakirov wrote:
--- 74,80 ----
typedef struct aff_struct
{
! uint32 flag:16,
type:1,
flagflags:7,
issimple:1,
By doing this, you're using 40 bits of a 32-bits-wide field. What does
this mean? Are the final 8 bits lost? Does the compiler allocate a
second uint32 member for those additional bits? I don't know, but I
don't think this is a very clean idea.
No, the 8 bits are not lost. These 8 bits are used if a dictionary uses
the double extended ASCII character flag type (Conf->flagMode == FM_LONG)
or the decimal number flag type (Conf->flagMode == FM_NUM). If a dictionary
uses the single extended ASCII character flag type (Conf->flagMode ==
FM_CHAR), then these 8 bits are simply unused.
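On the storage question itself: when a bit-field does not fit in what is left of the current storage unit, C leaves it to the implementation whether the field starts a new unit or straddles two, but in either case the struct grows and no declared bits are dropped. Below is a minimal sketch of that. It assumes the original fields add up to exactly 32 bits, so the hypothetical `rest:15` member stands in for the fields not shown in the hunk, and the struct names are illustrative rather than the ones in spell.h.

```c
#include <assert.h>
#include <limits.h>

/* Layout before the patch: 8 + 1 + 7 + 1 + 15 = 32 bits. */
typedef struct
{
    unsigned int flag:8,
                 type:1,
                 flagflags:7,
                 issimple:1,
                 rest:15;       /* hypothetical stand-in for fields not shown */
} NarrowAffix;

/* Layout after the patch: 16 + 1 + 7 + 1 + 15 = 40 bits. */
typedef struct
{
    unsigned int flag:16,
                 type:1,
                 flagflags:7,
                 issimple:1,
                 rest:15;       /* hypothetical stand-in for fields not shown */
} WideAffix;

/*
 * Store the largest values the widened layout can hold and read them back.
 * Returns 1 if nothing was truncated, 0 otherwise.
 */
static int
wide_flag_roundtrip(void)
{
    WideAffix a = {0};

    a.flag = 65535u;            /* largest 16-bit flag */
    a.rest = 0x7FFFu;           /* largest 15-bit value */
    return a.flag == 65535u && a.rest == 0x7FFFu;
}
```

The exact value of `sizeof(WideAffix)` is implementation-defined; on common ABIs it becomes two 32-bit units, which is the cost of widening `flag` in place.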
You can see this in the decodeFlag function. It is used in the
NIImportOOAffixes function to decode an affix flag from its string form;
the result is stored in the flag field (flag:16).
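To make the three flag types concrete, here is a standalone sketch of the decoding step and of a membership test over a flag set. It is a simplification, not the patch code: `decode_flag` and `flag_set_contains` are hypothetical names, the mode is passed directly instead of through IspellDict, the FM_NUM branch skips exactly one separating comma, and the FM_LONG branch casts through unsigned char (the posted patch casts the plain char to int there).

```c
#include <assert.h>
#include <stdlib.h>

typedef enum
{
    FM_CHAR,                    /* one byte per flag */
    FM_LONG,                    /* two bytes per flag */
    FM_NUM                      /* decimal numbers separated by commas */
} FlagMode;

/*
 * Decode the flag that starts at s and set *next to the start of the
 * following flag (or to the terminating NUL).
 */
static unsigned short
decode_flag(FlagMode mode, const char *s, const char **next)
{
    unsigned short flag;
    char       *end;

    switch (mode)
    {
        case FM_LONG:
            /* Pack two bytes; unsigned char avoids sign extension. */
            flag = (unsigned short) (((unsigned char) s[0] << 8) |
                                     (unsigned char) s[1]);
            /* Do not step past the terminator on an odd-length set. */
            *next = s + (s[1] != '\0' ? 2 : 1);
            break;
        case FM_NUM:
            flag = (unsigned short) strtol(s, &end, 10);
            if (*end == ',')
                end++;
            *next = end;
            break;
        default:
            flag = (unsigned char) s[0];
            *next = s + 1;
            break;
    }
    return flag;
}

/* Return 1 if the flag set contains the wanted flag, else 0. */
static int
flag_set_contains(FlagMode mode, const char *set, unsigned short wanted)
{
    const char *cur = set;

    while (*cur != '\0')
    {
        const char *next;

        if (decode_flag(mode, cur, &next) == wanted)
            return 1;
        if (next == cur)
            break;              /* malformed input: stop instead of looping */
        cur = next;
    }
    return 0;
}
```

With this shape, a lookup such as `flag_set_contains(FM_NUM, "2240,852,749", 852)` walks the set one decoded flag at a time, which is why a plain strchr over the flag string no longer works once flags are wider than one byte.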
typedef struct spell_struct
{
! struct
{
! int affix;
! int len;
! } d;
! /*
! * flag is filled in by NIImportDictionary. After NISortDictionary, d
! * is valid and flag is invalid.
! */
! char *flag;
char word[FLEXIBLE_ARRAY_MEMBER];
} SPELL;
Here you removed the union, with no rationale for doing so. Why did you
do it? Can it be avoided? Because of the comment, I'd imagine that d
and flag are valid at different times, so at any time we care about only
one of them; but you haven't updated the comment stating the reason for
that no longer to be the case. I suspect you need to keep flag valid
after NISortDictionary has been called, but if so why? If "flag" is
invalid as the comment says, what's the reason for keeping it?
The union was removed because the flag field needs to be dynamically
sized. In the previous version it had a fixed size of 16 (MAXFLAGLEN).
This field stores a flag set. For example, if the .dict file has the entry:
mitigate/NDSGny
then "NDSGny" is stored in the flag field.
But in some cases a flag set can be longer than 16 characters. I added
this change after this message:
/messages/by-id/CAE2gYzwom3=11U9G8ZxMT5PLkZrwb12BWzxh4dB3HUd89FOSrg@mail.gmail.com
That Turkish dictionary can contain large flag sets. For example:
�zek/2240,852,749,5026,2242,4455,2594,2597,4963,1608,494,2409
This flag set is 56 characters long.
The "flag" field is valid all the time. It is used in NISortDictionary and
is not used after NISortDictionary has been called. Maybe you are right
and there is no reason to keep it, and all flags should be stored in a
separate variable that is freed after NISortDictionary has been called.
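A small sketch of why the fixed-size field is not enough for that entry. It is only an illustration under stated assumptions: `store_fixed` and `store_dynamic` are hypothetical helpers, snprintf stands in for the truncating strlcpy call of the old code, and malloc stands in for cpstrdup.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXFLAGLEN 16           /* size of the old fixed flag buffer */

/*
 * Copy a flag set into a fixed buffer, truncating like the old code did.
 * Returns the number of characters actually stored.
 */
static size_t
store_fixed(char dst[MAXFLAGLEN], const char *flags)
{
    snprintf(dst, MAXFLAGLEN, "%s", flags);
    return strlen(dst);
}

/*
 * Copy a flag set into freshly allocated memory of the right size.
 * Returns NULL on allocation failure; the caller frees the result.
 */
static char *
store_dynamic(const char *flags)
{
    size_t      len = strlen(flags) + 1;
    char       *copy = malloc(len);

    if (copy != NULL)
        memcpy(copy, flags, len);
    return copy;
}
```

For the 56-character Turkish flag set, the fixed buffer keeps only the first 15 characters and cuts the fourth number in half, so the entry would silently get the wrong flags, while the dynamically sized copy keeps the whole set.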
The routines decodeFlag and isAffixFlagInUse could do with more
comments. Your patch adds zero. Actually the whole file has not nearly
enough comments; adding some more would be very good.
Actually, after some more reading, I think this code is pretty terrible.
I have a hard time figuring out how the original works, which makes it
even more difficult to figure out whether your changes make sense. I
would have to take your patch on faith, which doesn't sound so great an
idea.
palloc / cpalloc / tmpalloc make the whole mess even more confusing.
Why does this file have three ways to allocate memory?
Not sure what's a good way to go about this. I am certainly not going
to commit this as is, because if I do whatever bugs you have are going
to become my problem; and with the severe lack of documentation and
given how fiddly this stuff is, I bet there are going to be a bunch of
bugs. I suspect most committers are going to be in the same position.
I think you should start by adding a few comments here and there on top
of the original to explain how it works, then your patch on top. I
suppose it's going to be a lot of work for you but I don't see any other
way.
A top-level overview about it would be good, too. The current comment
at top of file states:
* spell.c
* Normalizing word with ISpell
which is, err, somewhat laconic.
I will add comments that explain how the code works. I may also add some
explanation of the dictionary structure. I will update the patch soon.
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On 09.01.2016 05:38, Alvaro Herrera wrote:
Artur Zakirov wrote:
Now almost all dictionaries are loaded into PostgreSQL. But the da_dk
dictionary does not load. I see the following error:
ERROR: invalid regular expression: quantifier operand invalid
CONTEXT: line 439 of configuration file
"/home/artur/progs/pgsql/share/tsearch_data/da_dk.affix": "SFX 55 0 s
+GENITIV
If you open the affix file in editor you can see that there is incorrect
format of the affix 55 in 439 line (screen1.png):
[ another email ]
I also had implemented a patch that fixes an error from the e-mail
/messages/by-id/562E1073.8030805@postgrespro.ru
This patch just ignore that error.
I think it's a bad idea to just ignore these syntax errors. This affix
file is effectively corrupt, after all, so it seems a bad idea that we
need to cope with it. I think it would be better to raise the error
normally and instruct the user to fix the file; obviously it's better if
the upstream provider of the file fixes it.
Now, if there is proof somewhere that the file is correct, then the code
must cope in some reasonable way. But in any case I don't think this
change is acceptable ... it can only cause pain, in the long run.
This error is raised for the Danish dictionary because of an erroneous
entry in the .affix file. I sent a bug report to the developer, and he
fixed the bug. The corrected dictionary can be downloaded from the
LibreOffice site.
I will revert the changes so that the error is raised again. I will
update the patch soon.
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Artur Zakirov wrote:
I undo the changes and the error will be raised. I will update the patch
soon.
I don't think you ever did this. I'm closing it now, but it sounds
useful stuff so please do resubmit for 2016-03.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 28.01.2016 14:19, Alvaro Herrera wrote:
Artur Zakirov wrote:
I undo the changes and the error will be raised. I will update the patch
soon.
I don't think you ever did this. I'm closing it now, but it sounds
useful stuff so please do resubmit for 2016-03.
I'm working on the patch. I wanted to send these changes together with
all the other changes.
This version of the patch has a top-level comment. I will provide the
other comments soon.
This patch also simplifies some ternary operators.
I don't think you ever did this. I'm closing it now, but it sounds
useful stuff so please do resubmit for 2016-03.
Moved to next CF.
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachments:
hunspell_dict_v5.patch (text/x-patch)
*** a/doc/src/sgml/textsearch.sgml
--- b/doc/src/sgml/textsearch.sgml
***************
*** 2615,2632 **** SELECT plainto_tsquery('supernova star');
</para>
<para>
! To create an <application>Ispell</> dictionary, use the built-in
! <literal>ispell</literal> template and specify several parameters:
</para>
!
<programlisting>
! CREATE TEXT SEARCH DICTIONARY english_ispell (
TEMPLATE = ispell,
! DictFile = english,
! AffFile = english,
! StopWords = english
! );
</programlisting>
<para>
Here, <literal>DictFile</>, <literal>AffFile</>, and <literal>StopWords</>
--- 2615,2655 ----
</para>
<para>
! To create an <application>Ispell</> dictionary perform these steps:
</para>
! <itemizedlist spacing="compact" mark="bullet">
! <listitem>
! <para>
! download dictionary configuration files. <productname>OpenOffice</>
! extension files have the <filename>.oxt</> extension. It is necessary
! to extract <filename>.aff</> and <filename>.dic</> files, change extensions
! to <filename>.affix</> and <filename>.dict</>. For some dictionary
! files it is also needed to convert characters to the UTF-8 encoding
! with commands (for example, for a Norwegian language dictionary):
<programlisting>
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
! </programlisting>
! </para>
! </listitem>
! <listitem>
! <para>
! copy files to the <filename>$SHAREDIR/tsearch_data</> directory
! </para>
! </listitem>
! <listitem>
! <para>
! load files into PostgreSQL with the following command:
! <programlisting>
! CREATE TEXT SEARCH DICTIONARY english_hunspell (
TEMPLATE = ispell,
! DictFile = en_us,
! AffFile = en_us,
! Stopwords = english);
</programlisting>
+ </para>
+ </listitem>
+ </itemizedlist>
<para>
Here, <literal>DictFile</>, <literal>AffFile</>, and <literal>StopWords</>
***************
*** 2643,2648 **** CREATE TEXT SEARCH DICTIONARY english_ispell (
--- 2666,2720 ----
</para>
<para>
+ The <filename>.affix</> file of <application>Ispell</> has the following structure:
+ <programlisting>
+ prefixes
+ flag *A:
+ . > RE # As in enter > reenter
+ suffixes
+ flag T:
+ E > ST # As in late > latest
+ [^AEIOU]Y > -Y,IEST # As in dirty > dirtiest
+ [AEIOU]Y > EST # As in gray > grayest
+ [^EY] > EST # As in small > smallest
+ </programlisting>
+ </para>
+ <para>
+ And the <filename>.dict</> file has the following structure:
+ <programlisting>
+ lapse/ADGRS
+ lard/DGRS
+ large/PRTY
+ lark/MRS
+ </programlisting>
+ </para>
+
+ <para>
+ Format of the <filename>.dict</> file is:
+ <programlisting>
+ basic_form/affix_class_name
+ </programlisting>
+ </para>
+
+ <para>
+ In the <filename>.affix</> file every affix flag is described in the
+ following format:
+ <programlisting>
+ condition > [-stripping_letters,] adding_affix
+ </programlisting>
+ </para>
+
+ <para>
+ Here, condition has a format similar to the format of regular expressions.
+ It can use groupings <literal>[...]</> and <literal>[^...]</>.
+ For example, <literal>[AEIOU]Y</> means that the last letter of the word
+ is <literal>"y"</> and the penultimate letter is <literal>"a"</>,
+ <literal>"e"</>, <literal>"i"</>, <literal>"o"</> or <literal>"u"</>.
+ <literal>[^EY]</> means that the last letter is neither <literal>"e"</>
+ nor <literal>"y"</>.
+ </para>
+
+ <para>
Ispell dictionaries support splitting compound words;
a useful feature.
Notice that the affix file should specify a special flag using the
***************
*** 2663,2668 **** SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
--- 2735,2796 ----
</programlisting>
</para>
+ <para>
+ <application>MySpell</> is very similar to <application>Hunspell</>.
+ The <filename>.affix</> file of <application>Hunspell</> has the following structure:
+ <programlisting>
+ PFX A Y 1
+ PFX A 0 re .
+ SFX T N 4
+ SFX T 0 st e
+ SFX T y iest [^aeiou]y
+ SFX T 0 est [aeiou]y
+ SFX T 0 est [^ey]
+ </programlisting>
+ </para>
+
+ <para>
+ The first line of an affix class is the header. Fields of an affix rules are listed after the header:
+ </para>
+ <itemizedlist spacing="compact" mark="bullet">
+ <listitem>
+ <para>
+ parameter name (PFX or SFX)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ flag (name of the affix class)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ stripping characters from beginning (at prefix) or end (at suffix) of the word
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ adding affix
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ condition that has a format similar to the format of regular expressions.
+ </para>
+ </listitem>
+ </itemizedlist>
+
+ <para>
+ The <filename>.dict</> file looks like the <filename>.dict</> file of
+ <application>Ispell</>:
+ <programlisting>
+ larder/M
+ lardy/RT
+ large/RSPMYT
+ largehearted
+ </programlisting>
+ </para>
+
<note>
<para>
<application>MySpell</> does not support compound words.
*** a/src/backend/tsearch/spell.c
--- b/src/backend/tsearch/spell.c
***************
*** 5,10 ****
--- 5,54 ----
*
* Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
*
+ * Ispell dictionary
+ * --------------------------------
+ *
+ * Rules of dictionaries are defined in two files with .affix and .dict
+ * extensions. They are used by spell checker programs Ispell and Hunspell.
+ *
+ * An .affix file declares morphological rules to get a basic form of words.
+ * The format of an .affix file has different structure for Ispell and Hunspell
+ * dictionaries. The Hunspell format is more complicated. But when an .affix
+ * file is imported and compiled, it is stored in the same structure AffixNode.
+ *
+ * A .dict file stores a list of basic forms of words with references to
+ * affix rules. The format of a .dict file has the same structure for Ispell
+ * and Hunspell dictionaries.
+ *
+ * Compilation of a dictionary
+ * ---------------------------
+ *
+ * A compiled dictionary is stored in the IspellDict structure. Compilation of
+ * a dictionary is divided into the several steps:
+ * - NIImportDictionary() - stores each word of a .dict file in the
+ * temporary Spell field.
+ * - NIImportAffixes() - stores affix rules of an .affix file in the
+ * Affix field (not temporary) if an .affix file has the Ispell format.
+ * -> NIImportOOAffixes() - stores affix rules if an .affix file has the
+ * Hunspell format. The AffixData field is initialized if AF parameter
+ * is defined.
+ * - NISortDictionary() - builds a prefix tree (Trie) from the words list
+ * and stores it in the Dictionary field. The AffixData field is initialized
+ * if AF parameter is not defined.
+ * - NISortAffixes():
+ * - builds a list of compound affixes and stores it in the CompoundAffix.
+ * - builds prefix trees (Trie) from the affix list for prefixes and suffixes
+ * and stores them in Suffix and Prefix fields.
+ *
+ * Memory management
+ * -----------------
+ *
+ * The IspellDict structure has the Spell field which is used only in compile
+ * time. The Spell field stores a words list. It can take a lot of memory.
+ * Therefore when a dictionary is compiled this field is cleared by NIFinishBuild.
+ *
+ * All resources that should be released by NIFinishBuild are allocated using
+ * tmpalloc() and tmpalloc0().
*
* IDENTIFICATION
* src/backend/tsearch/spell.c
***************
*** 153,159 **** cmpspell(const void *s1, const void *s2)
static int
cmpspellaffix(const void *s1, const void *s2)
{
! return (strncmp((*(SPELL *const *) s1)->p.flag, (*(SPELL *const *) s2)->p.flag, MAXFLAGLEN));
}
static char *
--- 197,203 ----
static int
cmpspellaffix(const void *s1, const void *s2)
{
! return (strncmp((*(SPELL *const *) s1)->flag, (*(SPELL *const *) s2)->flag, MAXFLAGLEN));
}
static char *
***************
*** 237,242 **** cmpaffix(const void *s1, const void *s2)
--- 281,353 ----
(const unsigned char *) a2->repl);
}
+ static unsigned short
+ decodeFlag(IspellDict *Conf, char *sflag, char **sflagnext)
+ {
+ unsigned short s;
+ char *next;
+
+ switch (Conf->flagMode)
+ {
+ case FM_LONG:
+ s = (int) (unsigned char) sflag[0] << 8 | (int) (unsigned char) sflag[1];
+ if (sflagnext)
+ *sflagnext = sflag + 2;
+ break;
+ case FM_NUM:
+ s = (unsigned short) strtol(sflag, &next, 10);
+ if (sflagnext)
+ {
+ if (next)
+ {
+ *sflagnext = next;
+ while (**sflagnext)
+ {
+ if (**sflagnext == ',')
+ {
+ *sflagnext = *sflagnext + 1;
+ break;
+ }
+ *sflagnext = *sflagnext + 1;
+ }
+ }
+ else
+ *sflagnext = 0;
+ }
+ break;
+ default:
+ s = (unsigned short) *((unsigned char *)sflag);
+ if (sflagnext)
+ *sflagnext = sflag + 1;
+ }
+
+ return s;
+ }
+
+ static bool
+ isAffixFlagInUse(IspellDict *Conf, int affix, unsigned short affixflag)
+ {
+ char *flagcur;
+ char *flagnext = 0;
+
+ if (affixflag == 0)
+ return true;
+
+ flagcur = Conf->AffixData[affix];
+
+ while (*flagcur)
+ {
+ if (decodeFlag(Conf, flagcur, &flagnext) == affixflag)
+ return true;
+ if (flagnext)
+ flagcur = flagnext;
+ else
+ break;
+ }
+
+ return false;
+ }
+
static void
NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
{
***************
*** 255,261 **** NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
}
Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
strcpy(Conf->Spell[Conf->nspell]->word, word);
! strlcpy(Conf->Spell[Conf->nspell]->p.flag, flag, MAXFLAGLEN);
Conf->nspell++;
}
--- 366,372 ----
}
Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
strcpy(Conf->Spell[Conf->nspell]->word, word);
! Conf->Spell[Conf->nspell]->flag = (*flag != '\0') ? cpstrdup(Conf, flag) : VoidString;
Conf->nspell++;
}
***************
*** 355,361 **** FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
else if ((flag & StopMiddle->compoundflag) == 0)
return 0;
! if ((affixflag == 0) || (strchr(Conf->AffixData[StopMiddle->affix], affixflag) != NULL))
return 1;
}
node = StopMiddle->node;
--- 466,472 ----
else if ((flag & StopMiddle->compoundflag) == 0)
return 0;
! if (isAffixFlagInUse(Conf, StopMiddle->affix, affixflag))
return 1;
}
node = StopMiddle->node;
***************
*** 394,400 **** NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
Affix = Conf->Affix + Conf->naffixes;
! if (strcmp(mask, ".") == 0)
{
Affix->issimple = 1;
Affix->isregis = 0;
--- 505,511 ----
Affix = Conf->Affix + Conf->naffixes;
! if (strcmp(mask, ".") == 0 || *mask == '\0')
{
Affix->issimple = 1;
Affix->isregis = 0;
***************
*** 403,409 **** NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
{
Affix->issimple = 0;
Affix->isregis = 1;
! RS_compile(&(Affix->reg.regis), (type == FF_SUFFIX) ? true : false,
*mask ? mask : VoidString);
}
else
--- 514,520 ----
{
Affix->issimple = 0;
Affix->isregis = 1;
! RS_compile(&(Affix->reg.regis), (type == FF_SUFFIX),
*mask ? mask : VoidString);
}
else
***************
*** 576,582 **** parse_affentry(char *str, char *mask, char *find, char *repl)
*pmask = *pfind = *prepl = '\0';
! return (*mask && (*find || *repl)) ? true : false;
}
static void
--- 687,693 ----
*pmask = *pfind = *prepl = '\0';
! return (*mask && (*find || *repl));
}
static void
***************
*** 595,604 **** addFlagValue(IspellDict *Conf, char *s, uint32 val)
(errcode(ERRCODE_CONFIG_FILE_ERROR),
errmsg("multibyte flag character is not allowed")));
! Conf->flagval[*(unsigned char *) s] = (unsigned char) val;
Conf->usecompound = true;
}
/*
* Import an affix file that follows MySpell or Hunspell format
*/
--- 706,763 ----
(errcode(ERRCODE_CONFIG_FILE_ERROR),
errmsg("multibyte flag character is not allowed")));
! Conf->flagval[decodeFlag(Conf, s, (char **)NULL)] = (unsigned char) val;
Conf->usecompound = true;
}
+ static int
+ getFlagValues(IspellDict *Conf, char *s)
+ {
+ uint32 flag = 0;
+ char *flagcur;
+ char *flagnext = 0;
+
+ flagcur = s;
+ while (*flagcur)
+ {
+ flag |= Conf->flagval[decodeFlag(Conf, flagcur, &flagnext)];
+ if (flagnext)
+ flagcur = flagnext;
+ else
+ break;
+ }
+
+ return flag;
+ }
+
+ /*
+ * Get flag set from "s".
+ *
+ * Returns flag set from AffixData array if AF parameter used (useFlagAliases is true).
+ * In this case "s" is alias for flag set.
+ *
+ * Otherwise returns "s".
+ */
+ static char *
+ getFlags(IspellDict *Conf, char *s)
+ {
+ int curaffix;
+ if (Conf->useFlagAliases)
+ {
+ curaffix = strtol(s, (char **)NULL, 10);
+ if (curaffix && curaffix <= Conf->nAffixData)
+ /*
+ * Do not subtract 1 from curaffix
+ * because empty string was added in NIImportOOAffixes
+ */
+ return Conf->AffixData[curaffix];
+ else
+ return VoidString;
+ }
+ else
+ return s;
+ }
+
/*
* Import an affix file that follows MySpell or Hunspell format
*/
***************
*** 615,621 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
char repl[BUFSIZ],
*prepl;
bool isSuffix = false;
! int flag = 0;
char flagflags = 0;
tsearch_readline_state trst;
int scanread = 0;
--- 774,784 ----
char repl[BUFSIZ],
*prepl;
bool isSuffix = false;
! int naffix = 0,
! curaffix = 0;
! int flag = 0,
! flagprev = 0,
! sflaglen = 0;
char flagflags = 0;
tsearch_readline_state trst;
int scanread = 0;
***************
*** 625,630 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
--- 788,795 ----
/* read file to find any flag */
memset(Conf->flagval, 0, sizeof(Conf->flagval));
Conf->usecompound = false;
+ Conf->useFlagAliases = false;
+ Conf->flagMode = FM_CHAR;
if (!tsearch_readline_begin(&trst, filename))
ereport(ERROR,
***************
*** 672,681 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
while (*s && t_isspace(s))
s += pg_mblen(s);
! if (*s && STRNCMP(s, "default") != 0)
! ereport(ERROR,
(errcode(ERRCODE_CONFIG_FILE_ERROR),
! errmsg("Ispell dictionary supports only default flag value")));
}
pfree(recoded);
--- 837,853 ----
while (*s && t_isspace(s))
s += pg_mblen(s);
! if (*s)
! {
! if (STRNCMP(s, "long") == 0)
! Conf->flagMode = FM_LONG;
! else if (STRNCMP(s, "num") == 0)
! Conf->flagMode = FM_NUM;
! else if (STRNCMP(s, "default") != 0)
! ereport(ERROR,
(errcode(ERRCODE_CONFIG_FILE_ERROR),
! errmsg("Ispell dictionary supports only default, long and num flag value")));
! }
}
pfree(recoded);
***************
*** 695,725 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
if (*recoded == '\0' || t_isspace(recoded) || t_iseq(recoded, '#'))
goto nextline;
scanread = sscanf(recoded, scanbuf, type, sflag, find, repl, mask);
if (ptype)
pfree(ptype);
ptype = lowerstr_ctx(Conf, type);
if (scanread < 4 || (STRNCMP(ptype, "sfx") && STRNCMP(ptype, "pfx")))
goto nextline;
! if (scanread == 4)
{
! if (strlen(sflag) != 1)
! goto nextline;
! flag = *sflag;
! isSuffix = (STRNCMP(ptype, "sfx") == 0) ? true : false;
if (t_iseq(find, 'y') || t_iseq(find, 'Y'))
flagflags = FF_CROSSPRODUCT;
else
flagflags = 0;
}
else
{
char *ptr;
int aflg = 0;
! if (strlen(sflag) != 1 || flag != *sflag || flag == 0)
goto nextline;
prepl = lowerstr_ctx(Conf, repl);
/* affix flag */
--- 867,941 ----
if (*recoded == '\0' || t_isspace(recoded) || t_iseq(recoded, '#'))
goto nextline;
+ *find = *repl = *mask = '\0';
scanread = sscanf(recoded, scanbuf, type, sflag, find, repl, mask);
if (ptype)
pfree(ptype);
ptype = lowerstr_ctx(Conf, type);
+
+ /* First try to parse AF parameter (alias compression) */
+ if (STRNCMP(ptype, "af") == 0)
+ {
+ /* First line is the number of aliases */
+ if (!Conf->useFlagAliases)
+ {
+ Conf->useFlagAliases = true;
+ naffix = atoi(sflag);
+ if (naffix == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONFIG_FILE_ERROR),
+ errmsg("invalid number of flag vector aliases")));
+
+ /* Also reserve place for empty flag set */
+ naffix++;
+
+ Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
+ Conf->lenAffixData = Conf->nAffixData = naffix;
+
+ /* Add empty flag set into AffixData */
+ Conf->AffixData[curaffix] = VoidString;
+ curaffix++;
+ }
+ /* Other lines is aliases */
+ else
+ {
+ if (curaffix < naffix)
+ {
+ Conf->AffixData[curaffix] = cpstrdup(Conf, sflag);
+ curaffix++;
+ }
+ }
+ goto nextline;
+ }
+ /* Else try to parse prefixes and suffixes */
if (scanread < 4 || (STRNCMP(ptype, "sfx") && STRNCMP(ptype, "pfx")))
goto nextline;
! sflaglen = strlen(sflag);
! if (sflaglen == 0
! || (sflaglen > 1 && Conf->flagMode == FM_CHAR)
! || (sflaglen > 2 && Conf->flagMode == FM_LONG))
! goto nextline;
! flag = decodeFlag(Conf, sflag, (char **)NULL);
!
! /* Affix header */
! if (flag != flagprev)
{
! flagprev = flag;
! isSuffix = (STRNCMP(ptype, "sfx") == 0);
if (t_iseq(find, 'y') || t_iseq(find, 'Y'))
flagflags = FF_CROSSPRODUCT;
else
flagflags = 0;
}
+ /* Affix fields */
else
{
char *ptr;
int aflg = 0;
! if (flag == 0)
goto nextline;
prepl = lowerstr_ctx(Conf, repl);
/* affix flag */
***************
*** 727,737 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
{
*ptr = '\0';
ptr = repl + (ptr - prepl) + 1;
! while (*ptr)
! {
! aflg |= Conf->flagval[*(unsigned char *) ptr];
! ptr++;
! }
}
pfind = lowerstr_ctx(Conf, find);
pmask = lowerstr_ctx(Conf, mask);
--- 943,949 ----
{
*ptr = '\0';
ptr = repl + (ptr - prepl) + 1;
! aflg |= getFlagValues(Conf, getFlags(Conf, ptr));
}
pfind = lowerstr_ctx(Conf, find);
pmask = lowerstr_ctx(Conf, mask);
***************
*** 789,794 **** NIImportAffixes(IspellDict *Conf, const char *filename)
--- 1001,1008 ----
memset(Conf->flagval, 0, sizeof(Conf->flagval));
Conf->usecompound = false;
+ Conf->useFlagAliases = false;
+ Conf->flagMode = FM_CHAR;
while ((recoded = tsearch_readline(&trst)) != NULL)
{
***************
*** 931,946 **** MergeAffix(IspellDict *Conf, int a1, int a2)
static uint32
makeCompoundFlags(IspellDict *Conf, int affix)
{
! uint32 flag = 0;
! char *str = Conf->AffixData[affix];
!
! while (str && *str)
! {
! flag |= Conf->flagval[*(unsigned char *) str];
! str++;
! }
!
! return (flag & FF_DICTFLAGMASK);
}
static SPNode *
--- 1145,1152 ----
static uint32
makeCompoundFlags(IspellDict *Conf, int affix)
{
! char *str = Conf->AffixData[affix];
! return (getFlagValues(Conf, str) & FF_DICTFLAGMASK);
}
static SPNode *
***************
*** 954,960 **** mkSPNode(IspellDict *Conf, int low, int high, int level)
int lownew = low;
for (i = low; i < high; i++)
! if (Conf->Spell[i]->p.d.len > level && lastchar != Conf->Spell[i]->word[level])
{
nchar++;
lastchar = Conf->Spell[i]->word[level];
--- 1160,1166 ----
int lownew = low;
for (i = low; i < high; i++)
! if (Conf->Spell[i]->d.len > level && lastchar != Conf->Spell[i]->word[level])
{
nchar++;
lastchar = Conf->Spell[i]->word[level];
***************
*** 969,975 **** mkSPNode(IspellDict *Conf, int low, int high, int level)
lastchar = '\0';
for (i = low; i < high; i++)
! if (Conf->Spell[i]->p.d.len > level)
{
if (lastchar != Conf->Spell[i]->word[level])
{
--- 1175,1181 ----
lastchar = '\0';
for (i = low; i < high; i++)
! if (Conf->Spell[i]->d.len > level)
{
if (lastchar != Conf->Spell[i]->word[level])
{
***************
*** 982,992 **** mkSPNode(IspellDict *Conf, int low, int high, int level)
lastchar = Conf->Spell[i]->word[level];
}
data->val = ((uint8 *) (Conf->Spell[i]->word))[level];
! if (Conf->Spell[i]->p.d.len == level + 1)
{
bool clearCompoundOnly = false;
! if (data->isword && data->affix != Conf->Spell[i]->p.d.affix)
{
/*
* MergeAffix called a few times. If one of word is
--- 1188,1198 ----
lastchar = Conf->Spell[i]->word[level];
}
data->val = ((uint8 *) (Conf->Spell[i]->word))[level];
! if (Conf->Spell[i]->d.len == level + 1)
{
bool clearCompoundOnly = false;
! if (data->isword && data->affix != Conf->Spell[i]->d.affix)
{
/*
* MergeAffix called a few times. If one of word is
***************
*** 995,1006 **** mkSPNode(IspellDict *Conf, int low, int high, int level)
*/
clearCompoundOnly = (FF_COMPOUNDONLY & data->compoundflag
! & makeCompoundFlags(Conf, Conf->Spell[i]->p.d.affix))
? false : true;
! data->affix = MergeAffix(Conf, data->affix, Conf->Spell[i]->p.d.affix);
}
else
! data->affix = Conf->Spell[i]->p.d.affix;
data->isword = 1;
data->compoundflag = makeCompoundFlags(Conf, data->affix);
--- 1201,1212 ----
*/
clearCompoundOnly = (FF_COMPOUNDONLY & data->compoundflag
! & makeCompoundFlags(Conf, Conf->Spell[i]->d.affix))
? false : true;
! data->affix = MergeAffix(Conf, data->affix, Conf->Spell[i]->d.affix);
}
else
! data->affix = Conf->Spell[i]->d.affix;
data->isword = 1;
data->compoundflag = makeCompoundFlags(Conf, data->affix);
***************
*** 1032,1070 **** NISortDictionary(IspellDict *Conf)
/* compress affixes */
! /* Count the number of different flags used in the dictionary */
!
! qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspellaffix);
!
! naffix = 0;
! for (i = 0; i < Conf->nspell; i++)
! {
! if (i == 0 || strncmp(Conf->Spell[i]->p.flag, Conf->Spell[i - 1]->p.flag, MAXFLAGLEN))
! naffix++;
! }
!
! /*
! * Fill in Conf->AffixData with the affixes that were used in the
! * dictionary. Replace textual flag-field of Conf->Spell entries with
! * indexes into Conf->AffixData array.
*/
! Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
!
! curaffix = -1;
! for (i = 0; i < Conf->nspell; i++)
{
! if (i == 0 || strncmp(Conf->Spell[i]->p.flag, Conf->AffixData[curaffix], MAXFLAGLEN))
{
! curaffix++;
! Assert(curaffix < naffix);
! Conf->AffixData[curaffix] = cpstrdup(Conf, Conf->Spell[i]->p.flag);
}
-
- Conf->Spell[i]->p.d.affix = curaffix;
- Conf->Spell[i]->p.d.len = strlen(Conf->Spell[i]->word);
}
! Conf->lenAffixData = Conf->nAffixData = naffix;
qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspell);
Conf->Dictionary = mkSPNode(Conf, 0, Conf->nspell, 0);
--- 1238,1294 ----
/* compress affixes */
! /* If we use flag aliases then we need to use Conf->AffixData filled in NIImportOOAffixes.
! * If Conf->Spell[i]->flag is empty, then get empty value of Conf->AffixData (0 index)
*/
! if (Conf->useFlagAliases)
{
! for (i = 0; i < Conf->nspell; i++)
{
! curaffix = strtol(Conf->Spell[i]->flag, (char **)NULL, 10);
! if (curaffix && curaffix <= Conf->nAffixData)
! Conf->Spell[i]->d.affix = curaffix;
! else
! Conf->Spell[i]->d.affix = 0;
! Conf->Spell[i]->d.len = strlen(Conf->Spell[i]->word);
}
}
+ /* Otherwise fill Conf->AffixData here */
+ else
+ {
+ /* Count the number of different flags used in the dictionary */
+ qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspellaffix);
+
+ naffix = 0;
+ for (i = 0; i < Conf->nspell; i++)
+ {
+ if (i == 0 || strcmp(Conf->Spell[i]->flag, Conf->Spell[i - 1]->flag))
+ naffix++;
+ }
! /*
! * Fill in Conf->AffixData with the affixes that were used in the
! * dictionary. Replace textual flag-field of Conf->Spell entries with
! * indexes into Conf->AffixData array.
! */
! Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
!
! curaffix = -1;
! for (i = 0; i < Conf->nspell; i++)
! {
! if (i == 0 || strcmp(Conf->Spell[i]->flag, Conf->AffixData[curaffix]))
! {
! curaffix++;
! Assert(curaffix < naffix);
! Conf->AffixData[curaffix] = cpstrdup(Conf, Conf->Spell[i]->flag);
! }
!
! Conf->Spell[i]->d.affix = curaffix;
! Conf->Spell[i]->d.len = strlen(Conf->Spell[i]->word);
! }
!
! Conf->lenAffixData = Conf->nAffixData = naffix;
! }
qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspell);
Conf->Dictionary = mkSPNode(Conf, 0, Conf->nspell, 0);
***************
*** 1185,1196 **** mkVoidAffix(IspellDict *Conf, bool issuffix, int startsuffix)
}
static bool
! isAffixInUse(IspellDict *Conf, char flag)
{
int i;
for (i = 0; i < Conf->nAffixData; i++)
! if (strchr(Conf->AffixData[i], flag) != NULL)
return true;
return false;
--- 1409,1420 ----
}
static bool
! isAffixInUse(IspellDict *Conf, int flag)
{
int i;
for (i = 0; i < Conf->nAffixData; i++)
! if (isAffixFlagInUse(Conf, i, flag))
return true;
return false;
***************
*** 1219,1225 **** NISortAffixes(IspellDict *Conf)
firstsuffix = i;
if ((Affix->flagflags & FF_COMPOUNDFLAG) && Affix->replen > 0 &&
! isAffixInUse(Conf, (char) Affix->flag))
{
if (ptr == Conf->CompoundAffix ||
ptr->issuffix != (ptr - 1)->issuffix ||
--- 1443,1449 ----
firstsuffix = i;
if ((Affix->flagflags & FF_COMPOUNDFLAG) && Affix->replen > 0 &&
! isAffixInUse(Conf, Affix->flag))
{
if (ptr == Conf->CompoundAffix ||
ptr->issuffix != (ptr - 1)->issuffix ||
***************
*** 1230,1236 **** NISortAffixes(IspellDict *Conf)
/* leave only unique and minimals suffixes */
ptr->affix = Affix->repl;
ptr->len = Affix->replen;
! ptr->issuffix = (Affix->type == FF_SUFFIX) ? true : false;
ptr++;
}
}
--- 1454,1460 ----
/* leave only unique and minimals suffixes */
ptr->affix = Affix->repl;
ptr->len = Affix->replen;
! ptr->issuffix = (Affix->type == FF_SUFFIX);
ptr++;
}
}
***************
*** 1685,1691 **** SplitToVariants(IspellDict *Conf, SPNode *snode, SplitVar *orig, char *word, int
if (StopLow < StopHigh)
{
! if (level == FF_COMPOUNDBEGIN)
compoundflag = FF_COMPOUNDBEGIN;
else if (level == wordlen - 1)
compoundflag = FF_COMPOUNDLAST;
--- 1909,1915 ----
if (StopLow < StopHigh)
{
! if (startpos == 0)
compoundflag = FF_COMPOUNDBEGIN;
else if (level == wordlen - 1)
compoundflag = FF_COMPOUNDLAST;
*** a/src/include/tsearch/dicts/spell.h
--- b/src/include/tsearch/dicts/spell.h
***************
*** 57,75 **** typedef struct SPNode
typedef struct spell_struct
{
! union
{
! /*
! * flag is filled in by NIImportDictionary. After NISortDictionary, d
! * is valid and flag is invalid.
! */
! char flag[MAXFLAGLEN];
! struct
! {
! int affix;
! int len;
! } d;
! } p;
char word[FLEXIBLE_ARRAY_MEMBER];
} SPELL;
--- 57,72 ----
typedef struct spell_struct
{
! struct
{
! int affix;
! int len;
! } d;
! /*
! * flag is filled in by NIImportDictionary. After NISortDictionary, d
! * is used instead of flag.
! */
! char *flag;
char word[FLEXIBLE_ARRAY_MEMBER];
} SPELL;
***************
*** 77,83 **** typedef struct spell_struct
typedef struct aff_struct
{
! uint32 flag:8,
type:1,
flagflags:7,
issimple:1,
--- 74,80 ----
typedef struct aff_struct
{
! uint32 flag:16,
type:1,
flagflags:7,
issimple:1,
***************
*** 132,137 **** typedef struct
--- 129,141 ----
bool issuffix;
} CMPDAffix;
+ typedef enum
+ {
+ FM_CHAR,
+ FM_LONG,
+ FM_NUM
+ } FlagMode;
+
typedef struct
{
int maffixes;
***************
*** 145,155 **** typedef struct
char **AffixData;
int lenAffixData;
int nAffixData;
CMPDAffix *CompoundAffix;
! unsigned char flagval[256];
bool usecompound;
/*
* Remaining fields are only used during dictionary construction; they are
--- 149,161 ----
char **AffixData;
int lenAffixData;
int nAffixData;
+ bool useFlagAliases;
CMPDAffix *CompoundAffix;
! unsigned char flagval[65000];
bool usecompound;
+ FlagMode flagMode;
/*
* Remaining fields are only used during dictionary construction; they are
Sorry, I don't know why this thread was moved to another thread.
I am resending the patch here.
On 28.01.2016 14:19, Alvaro Herrera wrote:
Artur Zakirov wrote:
I undo the changes and the error will be raised. I will update the patch
soon.
I don't think you ever did this. I'm closing it now, but it sounds
useful stuff so please do resubmit for 2016-03.
I'm working on the patch. I wanted to send these changes together with
all the other changes.
This version of the patch has a top-level comment. I will provide the
other comments soon.
This patch also simplifies some ternary operators.
I don't think you ever did this. I'm closing it now, but it sounds
useful stuff so please do resubmit for 2016-03.
Moved to next CF.
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachments:
hunspell_dict_v5.patch (text/x-patch)
*** a/doc/src/sgml/textsearch.sgml
--- b/doc/src/sgml/textsearch.sgml
***************
*** 2615,2632 **** SELECT plainto_tsquery('supernova star');
</para>
<para>
! To create an <application>Ispell</> dictionary, use the built-in
! <literal>ispell</literal> template and specify several parameters:
</para>
!
<programlisting>
! CREATE TEXT SEARCH DICTIONARY english_ispell (
TEMPLATE = ispell,
! DictFile = english,
! AffFile = english,
! StopWords = english
! );
</programlisting>
<para>
Here, <literal>DictFile</>, <literal>AffFile</>, and <literal>StopWords</>
--- 2615,2655 ----
</para>
<para>
! To create an <application>Ispell</> dictionary perform these steps:
</para>
! <itemizedlist spacing="compact" mark="bullet">
! <listitem>
! <para>
! download dictionary configuration files. <productname>OpenOffice</>
! extension files have the <filename>.oxt</> extension. It is necessary
! to extract <filename>.aff</> and <filename>.dic</> files, change extensions
! to <filename>.affix</> and <filename>.dict</>. For some dictionary
! files it is also needed to convert characters to the UTF-8 encoding
! with commands (for example, for a Norwegian language dictionary):
<programlisting>
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
! </programlisting>
! </para>
! </listitem>
! <listitem>
! <para>
! copy files to the <filename>$SHAREDIR/tsearch_data</> directory
! </para>
! </listitem>
! <listitem>
! <para>
! load files into PostgreSQL with the following command:
! <programlisting>
! CREATE TEXT SEARCH DICTIONARY english_hunspell (
TEMPLATE = ispell,
! DictFile = en_us,
! AffFile = en_us,
! Stopwords = english);
</programlisting>
+ </para>
+ </listitem>
+ </itemizedlist>
<para>
Here, <literal>DictFile</>, <literal>AffFile</>, and <literal>StopWords</>
***************
*** 2643,2648 **** CREATE TEXT SEARCH DICTIONARY english_ispell (
--- 2666,2720 ----
</para>
<para>
+ The <filename>.affix</> file of <application>Ispell</> has the following structure:
+ <programlisting>
+ prefixes
+ flag *A:
+ . > RE # As in enter > reenter
+ suffixes
+ flag T:
+ E > ST # As in late > latest
+ [^AEIOU]Y > -Y,IEST # As in dirty > dirtiest
+ [AEIOU]Y > EST # As in gray > grayest
+ [^EY] > EST # As in small > smallest
+ </programlisting>
+ </para>
+ <para>
+ And the <filename>.dict</> file has the following structure:
+ <programlisting>
+ lapse/ADGRS
+ lard/DGRS
+ large/PRTY
+ lark/MRS
+ </programlisting>
+ </para>
+
+ <para>
+ Format of the <filename>.dict</> file is:
+ <programlisting>
+ basic_form/affix_class_name
+ </programlisting>
+ </para>
+
+ <para>
+ In the <filename>.affix</> file every affix flag is described in the
+ following format:
+ <programlisting>
+ condition > [-stripping_letters,] adding_affix
+ </programlisting>
+ </para>
+
+ <para>
+ Here, condition has a format similar to the format of regular expressions.
+ It can use groupings <literal>[...]</> and <literal>[^...]</>.
+ For example, <literal>[AEIOU]Y</> means that the last letter of the word
+ is <literal>"y"</> and the penultimate letter is <literal>"a"</>,
+ <literal>"e"</>, <literal>"i"</>, <literal>"o"</> or <literal>"u"</>.
+ <literal>[^EY]</> means that the last letter is neither <literal>"e"</>
+ nor <literal>"y"</>.
+ </para>
+
+ <para>
Ispell dictionaries support splitting compound words;
a useful feature.
Notice that the affix file should specify a special flag using the
***************
*** 2663,2668 **** SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
--- 2735,2796 ----
</programlisting>
</para>
+ <para>
+ <application>MySpell</> is very similar to <application>Hunspell</>.
+ The <filename>.affix</> file of <application>Hunspell</> has the following structure:
+ <programlisting>
+ PFX A Y 1
+ PFX A 0 re .
+ SFX T N 4
+ SFX T 0 st e
+ SFX T y iest [^aeiou]y
+ SFX T 0 est [aeiou]y
+ SFX T 0 est [^ey]
+ </programlisting>
+ </para>
+
+ <para>
+ The first line of an affix class is the header. Fields of an affix rules are listed after the header:
+ </para>
+ <itemizedlist spacing="compact" mark="bullet">
+ <listitem>
+ <para>
+ parameter name (PFX or SFX)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ flag (name of the affix class)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ stripping characters from beginning (at prefix) or end (at suffix) of the word
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ adding affix
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ condition that has a format similar to the format of regular expressions.
+ </para>
+ </listitem>
+ </itemizedlist>
+
+ <para>
+ The <filename>.dict</> file looks like the <filename>.dict</> file of
+ <application>Ispell</>:
+ <programlisting>
+ larder/M
+ lardy/RT
+ large/RSPMYT
+ largehearted
+ </programlisting>
+ </para>
+
<note>
<para>
<application>MySpell</> does not support compound words.
*** a/src/backend/tsearch/spell.c
--- b/src/backend/tsearch/spell.c
***************
*** 5,10 ****
--- 5,54 ----
*
* Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
*
+ * Ispell dictionary
+ * --------------------------------
+ *
+ * Rules of dictionaries are defined in two files with .affix and .dict
+ * extensions. They are used by spell checker programs Ispell and Hunspell.
+ *
+ * An .affix file declares morphological rules to get a basic form of words.
+ * The format of an .affix file has different structure for Ispell and Hunspell
+ * dictionaries. The Hunspell format is more complicated. But when an .affix
+ * file is imported and compiled, it is stored in the same structure AffixNode.
+ *
+ * A .dict file stores a list of basic forms of words with references to
+ * affix rules. The format of a .dict file has the same structure for Ispell
+ * and Hunspell dictionaries.
+ *
+ * Compilation of a dictionary
+ * ---------------------------
+ *
+ * A compiled dictionary is stored in the IspellDict structure. Compilation of
+ * a dictionary is divided into the several steps:
+ * - NIImportDictionary() - stores each word of a .dict file in the
+ * temporary Spell field.
+ * - NIImportAffixes() - stores affix rules of an .affix file in the
+ * Affix field (not temporary) if an .affix file has the Ispell format.
+ * -> NIImportOOAffixes() - stores affix rules if an .affix file has the
+ * Hunspell format. The AffixData field is initialized if AF parameter
+ * is defined.
+ * - NISortDictionary() - builds a prefix tree (Trie) from the words list
+ * and stores it in the Dictionary field. The AffixData field is initialized
+ * if AF parameter is not defined.
+ * - NISortAffixes():
+ * - builds a list of compond affixes and stores it in the CompoundAffix.
+ * - builds prefix trees (Trie) from the affix list for prefixes and suffixes
+ * and stores them in Suffix and Prefix fields.
+ *
+ * Memory management
+ * -----------------
+ *
+ * The IspellDict structure has the Spell field which is used only in compile
+ * time. The Spell field stores a words list. It can take a lot of memory.
+ * Therefore when a dictionary is compiled this field is cleared by NIFinishBuild.
+ *
+ * All resources which should cleared by NIFinishBuild is initialized using
+ * tmpalloc() and tmpalloc0().
*
* IDENTIFICATION
* src/backend/tsearch/spell.c
***************
*** 153,159 **** cmpspell(const void *s1, const void *s2)
static int
cmpspellaffix(const void *s1, const void *s2)
{
! return (strncmp((*(SPELL *const *) s1)->p.flag, (*(SPELL *const *) s2)->p.flag, MAXFLAGLEN));
}
static char *
--- 197,203 ----
static int
cmpspellaffix(const void *s1, const void *s2)
{
! return (strncmp((*(SPELL *const *) s1)->flag, (*(SPELL *const *) s2)->flag, MAXFLAGLEN));
}
static char *
***************
*** 237,242 **** cmpaffix(const void *s1, const void *s2)
--- 281,353 ----
(const unsigned char *) a2->repl);
}
+ static unsigned short
+ decodeFlag(IspellDict *Conf, char *sflag, char **sflagnext)
+ {
+ unsigned short s;
+ char *next;
+
+ switch (Conf->flagMode)
+ {
+ case FM_LONG:
+ s = (int)sflag[0] << 8 | (int)sflag[1];
+ if (sflagnext)
+ *sflagnext = sflag + 2;
+ break;
+ case FM_NUM:
+ s = (unsigned short) strtol(sflag, &next, 10);
+ if (sflagnext)
+ {
+ if (next)
+ {
+ *sflagnext = next;
+ while (**sflagnext)
+ {
+ if (**sflagnext == ',')
+ {
+ *sflagnext = *sflagnext + 1;
+ break;
+ }
+ *sflagnext = *sflagnext + 1;
+ }
+ }
+ else
+ *sflagnext = 0;
+ }
+ break;
+ default:
+ s = (unsigned short) *((unsigned char *)sflag);
+ if (sflagnext)
+ *sflagnext = sflag + 1;
+ }
+
+ return s;
+ }
+
+ static bool
+ isAffixFlagInUse(IspellDict *Conf, int affix, unsigned short affixflag)
+ {
+ char *flagcur;
+ char *flagnext = 0;
+
+ if (affixflag == 0)
+ return true;
+
+ flagcur = Conf->AffixData[affix];
+
+ while (*flagcur)
+ {
+ if (decodeFlag(Conf, flagcur, &flagnext) == affixflag)
+ return true;
+ if (flagnext)
+ flagcur = flagnext;
+ else
+ break;
+ }
+
+ return false;
+ }
+
static void
NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
{
***************
*** 255,261 **** NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
}
Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
strcpy(Conf->Spell[Conf->nspell]->word, word);
! strlcpy(Conf->Spell[Conf->nspell]->p.flag, flag, MAXFLAGLEN);
Conf->nspell++;
}
--- 366,372 ----
}
Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
strcpy(Conf->Spell[Conf->nspell]->word, word);
! Conf->Spell[Conf->nspell]->flag = (*flag != '\0') ? cpstrdup(Conf, flag) : VoidString;
Conf->nspell++;
}
***************
*** 355,361 **** FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
else if ((flag & StopMiddle->compoundflag) == 0)
return 0;
! if ((affixflag == 0) || (strchr(Conf->AffixData[StopMiddle->affix], affixflag) != NULL))
return 1;
}
node = StopMiddle->node;
--- 466,472 ----
else if ((flag & StopMiddle->compoundflag) == 0)
return 0;
! if (isAffixFlagInUse(Conf, StopMiddle->affix, affixflag))
return 1;
}
node = StopMiddle->node;
***************
*** 394,400 **** NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
Affix = Conf->Affix + Conf->naffixes;
! if (strcmp(mask, ".") == 0)
{
Affix->issimple = 1;
Affix->isregis = 0;
--- 505,511 ----
Affix = Conf->Affix + Conf->naffixes;
! if (strcmp(mask, ".") == 0 || *mask == '\0')
{
Affix->issimple = 1;
Affix->isregis = 0;
***************
*** 403,409 **** NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
{
Affix->issimple = 0;
Affix->isregis = 1;
! RS_compile(&(Affix->reg.regis), (type == FF_SUFFIX) ? true : false,
*mask ? mask : VoidString);
}
else
--- 514,520 ----
{
Affix->issimple = 0;
Affix->isregis = 1;
! RS_compile(&(Affix->reg.regis), (type == FF_SUFFIX),
*mask ? mask : VoidString);
}
else
***************
*** 576,582 **** parse_affentry(char *str, char *mask, char *find, char *repl)
*pmask = *pfind = *prepl = '\0';
! return (*mask && (*find || *repl)) ? true : false;
}
static void
--- 687,693 ----
*pmask = *pfind = *prepl = '\0';
! return (*mask && (*find || *repl));
}
static void
***************
*** 595,604 **** addFlagValue(IspellDict *Conf, char *s, uint32 val)
(errcode(ERRCODE_CONFIG_FILE_ERROR),
errmsg("multibyte flag character is not allowed")));
! Conf->flagval[*(unsigned char *) s] = (unsigned char) val;
Conf->usecompound = true;
}
/*
* Import an affix file that follows MySpell or Hunspell format
*/
--- 706,763 ----
(errcode(ERRCODE_CONFIG_FILE_ERROR),
errmsg("multibyte flag character is not allowed")));
! Conf->flagval[decodeFlag(Conf, s, (char **)NULL)] = (unsigned char) val;
Conf->usecompound = true;
}
+ static int
+ getFlagValues(IspellDict *Conf, char *s)
+ {
+ uint32 flag = 0;
+ char *flagcur;
+ char *flagnext = 0;
+
+ flagcur = s;
+ while (*flagcur)
+ {
+ flag |= Conf->flagval[decodeFlag(Conf, flagcur, &flagnext)];
+ if (flagnext)
+ flagcur = flagnext;
+ else
+ break;
+ }
+
+ return flag;
+ }
+
+ /*
+ * Get flag set from "s".
+ *
+ * Returns flag set from AffixData array if AF parameter used (useFlagAliases is true).
+ * In this case "s" is alias for flag set.
+ *
+ * Otherwise returns "s".
+ */
+ static char *
+ getFlags(IspellDict *Conf, char *s)
+ {
+ int curaffix;
+ if (Conf->useFlagAliases)
+ {
+ curaffix = strtol(s, (char **)NULL, 10);
+ if (curaffix && curaffix <= Conf->nAffixData)
+ /*
+ * Do not substract 1 from curaffix
+ * because empty string was added in NIImportOOAffixes
+ */
+ return Conf->AffixData[curaffix];
+ else
+ return VoidString;
+ }
+ else
+ return s;
+ }
+
/*
* Import an affix file that follows MySpell or Hunspell format
*/
***************
*** 615,621 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
char repl[BUFSIZ],
*prepl;
bool isSuffix = false;
! int flag = 0;
char flagflags = 0;
tsearch_readline_state trst;
int scanread = 0;
--- 774,784 ----
char repl[BUFSIZ],
*prepl;
bool isSuffix = false;
! int naffix = 0,
! curaffix = 0;
! int flag = 0,
! flagprev = 0,
! sflaglen = 0;
char flagflags = 0;
tsearch_readline_state trst;
int scanread = 0;
***************
*** 625,630 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
--- 788,795 ----
/* read file to find any flag */
memset(Conf->flagval, 0, sizeof(Conf->flagval));
Conf->usecompound = false;
+ Conf->useFlagAliases = false;
+ Conf->flagMode = FM_CHAR;
if (!tsearch_readline_begin(&trst, filename))
ereport(ERROR,
***************
*** 672,681 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
while (*s && t_isspace(s))
s += pg_mblen(s);
! if (*s && STRNCMP(s, "default") != 0)
! ereport(ERROR,
(errcode(ERRCODE_CONFIG_FILE_ERROR),
! errmsg("Ispell dictionary supports only default flag value")));
}
pfree(recoded);
--- 837,853 ----
while (*s && t_isspace(s))
s += pg_mblen(s);
! if (*s)
! {
! if (STRNCMP(s, "long") == 0)
! Conf->flagMode = FM_LONG;
! else if (STRNCMP(s, "num") == 0)
! Conf->flagMode = FM_NUM;
! else if (STRNCMP(s, "default") != 0)
! ereport(ERROR,
(errcode(ERRCODE_CONFIG_FILE_ERROR),
! errmsg("Ispell dictionary supports only default, long and num flag value")));
! }
}
pfree(recoded);
***************
*** 695,725 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
if (*recoded == '\0' || t_isspace(recoded) || t_iseq(recoded, '#'))
goto nextline;
scanread = sscanf(recoded, scanbuf, type, sflag, find, repl, mask);
if (ptype)
pfree(ptype);
ptype = lowerstr_ctx(Conf, type);
if (scanread < 4 || (STRNCMP(ptype, "sfx") && STRNCMP(ptype, "pfx")))
goto nextline;
! if (scanread == 4)
{
! if (strlen(sflag) != 1)
! goto nextline;
! flag = *sflag;
! isSuffix = (STRNCMP(ptype, "sfx") == 0) ? true : false;
if (t_iseq(find, 'y') || t_iseq(find, 'Y'))
flagflags = FF_CROSSPRODUCT;
else
flagflags = 0;
}
else
{
char *ptr;
int aflg = 0;
! if (strlen(sflag) != 1 || flag != *sflag || flag == 0)
goto nextline;
prepl = lowerstr_ctx(Conf, repl);
/* affix flag */
--- 867,941 ----
if (*recoded == '\0' || t_isspace(recoded) || t_iseq(recoded, '#'))
goto nextline;
+ *find = *repl = *mask = '\0';
scanread = sscanf(recoded, scanbuf, type, sflag, find, repl, mask);
if (ptype)
pfree(ptype);
ptype = lowerstr_ctx(Conf, type);
+
+ /* First try to parse AF parameter (alias compression) */
+ if (STRNCMP(ptype, "af") == 0)
+ {
+ /* First line is the number of aliases */
+ if (!Conf->useFlagAliases)
+ {
+ Conf->useFlagAliases = true;
+ naffix = atoi(sflag);
+ if (naffix == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONFIG_FILE_ERROR),
+ errmsg("invalid number of flag vector aliases")));
+
+ /* Also reserve place for empty flag set */
+ naffix++;
+
+ Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
+ Conf->lenAffixData = Conf->nAffixData = naffix;
+
+ /* Add empty flag set into AffixData */
+ Conf->AffixData[curaffix] = VoidString;
+ curaffix++;
+ }
+ /* Other lines is aliases */
+ else
+ {
+ if (curaffix < naffix)
+ {
+ Conf->AffixData[curaffix] = cpstrdup(Conf, sflag);
+ curaffix++;
+ }
+ }
+ goto nextline;
+ }
+ /* Else try to parse prefixes and suffixes */
if (scanread < 4 || (STRNCMP(ptype, "sfx") && STRNCMP(ptype, "pfx")))
goto nextline;
! sflaglen = strlen(sflag);
! if (sflaglen == 0
! || (sflaglen > 1 && Conf->flagMode == FM_CHAR)
! || (sflaglen > 2 && Conf->flagMode == FM_LONG))
! goto nextline;
! flag = decodeFlag(Conf, sflag, (char **)NULL);
!
! /* Affix header */
! if (flag != flagprev)
{
! flagprev = flag;
! isSuffix = (STRNCMP(ptype, "sfx") == 0);
if (t_iseq(find, 'y') || t_iseq(find, 'Y'))
flagflags = FF_CROSSPRODUCT;
else
flagflags = 0;
}
+ /* Affix fields */
else
{
char *ptr;
int aflg = 0;
! if (flag == 0)
goto nextline;
prepl = lowerstr_ctx(Conf, repl);
/* affix flag */
***************
*** 727,737 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
{
*ptr = '\0';
ptr = repl + (ptr - prepl) + 1;
! while (*ptr)
! {
! aflg |= Conf->flagval[*(unsigned char *) ptr];
! ptr++;
! }
}
pfind = lowerstr_ctx(Conf, find);
pmask = lowerstr_ctx(Conf, mask);
--- 943,949 ----
{
*ptr = '\0';
ptr = repl + (ptr - prepl) + 1;
! aflg |= getFlagValues(Conf, getFlags(Conf, ptr));
}
pfind = lowerstr_ctx(Conf, find);
pmask = lowerstr_ctx(Conf, mask);
***************
*** 789,794 **** NIImportAffixes(IspellDict *Conf, const char *filename)
--- 1001,1008 ----
memset(Conf->flagval, 0, sizeof(Conf->flagval));
Conf->usecompound = false;
+ Conf->useFlagAliases = false;
+ Conf->flagMode = FM_CHAR;
while ((recoded = tsearch_readline(&trst)) != NULL)
{
***************
*** 931,946 **** MergeAffix(IspellDict *Conf, int a1, int a2)
static uint32
makeCompoundFlags(IspellDict *Conf, int affix)
{
! uint32 flag = 0;
! char *str = Conf->AffixData[affix];
!
! while (str && *str)
! {
! flag |= Conf->flagval[*(unsigned char *) str];
! str++;
! }
!
! return (flag & FF_DICTFLAGMASK);
}
static SPNode *
--- 1145,1152 ----
static uint32
makeCompoundFlags(IspellDict *Conf, int affix)
{
! char *str = Conf->AffixData[affix];
! return (getFlagValues(Conf, str) & FF_DICTFLAGMASK);
}
static SPNode *
***************
*** 954,960 **** mkSPNode(IspellDict *Conf, int low, int high, int level)
int lownew = low;
for (i = low; i < high; i++)
! if (Conf->Spell[i]->p.d.len > level && lastchar != Conf->Spell[i]->word[level])
{
nchar++;
lastchar = Conf->Spell[i]->word[level];
--- 1160,1166 ----
int lownew = low;
for (i = low; i < high; i++)
! if (Conf->Spell[i]->d.len > level && lastchar != Conf->Spell[i]->word[level])
{
nchar++;
lastchar = Conf->Spell[i]->word[level];
***************
*** 969,975 **** mkSPNode(IspellDict *Conf, int low, int high, int level)
lastchar = '\0';
for (i = low; i < high; i++)
! if (Conf->Spell[i]->p.d.len > level)
{
if (lastchar != Conf->Spell[i]->word[level])
{
--- 1175,1181 ----
lastchar = '\0';
for (i = low; i < high; i++)
! if (Conf->Spell[i]->d.len > level)
{
if (lastchar != Conf->Spell[i]->word[level])
{
***************
*** 982,992 **** mkSPNode(IspellDict *Conf, int low, int high, int level)
lastchar = Conf->Spell[i]->word[level];
}
data->val = ((uint8 *) (Conf->Spell[i]->word))[level];
! if (Conf->Spell[i]->p.d.len == level + 1)
{
bool clearCompoundOnly = false;
! if (data->isword && data->affix != Conf->Spell[i]->p.d.affix)
{
/*
* MergeAffix called a few times. If one of word is
--- 1188,1198 ----
lastchar = Conf->Spell[i]->word[level];
}
data->val = ((uint8 *) (Conf->Spell[i]->word))[level];
! if (Conf->Spell[i]->d.len == level + 1)
{
bool clearCompoundOnly = false;
! if (data->isword && data->affix != Conf->Spell[i]->d.affix)
{
/*
* MergeAffix called a few times. If one of word is
***************
*** 995,1006 **** mkSPNode(IspellDict *Conf, int low, int high, int level)
*/
clearCompoundOnly = (FF_COMPOUNDONLY & data->compoundflag
! & makeCompoundFlags(Conf, Conf->Spell[i]->p.d.affix))
? false : true;
! data->affix = MergeAffix(Conf, data->affix, Conf->Spell[i]->p.d.affix);
}
else
! data->affix = Conf->Spell[i]->p.d.affix;
data->isword = 1;
data->compoundflag = makeCompoundFlags(Conf, data->affix);
--- 1201,1212 ----
*/
clearCompoundOnly = (FF_COMPOUNDONLY & data->compoundflag
! & makeCompoundFlags(Conf, Conf->Spell[i]->d.affix))
? false : true;
! data->affix = MergeAffix(Conf, data->affix, Conf->Spell[i]->d.affix);
}
else
! data->affix = Conf->Spell[i]->d.affix;
data->isword = 1;
data->compoundflag = makeCompoundFlags(Conf, data->affix);
***************
*** 1032,1070 **** NISortDictionary(IspellDict *Conf)
/* compress affixes */
! /* Count the number of different flags used in the dictionary */
!
! qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspellaffix);
!
! naffix = 0;
! for (i = 0; i < Conf->nspell; i++)
! {
! if (i == 0 || strncmp(Conf->Spell[i]->p.flag, Conf->Spell[i - 1]->p.flag, MAXFLAGLEN))
! naffix++;
! }
!
! /*
! * Fill in Conf->AffixData with the affixes that were used in the
! * dictionary. Replace textual flag-field of Conf->Spell entries with
! * indexes into Conf->AffixData array.
*/
! Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
!
! curaffix = -1;
! for (i = 0; i < Conf->nspell; i++)
{
! if (i == 0 || strncmp(Conf->Spell[i]->p.flag, Conf->AffixData[curaffix], MAXFLAGLEN))
{
! curaffix++;
! Assert(curaffix < naffix);
! Conf->AffixData[curaffix] = cpstrdup(Conf, Conf->Spell[i]->p.flag);
}
-
- Conf->Spell[i]->p.d.affix = curaffix;
- Conf->Spell[i]->p.d.len = strlen(Conf->Spell[i]->word);
}
! Conf->lenAffixData = Conf->nAffixData = naffix;
qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspell);
Conf->Dictionary = mkSPNode(Conf, 0, Conf->nspell, 0);
--- 1238,1294 ----
/* compress affixes */
! /* If we use flag aliases then we need to use Conf->AffixData filled in NIImportOOAffixes.
! * If Conf->Spell[i]->flag is empty, then get empty value of Conf->AffixData (0 index)
*/
! if (Conf->useFlagAliases)
{
! for (i = 0; i < Conf->nspell; i++)
{
! curaffix = strtol(Conf->Spell[i]->flag, (char **)NULL, 10);
! if (curaffix && curaffix <= Conf->nAffixData)
! Conf->Spell[i]->d.affix = curaffix;
! else
! Conf->Spell[i]->d.affix = 0;
! Conf->Spell[i]->d.len = strlen(Conf->Spell[i]->word);
}
}
+ /* Otherwise fill Conf->AffixData here */
+ else
+ {
+ /* Count the number of different flags used in the dictionary */
+ qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspellaffix);
+
+ naffix = 0;
+ for (i = 0; i < Conf->nspell; i++)
+ {
+ if (i == 0 || strcmp(Conf->Spell[i]->flag, Conf->Spell[i - 1]->flag))
+ naffix++;
+ }
! /*
! * Fill in Conf->AffixData with the affixes that were used in the
! * dictionary. Replace textual flag-field of Conf->Spell entries with
! * indexes into Conf->AffixData array.
! */
! Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
!
! curaffix = -1;
! for (i = 0; i < Conf->nspell; i++)
! {
! if (i == 0 || strcmp(Conf->Spell[i]->flag, Conf->AffixData[curaffix]))
! {
! curaffix++;
! Assert(curaffix < naffix);
! Conf->AffixData[curaffix] = cpstrdup(Conf, Conf->Spell[i]->flag);
! }
!
! Conf->Spell[i]->d.affix = curaffix;
! Conf->Spell[i]->d.len = strlen(Conf->Spell[i]->word);
! }
!
! Conf->lenAffixData = Conf->nAffixData = naffix;
! }
qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspell);
Conf->Dictionary = mkSPNode(Conf, 0, Conf->nspell, 0);
***************
*** 1185,1196 **** mkVoidAffix(IspellDict *Conf, bool issuffix, int startsuffix)
}
static bool
! isAffixInUse(IspellDict *Conf, char flag)
{
int i;
for (i = 0; i < Conf->nAffixData; i++)
! if (strchr(Conf->AffixData[i], flag) != NULL)
return true;
return false;
--- 1409,1420 ----
}
static bool
! isAffixInUse(IspellDict *Conf, int flag)
{
int i;
for (i = 0; i < Conf->nAffixData; i++)
! if (isAffixFlagInUse(Conf, i, flag))
return true;
return false;
***************
*** 1219,1225 **** NISortAffixes(IspellDict *Conf)
firstsuffix = i;
if ((Affix->flagflags & FF_COMPOUNDFLAG) && Affix->replen > 0 &&
! isAffixInUse(Conf, (char) Affix->flag))
{
if (ptr == Conf->CompoundAffix ||
ptr->issuffix != (ptr - 1)->issuffix ||
--- 1443,1449 ----
firstsuffix = i;
if ((Affix->flagflags & FF_COMPOUNDFLAG) && Affix->replen > 0 &&
! isAffixInUse(Conf, Affix->flag))
{
if (ptr == Conf->CompoundAffix ||
ptr->issuffix != (ptr - 1)->issuffix ||
***************
*** 1230,1236 **** NISortAffixes(IspellDict *Conf)
/* leave only unique and minimals suffixes */
ptr->affix = Affix->repl;
ptr->len = Affix->replen;
! ptr->issuffix = (Affix->type == FF_SUFFIX) ? true : false;
ptr++;
}
}
--- 1454,1460 ----
/* leave only unique and minimals suffixes */
ptr->affix = Affix->repl;
ptr->len = Affix->replen;
! ptr->issuffix = (Affix->type == FF_SUFFIX);
ptr++;
}
}
***************
*** 1685,1691 **** SplitToVariants(IspellDict *Conf, SPNode *snode, SplitVar *orig, char *word, int
if (StopLow < StopHigh)
{
! if (level == FF_COMPOUNDBEGIN)
compoundflag = FF_COMPOUNDBEGIN;
else if (level == wordlen - 1)
compoundflag = FF_COMPOUNDLAST;
--- 1909,1915 ----
if (StopLow < StopHigh)
{
! if (startpos == 0)
compoundflag = FF_COMPOUNDBEGIN;
else if (level == wordlen - 1)
compoundflag = FF_COMPOUNDLAST;
*** a/src/include/tsearch/dicts/spell.h
--- b/src/include/tsearch/dicts/spell.h
***************
*** 57,75 **** typedef struct SPNode
typedef struct spell_struct
{
! union
{
! /*
! * flag is filled in by NIImportDictionary. After NISortDictionary, d
! * is valid and flag is invalid.
! */
! char flag[MAXFLAGLEN];
! struct
! {
! int affix;
! int len;
! } d;
! } p;
char word[FLEXIBLE_ARRAY_MEMBER];
} SPELL;
--- 57,72 ----
typedef struct spell_struct
{
! struct
{
! int affix;
! int len;
! } d;
! /*
! * flag is filled in by NIImportDictionary. After NISortDictionary, d
! * is used instead of flag.
! */
! char *flag;
char word[FLEXIBLE_ARRAY_MEMBER];
} SPELL;
***************
*** 77,83 **** typedef struct spell_struct
typedef struct aff_struct
{
! uint32 flag:8,
type:1,
flagflags:7,
issimple:1,
--- 74,80 ----
typedef struct aff_struct
{
! uint32 flag:16,
type:1,
flagflags:7,
issimple:1,
***************
*** 132,137 **** typedef struct
--- 129,141 ----
bool issuffix;
} CMPDAffix;
+ typedef enum
+ {
+ FM_CHAR,
+ FM_LONG,
+ FM_NUM
+ } FlagMode;
+
typedef struct
{
int maffixes;
***************
*** 145,155 **** typedef struct
char **AffixData;
int lenAffixData;
int nAffixData;
CMPDAffix *CompoundAffix;
! unsigned char flagval[256];
bool usecompound;
/*
* Remaining fields are only used during dictionary construction; they are
--- 149,161 ----
char **AffixData;
int lenAffixData;
int nAffixData;
+ bool useFlagAliases;
CMPDAffix *CompoundAffix;
! unsigned char flagval[65000];
bool usecompound;
+ FlagMode flagMode;
/*
* Remaining fields are only used during dictionary construction; they are
I duplicate the patch here.
It's a very good thing to update the dictionaries to support modern versions. And
thank you for improving the documentation. I'm also impressed by the long description
in the spell.c header.
Some notes about the code:
1. struct SPELL. Why did you remove union p? You left the comment
about the d struct being used instead of the flag field and, as far as I can see,
that comment is still correct. Removing the union increases the size of the SPELL structure.
2. struct AFFIX. I agree with Alvaro that the sum of the bit-field sizes should be
less than or equal to the size of an integer. Otherwise, I suppose, we can get
undefined behavior. Please split the bit fields into two integers.
3. unsigned char flagval[65000];
Is it forbidden to use the number 65555? In any case, decodeFlag() doesn't
restrict its return value. I suggest enlarging the array to 1<<16 and adding a limit
on the return value of decodeFlag().
4. I'd like to see a short comment describing at least the new functions.
5. Please add tests for the new code.
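To illustrate point 3, here is a minimal sketch of what a range-checked decoder could look like. It is not the code from the patch: decode_flag_checked and FLAG_ARRAY_SIZE are hypothetical names, and only the three FlagMode values are taken from the patch. The sketch assumes that a flag value is used directly as an index into an array of 1<<16 elements, and that zero is not a valid "num" flag.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical constant: one slot for every possible 16-bit flag value. */
#define FLAG_ARRAY_SIZE (1 << 16)

/* Same three modes as the FlagMode enum in the patch. */
typedef enum
{
    FM_CHAR,
    FM_LONG,
    FM_NUM
} FlagMode;

/*
 * Hypothetical helper: decode the flag at the start of "s" into *out.
 * Returns false instead of producing a value that cannot index an
 * array of FLAG_ARRAY_SIZE elements.
 */
static bool
decode_flag_checked(FlagMode mode, const char *s, unsigned int *out)
{
    const unsigned char *u = (const unsigned char *) s;
    char       *end;
    long        v;

    switch (mode)
    {
        case FM_LONG:
            /* Two bytes are required; reading them as unsigned char
             * avoids sign extension of 8-bit characters. */
            if (u[0] == '\0' || u[1] == '\0')
                return false;
            *out = ((unsigned int) u[0] << 8) | u[1];
            return true;
        case FM_NUM:
            errno = 0;
            v = strtol(s, &end, 10);
            /* Reject non-numbers, overflow, and anything outside the array. */
            if (end == s || errno == ERANGE || v <= 0 || v >= FLAG_ARRAY_SIZE)
                return false;
            *out = (unsigned int) v;
            return true;
        default:
            /* FM_CHAR: a single byte. */
            if (u[0] == '\0')
                return false;
            *out = u[0];
            return true;
    }
}
```

With a check of this kind, a "num" flag such as 65555 is reported as invalid by the caller rather than being used as an index past the end of the array.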
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Thank you for the review.
On 10.02.2016 19:46, Teodor Sigaev wrote:
I duplicate the patch here.
It's a very good thing to update the dictionaries to support modern versions.
And thank you for improving the documentation. I'm also impressed by the long
description in the spell.c header.
Some notes about the code:
1. struct SPELL. Why did you remove union p? You left the comment
about the d struct being used instead of the flag field and, as far as I can see,
that comment is still correct. Removing the union increases the size of the SPELL structure.
I will fix it. I had misunderstood Alvaro's comment about it.
2. struct AFFIX. I agree with Alvaro that the sum of the bit-field sizes
should be less than or equal to the size of an integer. Otherwise, I suppose,
we can get undefined behavior. Please split the bit fields into two integers.
I will fix it. I had misunderstood that too.
3. unsigned char flagval[65000];
Is it forbidden to use the number 65555? In any case, decodeFlag() doesn't
restrict its return value. I suggest enlarging the array to 1<<16 and adding a limit
on the return value of decodeFlag().
I think it can be done.
4. I'd like to see a short comment describing at least the new functions.
There are now more comments in spell.c. I wanted to send the fixed patch
only after adding all the comments I want to add, but I can send the patch now.
I will also merge this commit:
/messages/by-id/E1aTf9o-0001ga-LG@gemulon.postgresql.org
5. Please add tests for the new code.
I will add them.
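To make point 1 concrete, here is a rough sketch comparing the two layouts of the word entry. The type names SpellSeparate and SpellShared and the helper spell_header_savings are hypothetical and exist only for this illustration; the member names follow spell.h, with the flag stored as a pointer as in the v5 patch.

```c
#include <stddef.h>

/* Layout as in the v5 patch: d and the flag pointer stored side by side. */
typedef struct
{
    struct
    {
        int         affix;
        int         len;
    }           d;
    char       *flag;
    char        word[];
} SpellSeparate;

/* Layout with the union kept: flag and d share the same storage. */
typedef struct
{
    union
    {
        char       *flag;   /* filled in by NIImportDictionary */
        struct
        {
            int         affix;
            int         len;
        }           d;      /* valid after NISortDictionary */
    }           p;
    char        word[];
} SpellShared;

/* Hypothetical helper: header bytes saved for every dictionary word. */
static size_t
spell_header_savings(void)
{
    return sizeof(SpellSeparate) - sizeof(SpellShared);
}
```

The saving is per word, so it is multiplied by the number of entries in the .dict file while the dictionary is being compiled.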
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
I attached a new version of the patch.
On 10.02.2016 19:46, Teodor Sigaev wrote:
I duplicate the patch here.
It's a very good thing to update the dictionaries to support modern versions.
And thank you for improving the documentation. I'm also impressed by the long
description in the spell.c header.
Some notes about the code:
1. struct SPELL. Why did you remove union p? You left the comment
about the d struct being used instead of the flag field and, as far as I can see,
that comment is still correct. Removing the union increases the size of the SPELL structure.
Fixed.
2. struct AFFIX. I agree with Alvaro that the sum of the bit-field sizes
should be less than or equal to the size of an integer. Otherwise, I suppose,
we can get undefined behavior. Please split the bit fields into two integers.
Fixed.
3. unsigned char flagval[65000];
Is it forbidden to use the number 65555? In any case, decodeFlag() doesn't
restrict its return value. I suggest enlarging the array to 1<<16 and adding a limit
on the return value of decodeFlag().
The flagval array was enlarged. I added verification of the return value of
DecodeFlag() for each FLAG mode (FM_LONG, FM_NUM and FM_CHAR).
4. I'd like to see a short comment describing at least the new functions.
I added comments describing the new functions and the existing functions for
loading dictionaries into PostgreSQL. This patch adds new functions and
modifies the functions that are used for loading dictionaries.
At the moment, the comments do not describe the functions used for word
normalization, but I can add more comments.
5. Please add tests for the new code.
Added tests. The old sample dictionary files were moved to the folder
"dicts". New sample dictionary files were added:
- hunspell_sample_long.affix
- hunspell_sample_long.dict
- hunspell_sample_num.affix
- hunspell_sample_num.dict
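For point 2, the following is only a sketch of one way to split the bit fields so that each group fits into a single integer; it is not necessarily the layout used in the attached patch. AffixBits and affix_set_flag are hypothetical names. The widths of flag, type, flagflags and issimple are the ones shown in the v5 diff, while the widths of isregis and replen are assumptions.

```c
#include <stdbool.h>

/*
 * Hypothetical layout. The zero-width bit field closes the first
 * storage unit, so the second group starts in a new integer.
 */
typedef struct
{
    /* first integer: the widened flag on its own (16 bits used) */
    unsigned int flag:16;
    unsigned int:0;
    /* second integer: 1 + 7 + 1 + 1 + 14 = 24 bits used */
    unsigned int type:1,
                flagflags:7,
                issimple:1,
                isregis:1,
                replen:14;
} AffixBits;

/*
 * Hypothetical helper: store a decoded flag, refusing values that the
 * 16-bit field would silently truncate.
 */
static bool
affix_set_flag(AffixBits *a, unsigned int flag)
{
    if (flag > 0xFFFF)
        return false;
    a->flag = flag;
    return true;
}
```

Keeping the flag in its own integer also leaves room to widen it later without touching the other fields.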
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachments:
hunspell_dict_v6.patch (text/x-patch)
*** a/doc/src/sgml/textsearch.sgml
--- b/doc/src/sgml/textsearch.sgml
***************
*** 2615,2632 **** SELECT plainto_tsquery('supernova star');
</para>
<para>
! To create an <application>Ispell</> dictionary, use the built-in
! <literal>ispell</literal> template and specify several parameters:
</para>
!
<programlisting>
! CREATE TEXT SEARCH DICTIONARY english_ispell (
TEMPLATE = ispell,
! DictFile = english,
! AffFile = english,
! StopWords = english
! );
</programlisting>
<para>
Here, <literal>DictFile</>, <literal>AffFile</>, and <literal>StopWords</>
--- 2615,2655 ----
</para>
<para>
! To create an <application>Ispell</> dictionary perform these steps:
</para>
! <itemizedlist spacing="compact" mark="bullet">
! <listitem>
! <para>
! download dictionary configuration files. <productname>OpenOffice</>
! extension files have the <filename>.oxt</> extension. It is necessary
! to extract <filename>.aff</> and <filename>.dic</> files, change extensions
! to <filename>.affix</> and <filename>.dict</>. For some dictionary
! files it is also needed to convert characters to the UTF-8 encoding
! with commands (for example, for norwegian language dictionary):
<programlisting>
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
! </programlisting>
! </para>
! </listitem>
! <listitem>
! <para>
! copy files to the <filename>$SHAREDIR/tsearch_data</> directory
! </para>
! </listitem>
! <listitem>
! <para>
! load files into PostgreSQL with the following command:
! <programlisting>
! CREATE TEXT SEARCH DICTIONARY english_hunspell (
TEMPLATE = ispell,
! DictFile = en_us,
! AffFile = en_us,
! Stopwords = english);
</programlisting>
+ </para>
+ </listitem>
+ </itemizedlist>
<para>
Here, <literal>DictFile</>, <literal>AffFile</>, and <literal>StopWords</>
***************
*** 2643,2648 **** CREATE TEXT SEARCH DICTIONARY english_ispell (
--- 2666,2720 ----
</para>
<para>
+ The <filename>.affix</> file of <application>Ispell</> has the following structure:
+ <programlisting>
+ prefixes
+ flag *A:
+ . > RE # As in enter > reenter
+ suffixes
+ flag T:
+ E > ST # As in late > latest
+ [^AEIOU]Y > -Y,IEST # As in dirty > dirtiest
+ [AEIOU]Y > EST # As in gray > grayest
+ [^EY] > EST # As in small > smallest
+ </programlisting>
+ </para>
+ <para>
+ And the <filename>.dict</> file has the following structure:
+ <programlisting>
+ lapse/ADGRS
+ lard/DGRS
+ large/PRTY
+ lark/MRS
+ </programlisting>
+ </para>
+
+ <para>
+ Format of the <filename>.dict</> file is:
+ <programlisting>
+ basic_form/affix_class_name
+ </programlisting>
+ </para>
+
+ <para>
+ In the <filename>.affix</> file every affix flag is described in the
+ following format:
+ <programlisting>
+ condition > [-stripping_letters,] adding_affix
+ </programlisting>
+ </para>
+
+ <para>
+ Here, condition has a format similar to the format of regular expressions.
+ It can use groupings <literal>[...]</> and <literal>[^...]</>.
+ For example, <literal>[AEIOU]Y</> means that the last letter of the word
+ is <literal>"y"</> and the penultimate letter is <literal>"a"</>,
+ <literal>"e"</>, <literal>"i"</>, <literal>"o"</> or <literal>"u"</>.
+ <literal>[^EY]</> means that the last letter is neither <literal>"e"</>
+ nor <literal>"y"</>.
+ </para>
+
+ <para>
Ispell dictionaries support splitting compound words;
a useful feature.
Notice that the affix file should specify a special flag using the
***************
*** 2663,2668 **** SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
--- 2735,2796 ----
</programlisting>
</para>
+ <para>
+ <application>MySpell</> is very similar to <application>Hunspell</>.
+ The <filename>.affix</> file of <application>Hunspell</> has the following structure:
+ <programlisting>
+ PFX A Y 1
+ PFX A 0 re .
+ SFX T N 4
+ SFX T 0 st e
+ SFX T y iest [^aeiou]y
+ SFX T 0 est [aeiou]y
+ SFX T 0 est [^ey]
+ </programlisting>
+ </para>
+
+ <para>
+ The first line of an affix class is the header. Fields of an affix rule are listed after the header:
+ </para>
+ <itemizedlist spacing="compact" mark="bullet">
+ <listitem>
+ <para>
+ parameter name (PFX or SFX)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ flag (name of the affix class)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ characters to strip from the beginning (for a prefix) or the end (for a suffix) of the word
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ affix to add
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ condition that has a format similar to the format of regular expressions.
+ </para>
+ </listitem>
+ </itemizedlist>
+
+ <para>
+ The <filename>.dict</> file looks like the <filename>.dict</> file of
+ <application>Ispell</>:
+ <programlisting>
+ larder/M
+ lardy/RT
+ large/RSPMYT
+ largehearted
+ </programlisting>
+ </para>
+
<note>
<para>
<application>MySpell</> does not support compound words.
*** a/src/backend/tsearch/Makefile
--- b/src/backend/tsearch/Makefile
***************
*** 13,20 **** include $(top_builddir)/src/Makefile.global
DICTDIR=tsearch_data
! DICTFILES=synonym_sample.syn thesaurus_sample.ths hunspell_sample.affix \
! ispell_sample.affix ispell_sample.dict
OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \
dict_simple.o dict_synonym.o dict_thesaurus.o \
--- 13,22 ----
DICTDIR=tsearch_data
! DICTFILES=dicts/synonym_sample.syn dicts/thesaurus_sample.ths \
! dicts/hunspell_sample.affix dicts/ispell_sample.affix dicts/ispell_sample.dict \
! dicts/hunspell_sample_long.affix dicts/hunspell_sample_long.dict \
! dicts/hunspell_sample_num.affix dicts/hunspell_sample_num.dict
OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \
dict_simple.o dict_synonym.o dict_thesaurus.o \
*** /dev/null
--- b/src/backend/tsearch/dicts/hunspell_sample.affix
***************
*** 0 ****
--- 1,24 ----
+ COMPOUNDFLAG Z
+ ONLYINCOMPOUND L
+
+ PFX B Y 1
+ PFX B 0 re .
+
+ PFX U N 1
+ PFX U 0 un .
+
+ SFX J Y 1
+ SFX J 0 INGS [^E]
+
+ SFX G Y 1
+ SFX G 0 ING [^E]
+
+ SFX S Y 1
+ SFX S 0 S [^SXZHY]
+
+ SFX A Y 1
+ SFX A Y IES [^AEIOU]Y
+
+ SFX \ N 1
+ SFX \ 0 Y/L [^Y]
+
*** /dev/null
--- b/src/backend/tsearch/dicts/hunspell_sample_long.affix
***************
*** 0 ****
--- 1,35 ----
+ FLAG long
+
+ AF 7
+ AF cZ #1
+ AF cL #2
+ AF sGsJpUsS #3
+ AF sSpB #4
+ AF cZsS #5
+ AF sScZs\ #6
+ AF sA #7
+
+ COMPOUNDFLAG cZ
+ ONLYINCOMPOUND cL
+
+ PFX pB Y 1
+ PFX pB 0 re .
+
+ PFX pU N 1
+ PFX pU 0 un .
+
+ SFX sJ Y 1
+ SFX sJ 0 INGS [^E]
+
+ SFX sG Y 1
+ SFX sG 0 ING [^E]
+
+ SFX sS Y 1
+ SFX sS 0 S [^SXZHY]
+
+ SFX sA Y 1
+ SFX sA Y IES [^AEIOU]Y
+
+ SFX s\ N 1
+ SFX s\ 0 Y/2 [^Y]
+
*** /dev/null
--- b/src/backend/tsearch/dicts/hunspell_sample_long.dict
***************
*** 0 ****
--- 1,8 ----
+ book/3
+ booking/4
+ footballklubber
+ foot/5
+ football/1
+ ball/6
+ klubber/1
+ sky/7
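The interplay of FLAG long and AF in the two sample files above can be sketched in standalone C. This is only an illustration, not code from the patch: resolve_alias() and split_long_flags() are hypothetical helper names, the alias table is copied by hand from hunspell_sample_long.affix, and the bounds check is the sketch's own choice.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Alias table copied from hunspell_sample_long.affix. Index 0 is the
 * empty flag set, so alias N in "word/N" maps directly to entry N. */
static const char *const sample_aliases[] = {
    "", "cZ", "cL", "sGsJpUsS", "sSpB", "cZsS", "sScZs\\", "sA"
};
#define SAMPLE_NALIASES 8

/* Hypothetical helper: resolve the text after '/' in a .dict entry.
 * Unknown or non-numeric aliases resolve to the empty flag set. */
static const char *
resolve_alias(const char *field)
{
    long idx = strtol(field, NULL, 10);

    if (idx <= 0 || idx >= SAMPLE_NALIASES)
        return "";
    return sample_aliases[idx];
}

/* Hypothetical helper: split a FLAG long string into two-character flags.
 * Stores at most max flags in out (each a NUL-terminated 3-byte string)
 * and returns the number stored. A trailing odd character is ignored. */
static size_t
split_long_flags(const char *flags, char out[][3], size_t max)
{
    size_t n = 0;

    assert(flags != NULL && out != NULL);
    while (flags[0] != '\0' && flags[1] != '\0' && n < max)
    {
        out[n][0] = flags[0];
        out[n][1] = flags[1];
        out[n][2] = '\0';
        flags += 2;
        n++;
    }
    return n;
}
```

With these helpers the entry book/3 resolves to the alias sGsJpUsS, which splits into the four flags sG, sJ, pU and sS.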
*** /dev/null
--- b/src/backend/tsearch/dicts/hunspell_sample_num.affix
***************
*** 0 ****
--- 1,26 ----
+ FLAG num
+
+ COMPOUNDFLAG 101
+ ONLYINCOMPOUND 102
+
+ PFX 201 Y 1
+ PFX 201 0 re .
+
+ PFX 202 N 1
+ PFX 202 0 un .
+
+ SFX 301 Y 1
+ SFX 301 0 INGS [^E]
+
+ SFX 302 Y 1
+ SFX 302 0 ING [^E]
+
+ SFX 303 Y 1
+ SFX 303 0 S [^SXZHY]
+
+ SFX 304 Y 1
+ SFX 304 Y IES [^AEIOU]Y
+
+ SFX 305 N 1
+ SFX 305 0 Y/102 [^Y]
+
*** /dev/null
--- b/src/backend/tsearch/dicts/hunspell_sample_num.dict
***************
*** 0 ****
--- 1,8 ----
+ book/302,301,202,303
+ booking/303,201
+ footballklubber
+ foot/101,303
+ football/101
+ ball/303,101,305
+ klubber/101
+ sky/304
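For FLAG num, the flag set after '/' is a comma-separated list of decimal numbers. A minimal standalone sketch of parsing such a list follows. It is not the patch's DecodeFlag(): parse_num_flags() and SKETCH_FLAGNUM_MAX are hypothetical names, and the 1..65000 range is taken from the proposal text.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define SKETCH_FLAGNUM_MAX 65000    /* upper bound quoted in the proposal */

/* Hypothetical helper: parse a list such as "302,301,202,303" into out[].
 * Returns the number of flags parsed, or -1 on malformed input. */
static int
parse_num_flags(const char *s, unsigned short *out, int max)
{
    int n = 0;

    assert(s != NULL && out != NULL);
    while (*s != '\0')
    {
        char *end;
        long  v;

        errno = 0;
        v = strtol(s, &end, 10);
        if (end == s || errno != 0 || v < 1 || v > SKETCH_FLAGNUM_MAX)
            return -1;          /* not a number, or outside 1..65000 */
        if (n >= max)
            return -1;          /* output array too small */
        out[n++] = (unsigned short) v;

        if (*end == ',')
            end++;              /* skip the separator */
        else if (*end != '\0')
            return -1;          /* unexpected character after the number */
        s = end;
    }
    return n;
}
```

The range is checked on the long value before narrowing to unsigned short, so an out-of-range number is rejected rather than wrapped. For the entry book/302,301,202,303 the sketch yields the four flags 302, 301, 202 and 303.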
*** /dev/null
--- b/src/backend/tsearch/dicts/ispell_sample.affix
***************
*** 0 ****
--- 1,26 ----
+ compoundwords controlled Z
+
+ prefixes
+
+ flag *B:
+ . > RE # As in enter > reenter
+
+ flag U:
+ . > UN # As in natural > unnatural
+
+ suffixes
+
+ flag *J:
+ [^E] > INGS # As in cross > crossings
+
+ flag *G:
+ [^E] > ING # As in cross > crossing
+
+ flag *S:
+ [^SXZHY] > S # As in bat > bats
+
+ flag *A:
+ [^AEIOU]Y > -Y,IES # As in imply > implies
+
+ flag ~\\:
+ [^Y] > Y #~ advarsel > advarsely-
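To make the Ispell rule syntax above concrete, here is a small standalone sketch that applies one suffix rule of the form "condition > -strip,add" in the forward direction, as the rule comments read (imply > implies). The dictionary code uses the rules to get from an inflected form back to the basic form, so this is an illustration of the format only. match_elem(), cond_len() and apply_suffix_rule() are hypothetical names, and the sketch does not verify that the stripped letters actually match the end of the word.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Match one condition element at *cond against character c, ignoring case.
 * Handles a literal character, '.', "[...]" and "[^...]".
 * Advances *cond past the element. */
static int
match_elem(const char **cond, char c)
{
    const char *p = *cond;
    int         uc = toupper((unsigned char) c);
    int         negate = 0;
    int         found = 0;

    if (*p != '[')
    {
        *cond = p + 1;
        return *p == '.' || toupper((unsigned char) *p) == uc;
    }
    p++;
    if (*p == '^')
    {
        negate = 1;
        p++;
    }
    for (; *p != '\0' && *p != ']'; p++)
        if (toupper((unsigned char) *p) == uc)
            found = 1;
    *cond = (*p == ']') ? p + 1 : p;
    return negate ? !found : found;
}

/* Number of word characters a condition covers ("[^AEIOU]Y" covers 2). */
static size_t
cond_len(const char *cond)
{
    size_t n = 0;

    while (*cond != '\0')
    {
        if (*cond == '[')
        {
            while (*cond != '\0' && *cond != ']')
                cond++;
            if (*cond == ']')
                cond++;
        }
        else
            cond++;
        n++;
    }
    return n;
}

/* Apply "cond > -strip,add" to word; strip is "" when nothing is stripped.
 * Returns 1 and writes the derived form to out, or 0 if the rule does not
 * apply or out is too small. */
static int
apply_suffix_rule(const char *word, const char *cond, const char *strip,
                  const char *add, char *out, size_t outsz)
{
    size_t wlen = strlen(word);
    size_t clen = cond_len(cond);
    size_t slen = strlen(strip);
    size_t i;

    assert(out != NULL);
    if (clen > wlen || slen > wlen)
        return 0;
    for (i = 0; i < clen; i++)
        if (!match_elem(&cond, word[wlen - clen + i]))
            return 0;
    if (wlen - slen + strlen(add) + 1 > outsz)
        return 0;
    memcpy(out, word, wlen - slen);
    strcpy(out + (wlen - slen), add);
    return 1;
}
```

Applied to the rule for flag A above, "imply" with condition [^AEIOU]Y, strip Y and add ies gives "implies", while "play" is rejected because its penultimate letter is a vowel.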
*** /dev/null
--- b/src/backend/tsearch/dicts/ispell_sample.dict
***************
*** 0 ****
--- 1,8 ----
+ book/GJUS
+ booking/SB
+ footballklubber
+ foot/ZS
+ football/Z
+ ball/SZ\
+ klubber/Z
+ sky/A
*** /dev/null
--- b/src/backend/tsearch/dicts/synonym_sample.syn
***************
*** 0 ****
--- 1,5 ----
+ postgres pgsql
+ postgresql pgsql
+ postgre pgsql
+ gogle googl
+ indices index*
*** /dev/null
--- b/src/backend/tsearch/dicts/thesaurus_sample.ths
***************
*** 0 ****
--- 1,17 ----
+ #
+ # Thesaurus config file. Character ':' separates string from replacement, eg
+ # sample-words : substitute-words
+ #
+ # Any substitute-word can be marked by preceding '*' character,
+ # which means do not lexize this word
+ # Docs: http://www.sai.msu.su/~megera/oddmuse/index.cgi/Thesaurus_dictionary
+
+ one two three : *123
+ one two : *12
+ one : *1
+ two : *2
+
+ supernovae stars : *sn
+ supernovae : *sn
+ booking tickets : order invitation cards
+ booking ? tickets : order invitation Cards
*** a/src/backend/tsearch/hunspell_sample.affix
--- /dev/null
***************
*** 1,24 ****
- COMPOUNDFLAG Z
- ONLYINCOMPOUND L
-
- PFX B Y 1
- PFX B 0 re .
-
- PFX U N 1
- PFX U 0 un .
-
- SFX J Y 1
- SFX J 0 INGS [^E]
-
- SFX G Y 1
- SFX G 0 ING [^E]
-
- SFX S Y 1
- SFX S 0 S [^SXZHY]
-
- SFX A Y 1
- SFX A Y IES [^AEIOU]Y
-
- SFX \ N 1
- SFX \ 0 Y/L [^Y]
-
--- 0 ----
*** a/src/backend/tsearch/ispell_sample.affix
--- /dev/null
***************
*** 1,26 ****
- compoundwords controlled Z
-
- prefixes
-
- flag *B:
- . > RE # As in enter > reenter
-
- flag U:
- . > UN # As in natural > unnatural
-
- suffixes
-
- flag *J:
- [^E] > INGS # As in cross > crossings
-
- flag *G:
- [^E] > ING # As in cross > crossing
-
- flag *S:
- [^SXZHY] > S # As in bat > bats
-
- flag *A:
- [^AEIOU]Y > -Y,IES # As in imply > implies
-
- flag ~\\:
- [^Y] > Y #~ advarsel > advarsely-
--- 0 ----
*** a/src/backend/tsearch/ispell_sample.dict
--- /dev/null
***************
*** 1,8 ****
- book/GJUS
- booking/SB
- footballklubber
- foot/ZS
- football/Z
- ball/SZ\
- klubber/Z
- sky/A
--- 0 ----
*** a/src/backend/tsearch/spell.c
--- b/src/backend/tsearch/spell.c
***************
*** 5,10 ****
--- 5,56 ----
*
* Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
*
+ * Ispell dictionary
+ * --------------------------------
+ *
+ * Rules of dictionaries are defined in two files with .affix and .dict
+ * extensions. They are used by spell checker programs Ispell and Hunspell.
+ *
+ * An .affix file declares morphological rules to get a basic form of words.
+ * The format of an .affix file has different structure for Ispell and Hunspell
+ * dictionaries. The Hunspell format is more complicated. But when an .affix
+ * file is imported and compiled, it is stored in the same structure AffixNode.
+ *
+ * A .dict file stores a list of basic forms of words with references to
+ * affix rules. The format of a .dict file has the same structure for Ispell
+ * and Hunspell dictionaries.
+ *
+ * Compilation of a dictionary
+ * ---------------------------
+ *
+ * A compiled dictionary is stored in the IspellDict structure. Compilation of
+ * a dictionary is divided into several steps:
+ * - NIImportDictionary() - stores each word of a .dict file in the
+ * temporary Spell field.
+ * - NIImportAffixes() - stores affix rules of an .affix file in the
+ * Affix field (not temporary) if an .affix file has the Ispell format.
+ * -> NIImportOOAffixes() - stores affix rules if an .affix file has the
+ * Hunspell format. The AffixData field is initialized if AF parameter
+ * is defined.
+ * - NISortDictionary() - builds a prefix tree (Trie) from the word list
+ * and stores it in the Dictionary field. The word list is taken from the
+ * Spell field. The AffixData field is initialized if AF parameter is not defined.
+ * - NISortAffixes():
+ * - builds a list of compound affixes from the affix list and stores it
+ * in the CompoundAffix.
+ * - builds prefix trees (Trie) from the affix list for prefixes and suffixes
+ * and stores them in Suffix and Prefix fields.
+ * The affix list is taken from the Affix field.
+ *
+ * Memory management
+ * -----------------
+ *
+ * The IspellDict structure has the Spell field which is used only at compile
+ * time. The Spell field stores a word list. It can take a lot of memory.
+ * Therefore when a dictionary is compiled this field is cleared by NIFinishBuild().
+ *
+ * All resources which should be cleared by NIFinishBuild() are initialized using
+ * tmpalloc() and tmpalloc0().
*
* IDENTIFICATION
* src/backend/tsearch/spell.c
***************
*** 153,159 **** cmpspell(const void *s1, const void *s2)
static int
cmpspellaffix(const void *s1, const void *s2)
{
! return (strncmp((*(SPELL *const *) s1)->p.flag, (*(SPELL *const *) s2)->p.flag, MAXFLAGLEN));
}
static char *
--- 199,205 ----
static int
cmpspellaffix(const void *s1, const void *s2)
{
! return (strcmp((*(SPELL *const *) s1)->p.flag, (*(SPELL *const *) s2)->p.flag));
}
static char *
***************
*** 220,225 **** strbncmp(const unsigned char *s1, const unsigned char *s2, size_t count)
--- 266,276 ----
return 0;
}
+ /*
+ * Compares affixes.
+ * First compares the type of an affix. Prefixes should go before suffixes.
+ * If types are equal then compares the replaceable string.
+ */
static int
cmpaffix(const void *s1, const void *s2)
{
***************
*** 237,242 **** cmpaffix(const void *s1, const void *s2)
--- 288,426 ----
(const unsigned char *) a2->repl);
}
+ /*
+ * Gets an affix flag from string representation (a set of affixes).
+ *
+ * Several flags can be stored in a single string. Flags can be represented by:
+ * - 1 character (FM_CHAR).
+ * - 2 characters (FM_LONG).
+ * - numbers from 1 to 65000 (FM_NUM).
+ *
+ * Depending on the flagMode an affix string can have the following format:
+ * - FM_CHAR: ABCD
+ * Here we have 4 flags: A, B, C and D
+ * - FM_LONG: ABCDE*
+ * Here we have 3 flags: AB, CD and E*
+ * - FM_NUM: 200,205,50
+ * Here we have 3 flags: 200, 205 and 50
+ *
+ * Conf: current dictionary.
+ * sflag: string representation (a set of affixes) of an affix flag.
+ * sflagnext: returns reference to the start of the next affix flag in the sflag.
+ *
+ * Returns an integer representation of the affix flag.
+ */
+ static unsigned short
+ DecodeFlag(IspellDict *Conf, char *sflag, char **sflagnext)
+ {
+ unsigned short s;
+ char *next;
+
+ switch (Conf->flagMode)
+ {
+ case FM_LONG:
+ if ((int)sflag[0] > FLAGCHAR_MAXSIZE || (int)sflag[1] > FLAGCHAR_MAXSIZE)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONFIG_FILE_ERROR),
+ errmsg("invalid affix flag \"%s\"", sflag)));
+
+ s = (int)sflag[0] << 8 | (int)sflag[1];
+ if (sflagnext)
+ /* Go to start of the next flag */
+ *sflagnext = sflag + pg_mblen(sflag) * 2;
+ break;
+ case FM_NUM:
+ s = (unsigned short) strtol(sflag, &next, 10);
+ if (s > FLAGNUM_MAXSIZE)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONFIG_FILE_ERROR),
+ errmsg("invalid affix flag \"%s\"", sflag)));
+
+ if (sflagnext)
+ {
+ /* Go to start of the next flag */
+ if (next)
+ {
+ *sflagnext = next;
+ while (**sflagnext)
+ {
+ if (**sflagnext == ',')
+ {
+ /* Found start of the next flag */
+ *sflagnext += pg_mblen(*sflagnext);
+ break;
+ }
+ *sflagnext += pg_mblen(*sflagnext);
+ }
+ }
+ else
+ *sflagnext = 0;
+ }
+ break;
+ default:
+ s = (unsigned short) *((unsigned char *)sflag);
+ if (s > FLAGCHAR_MAXSIZE)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONFIG_FILE_ERROR),
+ errmsg("invalid affix flag \"%s\"", sflag)));
+
+ if (sflagnext)
+ /* Go to start of the next flag */
+ *sflagnext = sflag + pg_mblen(sflag);
+ }
+
+ return s;
+ }
+
+ /*
+ * Checks if the affix set Conf->AffixData[affix] contains affixflag.
+ * Conf->AffixData[affix] is the string representation of a set of affix flags.
+ * Conf->AffixData[affix] does not contain affixflag if this flag is not
+ * actually used by the .dict file.
+ *
+ * Conf: current dictionary.
+ * affix: index of the Conf->AffixData array.
+ * affixflag: integer representation of the affix flag.
+ *
+ * Returns true if the string Conf->AffixData[affix] contains affixflag,
+ * otherwise returns false.
+ */
+ static bool
+ IsAffixFlagInUse(IspellDict *Conf, int affix, unsigned short affixflag)
+ {
+ char *flagcur;
+ char *flagnext = 0;
+
+ if (affixflag == 0)
+ return true;
+
+ flagcur = Conf->AffixData[affix];
+
+ while (*flagcur)
+ {
+ /* Compare first affix flag in flagcur with affixflag */
+ if (DecodeFlag(Conf, flagcur, &flagnext) == affixflag)
+ return true;
+ /* Otherwise go to next flag */
+ if (flagnext)
+ flagcur = flagnext;
+ /* If there are no more flags then exit */
+ else
+ break;
+ }
+
+ /* Could not find affixflag */
+ return false;
+ }
+
+ /*
+ * Adds the new word into the temporary array Spell.
+ *
+ * Conf: current dictionary.
+ * word: new word.
+ * flag: set of affix flags. Integer representation of a flag can be obtained
+ * with DecodeFlag().
+ */
static void
NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
{
***************
*** 255,268 **** NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
}
Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
strcpy(Conf->Spell[Conf->nspell]->word, word);
! strlcpy(Conf->Spell[Conf->nspell]->p.flag, flag, MAXFLAGLEN);
Conf->nspell++;
}
/*
! * import dictionary
*
! * Note caller must already have applied get_tsearch_config_filename
*/
void
NIImportDictionary(IspellDict *Conf, const char *filename)
--- 439,455 ----
}
Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
strcpy(Conf->Spell[Conf->nspell]->word, word);
! Conf->Spell[Conf->nspell]->p.flag = (*flag != '\0') ? cpstrdup(Conf, flag) : VoidString;
Conf->nspell++;
}
/*
! * Imports dictionary into the temporary array Spell.
*
! * Note caller must already have applied get_tsearch_config_filename.
! *
! * Conf: current dictionary.
! * filename: path to the .dict file.
*/
void
NIImportDictionary(IspellDict *Conf, const char *filename)
***************
*** 280,285 **** NIImportDictionary(IspellDict *Conf, const char *filename)
--- 467,473 ----
{
char *s,
*pstr;
+ /* Set of affix flags */
const char *flag;
/* Extract flag from the line */
***************
*** 324,330 **** NIImportDictionary(IspellDict *Conf, const char *filename)
tsearch_readline_end(&trst);
}
!
static int
FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
{
--- 512,541 ----
tsearch_readline_end(&trst);
}
! /*
! * Searches a basic form of word in the prefix tree. This word was generated
! * using an affix rule. This rule may not be present in the affix set of
! * the basic form of the word.
! *
! * For example, we have the entry in the .dict file:
! * meter/GMD
! *
! * The affix rule with the flag S:
! * SFX S y ies [^aeiou]y
! * is not present here.
! *
! * The affix rule with the flag M:
! * SFX M 0 's .
! * is present here.
! *
! * Conf: current dictionary.
! * word: basic form of word.
! * affixflag: integer representation of the affix flag, by which a basic form of
! * word was generated.
! * flag: compound flag used to compare with StopMiddle->compoundflag.
! *
! * Returns 1 if the word was found in the prefix tree, else returns 0.
! */
static int
FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
{
***************
*** 349,361 **** FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
{
if (flag == 0)
{
if (StopMiddle->compoundflag & FF_COMPOUNDONLY)
return 0;
}
else if ((flag & StopMiddle->compoundflag) == 0)
return 0;
! if ((affixflag == 0) || (strchr(Conf->AffixData[StopMiddle->affix], affixflag) != NULL))
return 1;
}
node = StopMiddle->node;
--- 560,581 ----
{
if (flag == 0)
{
+ /*
+ * The word can be used only as part of a compound word,
+ * but the flag parameter does not indicate
+ * that we are searching for compound words.
+ */
if (StopMiddle->compoundflag & FF_COMPOUNDONLY)
return 0;
}
else if ((flag & StopMiddle->compoundflag) == 0)
return 0;
! /*
! * Check if this affix rule is present in the affix set
! * with index StopMiddle->affix.
! */
! if (IsAffixFlagInUse(Conf, StopMiddle->affix, affixflag))
return 1;
}
node = StopMiddle->node;
***************
*** 373,378 **** FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
--- 593,616 ----
return 0;
}
+ /*
+ * Adds a new affix rule to the Affix field.
+ *
+ * Conf: current dictionary.
+ * flag: integer representation of the affix flag ('\' in the example below).
+ * flagflags: set of flags from the flagval field for this affix rule. This set
+ * is listed after the '/' character in the added string (repl).
+ *
+ * For example, the L flag in hunspell_sample.affix:
+ * SFX \ 0 Y/L [^Y]
+ *
+ * mask: condition for search ('[^Y]' in the above example).
+ * find: characters to strip from the beginning (for a prefix) or the end
+ * (for a suffix) of the word ('0' in the above example; 0 means that
+ * nothing is stripped).
+ * repl: string to add after stripping ('Y' in the above example).
+ * type: FF_SUFFIX or FF_PREFIX.
+ */
static void
NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const char *find, const char *repl, int type)
{
***************
*** 394,411 **** NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
Affix = Conf->Affix + Conf->naffixes;
! if (strcmp(mask, ".") == 0)
{
Affix->issimple = 1;
Affix->isregis = 0;
}
else if (RS_isRegis(mask))
{
Affix->issimple = 0;
Affix->isregis = 1;
! RS_compile(&(Affix->reg.regis), (type == FF_SUFFIX) ? true : false,
*mask ? mask : VoidString);
}
else
{
int masklen;
--- 632,652 ----
Affix = Conf->Affix + Conf->naffixes;
! /* This affix rule can be applied for words with any ending */
! if (strcmp(mask, ".") == 0 || *mask == '\0')
{
Affix->issimple = 1;
Affix->isregis = 0;
}
+ /* This affix rule will use regis to search word ending */
else if (RS_isRegis(mask))
{
Affix->issimple = 0;
Affix->isregis = 1;
! RS_compile(&(Affix->reg.regis), (type == FF_SUFFIX),
*mask ? mask : VoidString);
}
+ /* This affix rule will use regex_t to search word ending */
else
{
int masklen;
***************
*** 457,463 **** NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
Conf->naffixes++;
}
-
/* Parsing states for parse_affentry() and friends */
#define PAE_WAIT_MASK 0
#define PAE_INMASK 1
--- 698,703 ----
***************
*** 712,720 **** parse_affentry(char *str, char *mask, char *find, char *repl)
*pmask = *pfind = *prepl = '\0';
! return (*mask && (*find || *repl)) ? true : false;
}
static void
addFlagValue(IspellDict *Conf, char *s, uint32 val)
{
--- 952,967 ----
*pmask = *pfind = *prepl = '\0';
! return (*mask && (*find || *repl));
}
+ /*
+ * Sets up a correspondence for the affix parameter with the affix flag.
+ *
+ * Conf: current dictionary.
+ * s: affix flag in string.
+ * val: affix parameter.
+ */
static void
addFlagValue(IspellDict *Conf, char *s, uint32 val)
{
***************
*** 731,742 **** addFlagValue(IspellDict *Conf, char *s, uint32 val)
(errcode(ERRCODE_CONFIG_FILE_ERROR),
errmsg("multibyte flag character is not allowed")));
! Conf->flagval[*(unsigned char *) s] = (unsigned char) val;
Conf->usecompound = true;
}
/*
! * Import an affix file that follows MySpell or Hunspell format
*/
static void
NIImportOOAffixes(IspellDict *Conf, const char *filename)
--- 978,1043 ----
(errcode(ERRCODE_CONFIG_FILE_ERROR),
errmsg("multibyte flag character is not allowed")));
! Conf->flagval[DecodeFlag(Conf, s, (char **)NULL)] = (unsigned char) val;
Conf->usecompound = true;
}
/*
! * Returns a set of affix parameters which correspond to the set of affix
! * flags s.
! */
! static int
! getFlagValues(IspellDict *Conf, char *s)
! {
! uint32 flag = 0;
! char *flagcur;
! char *flagnext = 0;
!
! flagcur = s;
! while (*flagcur)
! {
! flag |= Conf->flagval[DecodeFlag(Conf, flagcur, &flagnext)];
! if (flagnext)
! flagcur = flagnext;
! else
! break;
! }
!
! return flag;
! }
!
! /*
! * Returns a flag set using the s parameter.
! *
! * If Conf->useFlagAliases is true then the s parameter is an index into the
! * Conf->AffixData array and the function returns the corresponding entry.
! * Otherwise the function returns the s parameter itself.
! */
! static char *
! getFlags(IspellDict *Conf, char *s)
! {
! int curaffix;
! if (Conf->useFlagAliases)
! {
! curaffix = strtol(s, (char **)NULL, 10);
! if (curaffix && curaffix <= Conf->nAffixData)
! /*
! * Do not subtract 1 from curaffix
! * because an empty string was added in NIImportOOAffixes
! */
! return Conf->AffixData[curaffix];
! else
! return VoidString;
! }
! else
! return s;
! }
!
! /*
! * Import an affix file that follows MySpell or Hunspell format.
! *
! * Conf: current dictionary.
! * filename: path to the .affix file.
*/
static void
NIImportOOAffixes(IspellDict *Conf, const char *filename)
***************
*** 751,757 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
char repl[BUFSIZ],
*prepl;
bool isSuffix = false;
! int flag = 0;
char flagflags = 0;
tsearch_readline_state trst;
char *recoded;
--- 1052,1061 ----
char repl[BUFSIZ],
*prepl;
bool isSuffix = false;
! int naffix = 0,
! curaffix = 0;
! int flag = 0,
! sflaglen = 0;
char flagflags = 0;
tsearch_readline_state trst;
char *recoded;
***************
*** 759,764 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
--- 1063,1070 ----
/* read file to find any flag */
memset(Conf->flagval, 0, sizeof(Conf->flagval));
Conf->usecompound = false;
+ Conf->useFlagAliases = false;
+ Conf->flagMode = FM_CHAR;
if (!tsearch_readline_begin(&trst, filename))
ereport(ERROR,
***************
*** 806,815 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
while (*s && t_isspace(s))
s += pg_mblen(s);
! if (*s && STRNCMP(s, "default") != 0)
! ereport(ERROR,
(errcode(ERRCODE_CONFIG_FILE_ERROR),
! errmsg("Ispell dictionary supports only default flag value")));
}
pfree(recoded);
--- 1112,1128 ----
while (*s && t_isspace(s))
s += pg_mblen(s);
! if (*s)
! {
! if (STRNCMP(s, "long") == 0)
! Conf->flagMode = FM_LONG;
! else if (STRNCMP(s, "num") == 0)
! Conf->flagMode = FM_NUM;
! else if (STRNCMP(s, "default") != 0)
! ereport(ERROR,
(errcode(ERRCODE_CONFIG_FILE_ERROR),
! errmsg("Ispell dictionary supports only default, long and num flag value")));
! }
}
pfree(recoded);
***************
*** 834,860 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
if (ptype)
pfree(ptype);
ptype = lowerstr_ctx(Conf, type);
if (fields_read < 4 ||
(STRNCMP(ptype, "sfx") != 0 && STRNCMP(ptype, "pfx") != 0))
goto nextline;
if (fields_read == 4)
{
! if (strlen(sflag) != 1)
! goto nextline;
! flag = *sflag;
! isSuffix = (STRNCMP(ptype, "sfx") == 0) ? true : false;
if (t_iseq(find, 'y') || t_iseq(find, 'Y'))
flagflags = FF_CROSSPRODUCT;
else
flagflags = 0;
}
else
{
char *ptr;
int aflg = 0;
! if (strlen(sflag) != 1 || flag != *sflag || flag == 0)
goto nextline;
prepl = lowerstr_ctx(Conf, repl);
/* Find position of '/' in lowercased string "prepl" */
--- 1147,1223 ----
if (ptype)
pfree(ptype);
ptype = lowerstr_ctx(Conf, type);
+
+ /* First try to parse AF parameter (alias compression) */
+ if (STRNCMP(ptype, "af") == 0)
+ {
+ /* First line is the number of aliases */
+ if (!Conf->useFlagAliases)
+ {
+ Conf->useFlagAliases = true;
+ naffix = atoi(sflag);
+ if (naffix == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONFIG_FILE_ERROR),
+ errmsg("invalid number of flag vector aliases")));
+
+ /* Also reserve place for empty flag set */
+ naffix++;
+
+ Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
+ Conf->lenAffixData = Conf->nAffixData = naffix;
+
+ /* Add empty flag set into AffixData */
+ Conf->AffixData[curaffix] = VoidString;
+ curaffix++;
+ }
+ /* Other lines are aliases */
+ else
+ {
+ if (curaffix < naffix)
+ {
+ Conf->AffixData[curaffix] = cpstrdup(Conf, sflag);
+ curaffix++;
+ }
+ }
+ goto nextline;
+ }
+ /* Else try to parse prefixes and suffixes */
if (fields_read < 4 ||
(STRNCMP(ptype, "sfx") != 0 && STRNCMP(ptype, "pfx") != 0))
goto nextline;
+ sflaglen = strlen(sflag);
+ if (sflaglen == 0
+ || (sflaglen > 1 && Conf->flagMode == FM_CHAR)
+ || (sflaglen > 2 && Conf->flagMode == FM_LONG))
+ goto nextline;
+
+ /*
+ * Affix header. For example:
+ * SFX \ N 1
+ */
if (fields_read == 4)
{
! /* Convert the affix flag to int */
! flag = DecodeFlag(Conf, sflag, (char **)NULL);
!
! isSuffix = (STRNCMP(ptype, "sfx") == 0);
if (t_iseq(find, 'y') || t_iseq(find, 'Y'))
flagflags = FF_CROSSPRODUCT;
else
flagflags = 0;
}
+ /*
+ * Affix fields. For example:
+ * SFX \ 0 Y/L [^Y]
+ */
else
{
char *ptr;
int aflg = 0;
! if (flag == 0)
goto nextline;
prepl = lowerstr_ctx(Conf, repl);
/* Find position of '/' in lowercased string "prepl" */
***************
*** 866,876 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
*/
*ptr = '\0';
ptr = repl + (ptr - prepl) + 1;
! while (*ptr)
! {
! aflg |= Conf->flagval[*(unsigned char *) ptr];
! ptr++;
! }
}
pfind = lowerstr_ctx(Conf, find);
pmask = lowerstr_ctx(Conf, mask);
--- 1229,1235 ----
*/
*ptr = '\0';
ptr = repl + (ptr - prepl) + 1;
! aflg |= getFlagValues(Conf, getFlags(Conf, ptr));
}
pfind = lowerstr_ctx(Conf, find);
pmask = lowerstr_ctx(Conf, mask);
***************
*** 928,933 **** NIImportAffixes(IspellDict *Conf, const char *filename)
--- 1287,1294 ----
memset(Conf->flagval, 0, sizeof(Conf->flagval));
Conf->usecompound = false;
+ Conf->useFlagAliases = false;
+ Conf->flagMode = FM_CHAR;
while ((recoded = tsearch_readline(&trst)) != NULL)
{
***************
*** 1044,1049 **** isnewformat:
--- 1405,1415 ----
NIImportOOAffixes(Conf, filename);
}
+ /*
+ * Merges two affix flag sets and stores a new affix flag set into Conf->AffixData.
+ *
+ * Returns index of a new affix flag set.
+ */
static int
MergeAffix(IspellDict *Conf, int a1, int a2)
{
***************
*** 1068,1088 **** MergeAffix(IspellDict *Conf, int a1, int a2)
return Conf->nAffixData - 1;
}
static uint32
makeCompoundFlags(IspellDict *Conf, int affix)
{
! uint32 flag = 0;
! char *str = Conf->AffixData[affix];
!
! while (str && *str)
! {
! flag |= Conf->flagval[*(unsigned char *) str];
! str++;
! }
!
! return (flag & FF_DICTFLAGMASK);
}
static SPNode *
mkSPNode(IspellDict *Conf, int low, int high, int level)
{
--- 1434,1458 ----
return Conf->nAffixData - 1;
}
+ /*
+ * Returns a set of affix parameters which correspond to the set of affix
+ * flags with the given index.
+ */
static uint32
makeCompoundFlags(IspellDict *Conf, int affix)
{
! char *str = Conf->AffixData[affix];
! return (getFlagValues(Conf, str) & FF_DICTFLAGMASK);
}
+ /*
+ * Makes a prefix tree for the given level.
+ *
+ * Conf: current dictionary.
+ * low: lower index of the Conf->Spell array.
+ * high: upper index of the Conf->Spell array.
+ * level: current prefix tree level.
+ */
static SPNode *
mkSPNode(IspellDict *Conf, int low, int high, int level)
{
***************
*** 1115,1120 **** mkSPNode(IspellDict *Conf, int low, int high, int level)
--- 1485,1491 ----
{
if (lastchar)
{
+ /* Next level of the prefix tree */
data->node = mkSPNode(Conf, lownew, i, level + 1);
lownew = i;
data++;
***************
*** 1154,1159 **** mkSPNode(IspellDict *Conf, int low, int high, int level)
--- 1525,1531 ----
}
}
+ /* Next level of the prefix tree */
data->node = mkSPNode(Conf, lownew, high, level + 1);
return rs;
***************
*** 1172,1215 **** NISortDictionary(IspellDict *Conf)
/* compress affixes */
- /* Count the number of different flags used in the dictionary */
-
- qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspellaffix);
-
- naffix = 0;
- for (i = 0; i < Conf->nspell; i++)
- {
- if (i == 0 || strncmp(Conf->Spell[i]->p.flag, Conf->Spell[i - 1]->p.flag, MAXFLAGLEN))
- naffix++;
- }
-
/*
! * Fill in Conf->AffixData with the affixes that were used in the
! * dictionary. Replace textual flag-field of Conf->Spell entries with
! * indexes into Conf->AffixData array.
*/
! Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
!
! curaffix = -1;
! for (i = 0; i < Conf->nspell; i++)
{
! if (i == 0 || strncmp(Conf->Spell[i]->p.flag, Conf->AffixData[curaffix], MAXFLAGLEN))
{
! curaffix++;
! Assert(curaffix < naffix);
! Conf->AffixData[curaffix] = cpstrdup(Conf, Conf->Spell[i]->p.flag);
}
-
- Conf->Spell[i]->p.d.affix = curaffix;
- Conf->Spell[i]->p.d.len = strlen(Conf->Spell[i]->word);
}
! Conf->lenAffixData = Conf->nAffixData = naffix;
qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspell);
Conf->Dictionary = mkSPNode(Conf, 0, Conf->nspell, 0);
}
static AffixNode *
mkANode(IspellDict *Conf, int low, int high, int level, int type)
{
--- 1544,1622 ----
/* compress affixes */
/*
! * If we use flag aliases then we need to use Conf->AffixData filled
! * in the NIImportOOAffixes().
*/
! if (Conf->useFlagAliases)
{
! for (i = 0; i < Conf->nspell; i++)
{
! curaffix = strtol(Conf->Spell[i]->p.flag, (char **)NULL, 10);
! if (curaffix && curaffix <= Conf->nAffixData)
! Conf->Spell[i]->p.d.affix = curaffix;
! else
! /*
! * If Conf->Spell[i]->p.flag is empty, then get empty value of
! * Conf->AffixData (0 index).
! */
! Conf->Spell[i]->p.d.affix = 0;
! Conf->Spell[i]->p.d.len = strlen(Conf->Spell[i]->word);
}
}
+ /* Otherwise fill Conf->AffixData here */
+ else
+ {
+ /* Count the number of different flags used in the dictionary */
+ qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspellaffix);
+
+ naffix = 0;
+ for (i = 0; i < Conf->nspell; i++)
+ {
+ if (i == 0 || strcmp(Conf->Spell[i]->p.flag, Conf->Spell[i - 1]->p.flag))
+ naffix++;
+ }
! /*
! * Fill in Conf->AffixData with the affixes that were used in the
! * dictionary. Replace textual flag-field of Conf->Spell entries with
! * indexes into Conf->AffixData array.
! */
! Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
!
! curaffix = -1;
! for (i = 0; i < Conf->nspell; i++)
! {
! if (i == 0 || strcmp(Conf->Spell[i]->p.flag, Conf->AffixData[curaffix]))
! {
! curaffix++;
! Assert(curaffix < naffix);
! Conf->AffixData[curaffix] = cpstrdup(Conf, Conf->Spell[i]->p.flag);
! }
!
! Conf->Spell[i]->p.d.affix = curaffix;
! Conf->Spell[i]->p.d.len = strlen(Conf->Spell[i]->word);
! }
!
! Conf->lenAffixData = Conf->nAffixData = naffix;
! }
+ /* Start building a prefix tree */
qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspell);
Conf->Dictionary = mkSPNode(Conf, 0, Conf->nspell, 0);
}
+ /*
+ * Makes a prefix tree for the given level using the repl string of an affix rule.
+ * Affixes with an empty replace string are not included in the prefix tree.
+ * These affixes are handled by mkVoidAffix().
+ *
+ * Conf: current dictionary.
+ * low: lower index of the Conf->Affix array.
+ * high: upper index of the Conf->Affix array.
+ * level: current prefix tree level.
+ * type: FF_SUFFIX or FF_PREFIX.
+ */
static AffixNode *
mkANode(IspellDict *Conf, int low, int high, int level, int type)
{
***************
*** 1247,1252 **** mkANode(IspellDict *Conf, int low, int high, int level, int type)
--- 1654,1660 ----
{
if (lastchar)
{
+ /* Next level of the prefix tree */
data->node = mkANode(Conf, lownew, i, level + 1, type);
if (naff)
{
***************
*** 1267,1272 **** mkANode(IspellDict *Conf, int low, int high, int level, int type)
--- 1675,1681 ----
}
}
+ /* Next level of the prefix tree */
data->node = mkANode(Conf, lownew, high, level + 1, type);
if (naff)
{
***************
*** 1281,1286 **** mkANode(IspellDict *Conf, int low, int high, int level, int type)
--- 1690,1699 ----
return rs;
}
+ /*
+ * Makes the root void node in the prefix tree. The root void node is created
+ * for affixes which have empty replace string ("repl" field).
+ */
static void
mkVoidAffix(IspellDict *Conf, bool issuffix, int startsuffix)
{
***************
*** 1304,1314 **** mkVoidAffix(IspellDict *Conf, bool issuffix, int startsuffix)
Conf->Prefix = Affix;
}
!
for (i = start; i < end; i++)
if (Conf->Affix[i].replen == 0)
cnt++;
if (cnt == 0)
return;
--- 1717,1728 ----
Conf->Prefix = Affix;
}
! /* Count affixes with empty replace string */
for (i = start; i < end; i++)
if (Conf->Affix[i].replen == 0)
cnt++;
+ /* There are no affixes with an empty replace string */
if (cnt == 0)
return;
***************
*** 1324,1341 **** mkVoidAffix(IspellDict *Conf, bool issuffix, int startsuffix)
}
}
static bool
! isAffixInUse(IspellDict *Conf, char flag)
{
int i;
for (i = 0; i < Conf->nAffixData; i++)
! if (strchr(Conf->AffixData[i], flag) != NULL)
return true;
return false;
}
void
NISortAffixes(IspellDict *Conf)
{
--- 1738,1768 ----
}
}
+ /*
+ * Checks if the affixflag is used by the dictionary. Conf->AffixData does not
+ * contain affixflag if this flag is not actually used by the .dict file.
+ *
+ * Conf: current dictionary.
+ * affixflag: integer representation of the affix flag.
+ *
+ * Returns true if the Conf->AffixData array contains affixflag, otherwise
+ * returns false.
+ */
static bool
! isAffixInUse(IspellDict *Conf, unsigned short affixflag)
{
int i;
for (i = 0; i < Conf->nAffixData; i++)
! if (IsAffixFlagInUse(Conf, i, affixflag))
return true;
return false;
}
+ /*
+ * Builds Conf->Prefix and Conf->Suffix trees from the imported affixes.
+ */
void
NISortAffixes(IspellDict *Conf)
{
***************
*** 1347,1352 **** NISortAffixes(IspellDict *Conf)
--- 1774,1780 ----
if (Conf->naffixes == 0)
return;
+ /* Store compound affixes in the Conf->CompoundAffix array */
if (Conf->naffixes > 1)
qsort((void *) Conf->Affix, Conf->naffixes, sizeof(AFFIX), cmpaffix);
Conf->CompoundAffix = ptr = (CMPDAffix *) palloc(sizeof(CMPDAffix) * Conf->naffixes);
***************
*** 1359,1365 **** NISortAffixes(IspellDict *Conf)
firstsuffix = i;
if ((Affix->flagflags & FF_COMPOUNDFLAG) && Affix->replen > 0 &&
! isAffixInUse(Conf, (char) Affix->flag))
{
if (ptr == Conf->CompoundAffix ||
ptr->issuffix != (ptr - 1)->issuffix ||
--- 1787,1793 ----
firstsuffix = i;
if ((Affix->flagflags & FF_COMPOUNDFLAG) && Affix->replen > 0 &&
! isAffixInUse(Conf, Affix->flag))
{
if (ptr == Conf->CompoundAffix ||
ptr->issuffix != (ptr - 1)->issuffix ||
***************
*** 1370,1376 **** NISortAffixes(IspellDict *Conf)
/* leave only unique and minimals suffixes */
ptr->affix = Affix->repl;
ptr->len = Affix->replen;
! ptr->issuffix = (Affix->type == FF_SUFFIX) ? true : false;
ptr++;
}
}
--- 1798,1804 ----
/* leave only unique and minimals suffixes */
ptr->affix = Affix->repl;
ptr->len = Affix->replen;
! ptr->issuffix = (Affix->type == FF_SUFFIX);
ptr++;
}
}
***************
*** 1378,1383 **** NISortAffixes(IspellDict *Conf)
--- 1806,1812 ----
ptr->affix = NULL;
Conf->CompoundAffix = (CMPDAffix *) repalloc(Conf->CompoundAffix, sizeof(CMPDAffix) * (ptr - Conf->CompoundAffix + 1));
+ /* Start building a prefix tree */
Conf->Prefix = mkANode(Conf, 0, firstsuffix, 0, FF_PREFIX);
Conf->Suffix = mkANode(Conf, firstsuffix, Conf->naffixes, 0, FF_SUFFIX);
mkVoidAffix(Conf, true, firstsuffix);
***************
*** 1825,1831 **** SplitToVariants(IspellDict *Conf, SPNode *snode, SplitVar *orig, char *word, int
if (StopLow < StopHigh)
{
! if (level == FF_COMPOUNDBEGIN)
compoundflag = FF_COMPOUNDBEGIN;
else if (level == wordlen - 1)
compoundflag = FF_COMPOUNDLAST;
--- 2254,2260 ----
if (StopLow < StopHigh)
{
! if (startpos == 0)
compoundflag = FF_COMPOUNDBEGIN;
else if (level == wordlen - 1)
compoundflag = FF_COMPOUNDLAST;
*** a/src/backend/tsearch/synonym_sample.syn
--- /dev/null
***************
*** 1,5 ****
- postgres pgsql
- postgresql pgsql
- postgre pgsql
- gogle googl
- indices index*
--- 0 ----
*** a/src/backend/tsearch/thesaurus_sample.ths
--- /dev/null
***************
*** 1,17 ****
- #
- # Theasurus config file. Character ':' separates string from replacement, eg
- # sample-words : substitute-words
- #
- # Any substitute-word can be marked by preceding '*' character,
- # which means do not lexize this word
- # Docs: http://www.sai.msu.su/~megera/oddmuse/index.cgi/Thesaurus_dictionary
-
- one two three : *123
- one two : *12
- one : *1
- two : *2
-
- supernovae stars : *sn
- supernovae : *sn
- booking tickets : order invitation cards
- booking ? tickets : order invitation Cards
--- 0 ----
*** a/src/include/tsearch/dicts/spell.h
--- b/src/include/tsearch/dicts/spell.h
***************
*** 19,36 ****
#include "tsearch/ts_public.h"
/*
! * Max length of a flag name. Names longer than this will be truncated
! * to the maximum.
*/
- #define MAXFLAGLEN 16
-
struct SPNode;
typedef struct
{
uint32 val:8,
isword:1,
compoundflag:4,
affix:19;
struct SPNode *node;
} SPNodeData;
--- 19,36 ----
#include "tsearch/ts_public.h"
/*
! * SPNode and SPNodeData are used to represent a prefix tree (trie) that stores
! * a word list.
*/
struct SPNode;
typedef struct
{
uint32 val:8,
isword:1,
+ /* Stores compound flags listed below */
compoundflag:4,
+ /* Reference to an entry of the AffixData field */
affix:19;
struct SPNode *node;
} SPNodeData;
***************
*** 54,72 **** typedef struct SPNode
#define SPNHDRSZ (offsetof(SPNode,data))
!
typedef struct spell_struct
{
union
{
/*
! * flag is filled in by NIImportDictionary. After NISortDictionary, d
! * is valid and flag is invalid.
*/
! char flag[MAXFLAGLEN];
struct
{
int affix;
int len;
} d;
} p;
--- 54,77 ----
#define SPNHDRSZ (offsetof(SPNode,data))
! /*
! * Represents an entry in a words list.
! */
typedef struct spell_struct
{
union
{
/*
! * flag is filled in by NIImportDictionary(). After NISortDictionary(), d
! * is used instead of flag.
*/
! char *flag;
! /* d is used in mkSPNode() */
struct
{
+ /* Reference to an entry of the AffixData field */
int affix;
+ /* Length of the word */
int len;
} d;
} p;
***************
*** 75,84 **** typedef struct spell_struct
#define SPELLHDRSZ (offsetof(SPELL, word))
typedef struct aff_struct
{
! uint32 flag:8,
! type:1,
flagflags:7,
issimple:1,
isregis:1,
--- 80,93 ----
#define SPELLHDRSZ (offsetof(SPELL, word))
+ /*
+ * Represents an entry in an affix list.
+ */
typedef struct aff_struct
{
! uint32 flag:16;
! /* FF_SUFFIX or FF_PREFIX */
! uint32 type:1,
flagflags:7,
issimple:1,
isregis:1,
***************
*** 106,111 **** typedef struct aff_struct
--- 115,124 ----
#define FF_SUFFIX 1
#define FF_PREFIX 0
+ /*
+ * AffixNode and AffixNodeData are used to represent a prefix tree (trie) that
+ * stores an affix list.
+ */
struct AffixNode;
typedef struct
***************
*** 132,137 **** typedef struct
--- 145,160 ----
bool issuffix;
} CMPDAffix;
+ typedef enum
+ {
+ FM_CHAR,
+ FM_LONG,
+ FM_NUM
+ } FlagMode;
+
+ #define FLAGCHAR_MAXSIZE 255
+ #define FLAGNUM_MAXSIZE 65535
+
typedef struct
{
int maffixes;
***************
*** 142,155 **** typedef struct
AffixNode *Prefix;
SPNode *Dictionary;
char **AffixData;
int lenAffixData;
int nAffixData;
CMPDAffix *CompoundAffix;
! unsigned char flagval[256];
bool usecompound;
/*
* Remaining fields are only used during dictionary construction; they are
--- 165,181 ----
AffixNode *Prefix;
SPNode *Dictionary;
+ /* Array of sets of affixes */
char **AffixData;
int lenAffixData;
int nAffixData;
+ bool useFlagAliases;
CMPDAffix *CompoundAffix;
! unsigned char flagval[FLAGNUM_MAXSIZE];
bool usecompound;
+ FlagMode flagMode;
/*
* Remaining fields are only used during dictionary construction; they are
*** a/src/test/regress/expected/tsdicts.out
--- b/src/test/regress/expected/tsdicts.out
***************
*** 191,196 **** SELECT ts_lexize('hunspell', 'footballyklubber');
--- 191,388 ----
{foot,ball,klubber}
(1 row)
+ -- Test ISpell dictionary with hunspell affix file with FLAG long parameter
+ CREATE TEXT SEARCH DICTIONARY hunspell_long (
+ Template=ispell,
+ DictFile=hunspell_sample_long,
+ AffFile=hunspell_sample_long
+ );
+ SELECT ts_lexize('hunspell_long', 'skies');
+ ts_lexize
+ -----------
+ {sky}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'bookings');
+ ts_lexize
+ ----------------
+ {booking,book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'booking');
+ ts_lexize
+ ----------------
+ {booking,book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'foot');
+ ts_lexize
+ -----------
+ {foot}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'foots');
+ ts_lexize
+ -----------
+ {foot}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'rebookings');
+ ts_lexize
+ ----------------
+ {booking,book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'rebooking');
+ ts_lexize
+ ----------------
+ {booking,book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'rebook');
+ ts_lexize
+ -----------
+
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'unbookings');
+ ts_lexize
+ -----------
+ {book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'unbooking');
+ ts_lexize
+ -----------
+ {book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'unbook');
+ ts_lexize
+ -----------
+ {book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'footklubber');
+ ts_lexize
+ ----------------
+ {foot,klubber}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'footballklubber');
+ ts_lexize
+ ------------------------------------------------------
+ {footballklubber,foot,ball,klubber,football,klubber}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'ballyklubber');
+ ts_lexize
+ ----------------
+ {ball,klubber}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'footballyklubber');
+ ts_lexize
+ ---------------------
+ {foot,ball,klubber}
+ (1 row)
+
+ -- Test ISpell dictionary with hunspell affix file with FLAG num parameter
+ CREATE TEXT SEARCH DICTIONARY hunspell_num (
+ Template=ispell,
+ DictFile=hunspell_sample_num,
+ AffFile=hunspell_sample_num
+ );
+ SELECT ts_lexize('hunspell_num', 'skies');
+ ts_lexize
+ -----------
+ {sky}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'bookings');
+ ts_lexize
+ ----------------
+ {booking,book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'booking');
+ ts_lexize
+ ----------------
+ {booking,book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'foot');
+ ts_lexize
+ -----------
+ {foot}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'foots');
+ ts_lexize
+ -----------
+ {foot}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'rebookings');
+ ts_lexize
+ ----------------
+ {booking,book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'rebooking');
+ ts_lexize
+ ----------------
+ {booking,book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'rebook');
+ ts_lexize
+ -----------
+
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'unbookings');
+ ts_lexize
+ -----------
+ {book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'unbooking');
+ ts_lexize
+ -----------
+ {book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'unbook');
+ ts_lexize
+ -----------
+ {book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'footklubber');
+ ts_lexize
+ ----------------
+ {foot,klubber}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'footballklubber');
+ ts_lexize
+ ------------------------------------------------------
+ {footballklubber,foot,ball,klubber,football,klubber}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'ballyklubber');
+ ts_lexize
+ ----------------
+ {ball,klubber}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'footballyklubber');
+ ts_lexize
+ ---------------------
+ {foot,ball,klubber}
+ (1 row)
+
-- Synonim dictionary
CREATE TEXT SEARCH DICTIONARY synonym (
Template=synonym,
***************
*** 277,282 **** SELECT to_tsquery('hunspell_tst', 'footballyklubber:b & rebookings:A & sky');
--- 469,516 ----
'foot':B & 'ball':B & 'klubber':B & ( 'booking':A | 'book':A ) & 'sky'
(1 row)
+ -- Test ispell dictionary with hunspell affix with FLAG long in configuration
+ ALTER TEXT SEARCH CONFIGURATION hunspell_tst ALTER MAPPING
+ REPLACE hunspell WITH hunspell_long;
+ SELECT to_tsvector('hunspell_tst', 'Booking the skies after rebookings for footballklubber from a foot');
+ to_tsvector
+ ----------------------------------------------------------------------------------------------------
+ 'ball':7 'book':1,5 'booking':1,5 'foot':7,10 'football':7 'footballklubber':7 'klubber':7 'sky':3
+ (1 row)
+
+ SELECT to_tsquery('hunspell_tst', 'footballklubber');
+ to_tsquery
+ ------------------------------------------------------------------------------
+ ( 'footballklubber' | 'foot' & 'ball' & 'klubber' ) | 'football' & 'klubber'
+ (1 row)
+
+ SELECT to_tsquery('hunspell_tst', 'footballyklubber:b & rebookings:A & sky');
+ to_tsquery
+ ------------------------------------------------------------------------
+ 'foot':B & 'ball':B & 'klubber':B & ( 'booking':A | 'book':A ) & 'sky'
+ (1 row)
+
+ -- Test ispell dictionary with hunspell affix with FLAG num in configuration
+ ALTER TEXT SEARCH CONFIGURATION hunspell_tst ALTER MAPPING
+ REPLACE hunspell_long WITH hunspell_num;
+ SELECT to_tsvector('hunspell_tst', 'Booking the skies after rebookings for footballklubber from a foot');
+ to_tsvector
+ ----------------------------------------------------------------------------------------------------
+ 'ball':7 'book':1,5 'booking':1,5 'foot':7,10 'football':7 'footballklubber':7 'klubber':7 'sky':3
+ (1 row)
+
+ SELECT to_tsquery('hunspell_tst', 'footballklubber');
+ to_tsquery
+ ------------------------------------------------------------------------------
+ ( 'footballklubber' | 'foot' & 'ball' & 'klubber' ) | 'football' & 'klubber'
+ (1 row)
+
+ SELECT to_tsquery('hunspell_tst', 'footballyklubber:b & rebookings:A & sky');
+ to_tsquery
+ ------------------------------------------------------------------------
+ 'foot':B & 'ball':B & 'klubber':B & ( 'booking':A | 'book':A ) & 'sky'
+ (1 row)
+
-- Test synonym dictionary in configuration
CREATE TEXT SEARCH CONFIGURATION synonym_tst (
COPY=english
*** a/src/test/regress/sql/tsdicts.sql
--- b/src/test/regress/sql/tsdicts.sql
***************
*** 48,53 **** SELECT ts_lexize('hunspell', 'footballklubber');
--- 48,101 ----
SELECT ts_lexize('hunspell', 'ballyklubber');
SELECT ts_lexize('hunspell', 'footballyklubber');
+ -- Test ISpell dictionary with hunspell affix file with FLAG long parameter
+ CREATE TEXT SEARCH DICTIONARY hunspell_long (
+ Template=ispell,
+ DictFile=hunspell_sample_long,
+ AffFile=hunspell_sample_long
+ );
+
+ SELECT ts_lexize('hunspell_long', 'skies');
+ SELECT ts_lexize('hunspell_long', 'bookings');
+ SELECT ts_lexize('hunspell_long', 'booking');
+ SELECT ts_lexize('hunspell_long', 'foot');
+ SELECT ts_lexize('hunspell_long', 'foots');
+ SELECT ts_lexize('hunspell_long', 'rebookings');
+ SELECT ts_lexize('hunspell_long', 'rebooking');
+ SELECT ts_lexize('hunspell_long', 'rebook');
+ SELECT ts_lexize('hunspell_long', 'unbookings');
+ SELECT ts_lexize('hunspell_long', 'unbooking');
+ SELECT ts_lexize('hunspell_long', 'unbook');
+
+ SELECT ts_lexize('hunspell_long', 'footklubber');
+ SELECT ts_lexize('hunspell_long', 'footballklubber');
+ SELECT ts_lexize('hunspell_long', 'ballyklubber');
+ SELECT ts_lexize('hunspell_long', 'footballyklubber');
+
+ -- Test ISpell dictionary with hunspell affix file with FLAG num parameter
+ CREATE TEXT SEARCH DICTIONARY hunspell_num (
+ Template=ispell,
+ DictFile=hunspell_sample_num,
+ AffFile=hunspell_sample_num
+ );
+
+ SELECT ts_lexize('hunspell_num', 'skies');
+ SELECT ts_lexize('hunspell_num', 'bookings');
+ SELECT ts_lexize('hunspell_num', 'booking');
+ SELECT ts_lexize('hunspell_num', 'foot');
+ SELECT ts_lexize('hunspell_num', 'foots');
+ SELECT ts_lexize('hunspell_num', 'rebookings');
+ SELECT ts_lexize('hunspell_num', 'rebooking');
+ SELECT ts_lexize('hunspell_num', 'rebook');
+ SELECT ts_lexize('hunspell_num', 'unbookings');
+ SELECT ts_lexize('hunspell_num', 'unbooking');
+ SELECT ts_lexize('hunspell_num', 'unbook');
+
+ SELECT ts_lexize('hunspell_num', 'footklubber');
+ SELECT ts_lexize('hunspell_num', 'footballklubber');
+ SELECT ts_lexize('hunspell_num', 'ballyklubber');
+ SELECT ts_lexize('hunspell_num', 'footballyklubber');
+
-- Synonim dictionary
CREATE TEXT SEARCH DICTIONARY synonym (
Template=synonym,
***************
*** 94,99 **** SELECT to_tsvector('hunspell_tst', 'Booking the skies after rebookings for footb
--- 142,163 ----
SELECT to_tsquery('hunspell_tst', 'footballklubber');
SELECT to_tsquery('hunspell_tst', 'footballyklubber:b & rebookings:A & sky');
+ -- Test ispell dictionary with hunspell affix with FLAG long in configuration
+ ALTER TEXT SEARCH CONFIGURATION hunspell_tst ALTER MAPPING
+ REPLACE hunspell WITH hunspell_long;
+
+ SELECT to_tsvector('hunspell_tst', 'Booking the skies after rebookings for footballklubber from a foot');
+ SELECT to_tsquery('hunspell_tst', 'footballklubber');
+ SELECT to_tsquery('hunspell_tst', 'footballyklubber:b & rebookings:A & sky');
+
+ -- Test ispell dictionary with hunspell affix with FLAG num in configuration
+ ALTER TEXT SEARCH CONFIGURATION hunspell_tst ALTER MAPPING
+ REPLACE hunspell_long WITH hunspell_num;
+
+ SELECT to_tsvector('hunspell_tst', 'Booking the skies after rebookings for footballklubber from a foot');
+ SELECT to_tsquery('hunspell_tst', 'footballklubber');
+ SELECT to_tsquery('hunspell_tst', 'footballyklubber:b & rebookings:A & sky');
+
-- Test synonym dictionary in configuration
CREATE TEXT SEARCH CONFIGURATION synonym_tst (
COPY=english
On 16.02.2016 18:14, Artur Zakirov wrote:
I attached a new version of the patch.
Sorry for the noise. I have attached a new version of the patch. I found
mistakes in DecodeFlag(), and this version fixes them.
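
For readers following the thread, here is a minimal standalone sketch of how a flag set might be split into individual flags in each of the three flag modes (one character, two characters, comma-separated numbers). It is not the code from the patch: the names sketch_decode_flag, sketch_count_flags and SketchFlagMode are hypothetical, single-byte characters are assumed (the patch itself uses pg_mblen()), and numeric flags are limited to the range 1 to 65000 described for FLAG num.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical names; they only mirror the FM_CHAR/FM_LONG/FM_NUM modes. */
typedef enum
{
	SKETCH_FM_CHAR,
	SKETCH_FM_LONG,
	SKETCH_FM_NUM
} SketchFlagMode;

/*
 * Decodes the flag that starts at s and stores the start of the following
 * flag in *next. Returns the flag value, or -1 if the input is malformed.
 * Assumption: every character occupies exactly one byte.
 */
static long
sketch_decode_flag(SketchFlagMode mode, const char *s, const char **next)
{
	const unsigned char *u = (const unsigned char *) s;
	char	   *end;
	long		v;

	assert(s != NULL && next != NULL);

	switch (mode)
	{
		case SKETCH_FM_LONG:
			/* Two characters form one flag: high byte, then low byte */
			if (u[0] == '\0' || u[1] == '\0')
				return -1;
			*next = s + 2;
			return ((long) u[0] << 8) | (long) u[1];

		case SKETCH_FM_NUM:
			/* Decimal number, optionally followed by a comma */
			v = strtol(s, &end, 10);
			if (end == s || v < 1 || v > 65000)
				return -1;
			if (*end == ',')
				end++;
			else if (*end != '\0')
				return -1;
			*next = end;
			return v;

		default:
			/* SKETCH_FM_CHAR: one character is one flag */
			if (u[0] == '\0')
				return -1;
			*next = s + 1;
			return (long) u[0];
	}
}

/* Counts the flags in a flag set; returns -1 if the set is malformed. */
static int
sketch_count_flags(SketchFlagMode mode, const char *set)
{
	int			n = 0;

	while (*set != '\0')
	{
		if (sketch_decode_flag(mode, set, &set) < 0)
			return -1;
		n++;
	}
	return n;
}
```

With this sketch, the set "ABCD" would hold four flags in the one-character mode, "ABCDE*" three flags in the two-character mode, and "200,205,50" three flags in the numeric mode, matching the examples in the comment above DecodeFlag().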
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachments:
hunspell_dict_v7.patch (text/x-patch)
*** a/doc/src/sgml/textsearch.sgml
--- b/doc/src/sgml/textsearch.sgml
***************
*** 2615,2632 **** SELECT plainto_tsquery('supernova star');
</para>
<para>
! To create an <application>Ispell</> dictionary, use the built-in
! <literal>ispell</literal> template and specify several parameters:
</para>
!
<programlisting>
! CREATE TEXT SEARCH DICTIONARY english_ispell (
TEMPLATE = ispell,
! DictFile = english,
! AffFile = english,
! StopWords = english
! );
</programlisting>
<para>
Here, <literal>DictFile</>, <literal>AffFile</>, and <literal>StopWords</>
--- 2615,2655 ----
</para>
<para>
! To create an <application>Ispell</> dictionary perform these steps:
</para>
! <itemizedlist spacing="compact" mark="bullet">
! <listitem>
! <para>
! download dictionary configuration files. <productname>OpenOffice</>
! extension files have the <filename>.oxt</> extension. Extract the
! <filename>.aff</> and <filename>.dic</> files and change their
! extensions to <filename>.affix</> and <filename>.dict</>. Some
! dictionary files must also be converted to the UTF-8 encoding, for
! example, for a Norwegian language dictionary:
<programlisting>
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
! </programlisting>
! </para>
! </listitem>
! <listitem>
! <para>
! copy files to the <filename>$SHAREDIR/tsearch_data</> directory
! </para>
! </listitem>
! <listitem>
! <para>
! load files into PostgreSQL with the following command:
! <programlisting>
! CREATE TEXT SEARCH DICTIONARY english_hunspell (
TEMPLATE = ispell,
! DictFile = en_us,
! AffFile = en_us,
! Stopwords = english);
</programlisting>
+ </para>
+ </listitem>
+ </itemizedlist>
<para>
Here, <literal>DictFile</>, <literal>AffFile</>, and <literal>StopWords</>
***************
*** 2643,2648 **** CREATE TEXT SEARCH DICTIONARY english_ispell (
--- 2666,2721 ----
</para>
<para>
+ The <filename>.affix</> file of <application>Ispell</> has the following
+ structure:
+ <programlisting>
+ prefixes
+ flag *A:
+ . > RE # As in enter > reenter
+ suffixes
+ flag T:
+ E > ST # As in late > latest
+ [^AEIOU]Y > -Y,IEST # As in dirty > dirtiest
+ [AEIOU]Y > EST # As in gray > grayest
+ [^EY] > EST # As in small > smallest
+ </programlisting>
+ </para>
+ <para>
+ And the <filename>.dict</> file has the following structure:
+ <programlisting>
+ lapse/ADGRS
+ lard/DGRS
+ large/PRTY
+ lark/MRS
+ </programlisting>
+ </para>
+
+ <para>
+ Format of the <filename>.dict</> file is:
+ <programlisting>
+ basic_form/affix_class_name
+ </programlisting>
+ </para>
+
+ <para>
+ In the <filename>.affix</> file every affix flag is described in the
+ following format:
+ <programlisting>
+ condition > [-stripping_letters,] adding_affix
+ </programlisting>
+ </para>
+
+ <para>
+ Here, condition has a format similar to the format of regular expressions.
+ It can use groupings <literal>[...]</> and <literal>[^...]</>.
+ For example, <literal>[AEIOU]Y</> means that the last letter of the word
+ is <literal>"y"</> and the penultimate letter is <literal>"a"</>,
+ <literal>"e"</>, <literal>"i"</>, <literal>"o"</> or <literal>"u"</>.
+ <literal>[^EY]</> means that the last letter is neither <literal>"e"</>
+ nor <literal>"y"</>.
+ </para>
+
+ <para>
Ispell dictionaries support splitting compound words;
a useful feature.
Notice that the affix file should specify a special flag using the
***************
*** 2663,2668 **** SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
--- 2736,2800 ----
</programlisting>
</para>
+ <para>
+ <application>MySpell</> is very similar to <application>Hunspell</>.
+ The <filename>.affix</> file of <application>Hunspell</> has the following
+ structure:
+ <programlisting>
+ PFX A Y 1
+ PFX A 0 re .
+ SFX T N 4
+ SFX T 0 st e
+ SFX T y iest [^aeiou]y
+ SFX T 0 est [aeiou]y
+ SFX T 0 est [^ey]
+ </programlisting>
+ </para>
+
+ <para>
+ The first line of an affix class is the header. Fields of an affix rule are
+ listed after the header:
+ </para>
+ <itemizedlist spacing="compact" mark="bullet">
+ <listitem>
+ <para>
+ parameter name (PFX or SFX)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ flag (name of the affix class)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ characters to strip from the beginning (for a prefix) or the end (for a
+ suffix) of the word
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ affix to add
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ condition that has a format similar to the format of regular expressions.
+ </para>
+ </listitem>
+ </itemizedlist>
+
+ <para>
+ The <filename>.dict</> file looks like the <filename>.dict</> file of
+ <application>Ispell</>:
+ <programlisting>
+ larder/M
+ lardy/RT
+ large/RSPMYT
+ largehearted
+ </programlisting>
+ </para>
+
<note>
<para>
<application>MySpell</> does not support compound words.
*** a/src/backend/tsearch/Makefile
--- b/src/backend/tsearch/Makefile
***************
*** 13,20 **** include $(top_builddir)/src/Makefile.global
DICTDIR=tsearch_data
! DICTFILES=synonym_sample.syn thesaurus_sample.ths hunspell_sample.affix \
! ispell_sample.affix ispell_sample.dict
OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \
dict_simple.o dict_synonym.o dict_thesaurus.o \
--- 13,23 ----
DICTDIR=tsearch_data
! DICTFILES=dicts/synonym_sample.syn dicts/thesaurus_sample.ths \
! dicts/hunspell_sample.affix \
! dicts/ispell_sample.affix dicts/ispell_sample.dict \
! dicts/hunspell_sample_long.affix dicts/hunspell_sample_long.dict \
! dicts/hunspell_sample_num.affix dicts/hunspell_sample_num.dict
OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \
dict_simple.o dict_synonym.o dict_thesaurus.o \
*** /dev/null
--- b/src/backend/tsearch/dicts/hunspell_sample.affix
***************
*** 0 ****
--- 1,24 ----
+ COMPOUNDFLAG Z
+ ONLYINCOMPOUND L
+
+ PFX B Y 1
+ PFX B 0 re .
+
+ PFX U N 1
+ PFX U 0 un .
+
+ SFX J Y 1
+ SFX J 0 INGS [^E]
+
+ SFX G Y 1
+ SFX G 0 ING [^E]
+
+ SFX S Y 1
+ SFX S 0 S [^SXZHY]
+
+ SFX A Y 1
+ SFX A Y IES [^AEIOU]Y
+
+ SFX \ N 1
+ SFX \ 0 Y/L [^Y]
+
*** /dev/null
--- b/src/backend/tsearch/dicts/hunspell_sample_long.affix
***************
*** 0 ****
--- 1,35 ----
+ FLAG long
+
+ AF 7
+ AF cZ #1
+ AF cL #2
+ AF sGsJpUsS #3
+ AF sSpB #4
+ AF cZsS #5
+ AF sScZs\ #6
+ AF sA #7
+
+ COMPOUNDFLAG cZ
+ ONLYINCOMPOUND cL
+
+ PFX pB Y 1
+ PFX pB 0 re .
+
+ PFX pU N 1
+ PFX pU 0 un .
+
+ SFX sJ Y 1
+ SFX sJ 0 INGS [^E]
+
+ SFX sG Y 1
+ SFX sG 0 ING [^E]
+
+ SFX sS Y 1
+ SFX sS 0 S [^SXZHY]
+
+ SFX sA Y 1
+ SFX sA Y IES [^AEIOU]Y
+
+ SFX s\ N 1
+ SFX s\ 0 Y/2 [^Y]
+
*** /dev/null
--- b/src/backend/tsearch/dicts/hunspell_sample_long.dict
***************
*** 0 ****
--- 1,8 ----
+ book/3
+ booking/4
+ footballklubber
+ foot/5
+ football/1
+ ball/6
+ klubber/1
+ sky/7
*** /dev/null
--- b/src/backend/tsearch/dicts/hunspell_sample_num.affix
***************
*** 0 ****
--- 1,26 ----
+ FLAG num
+
+ COMPOUNDFLAG 101
+ ONLYINCOMPOUND 102
+
+ PFX 201 Y 1
+ PFX 201 0 re .
+
+ PFX 202 N 1
+ PFX 202 0 un .
+
+ SFX 301 Y 1
+ SFX 301 0 INGS [^E]
+
+ SFX 302 Y 1
+ SFX 302 0 ING [^E]
+
+ SFX 303 Y 1
+ SFX 303 0 S [^SXZHY]
+
+ SFX 304 Y 1
+ SFX 304 Y IES [^AEIOU]Y
+
+ SFX 305 N 1
+ SFX 305 0 Y/102 [^Y]
+
*** /dev/null
--- b/src/backend/tsearch/dicts/hunspell_sample_num.dict
***************
*** 0 ****
--- 1,8 ----
+ book/302,301,202,303
+ booking/303,201
+ footballklubber
+ foot/101,303
+ football/101
+ ball/303,101,305
+ klubber/101
+ sky/304
*** /dev/null
--- b/src/backend/tsearch/dicts/ispell_sample.affix
***************
*** 0 ****
--- 1,26 ----
+ compoundwords controlled Z
+
+ prefixes
+
+ flag *B:
+ . > RE # As in enter > reenter
+
+ flag U:
+ . > UN # As in natural > unnatural
+
+ suffixes
+
+ flag *J:
+ [^E] > INGS # As in cross > crossings
+
+ flag *G:
+ [^E] > ING # As in cross > crossing
+
+ flag *S:
+ [^SXZHY] > S # As in bat > bats
+
+ flag *A:
+ [^AEIOU]Y > -Y,IES # As in imply > implies
+
+ flag ~\\:
+ [^Y] > Y #~ advarsel > advarsely-
*** /dev/null
--- b/src/backend/tsearch/dicts/ispell_sample.dict
***************
*** 0 ****
--- 1,8 ----
+ book/GJUS
+ booking/SB
+ footballklubber
+ foot/ZS
+ football/Z
+ ball/SZ\
+ klubber/Z
+ sky/A
*** /dev/null
--- b/src/backend/tsearch/dicts/synonym_sample.syn
***************
*** 0 ****
--- 1,5 ----
+ postgres pgsql
+ postgresql pgsql
+ postgre pgsql
+ gogle googl
+ indices index*
*** /dev/null
--- b/src/backend/tsearch/dicts/thesaurus_sample.ths
***************
*** 0 ****
--- 1,17 ----
+ #
+ # Theasurus config file. Character ':' separates string from replacement, eg
+ # sample-words : substitute-words
+ #
+ # Any substitute-word can be marked by preceding '*' character,
+ # which means do not lexize this word
+ # Docs: http://www.sai.msu.su/~megera/oddmuse/index.cgi/Thesaurus_dictionary
+
+ one two three : *123
+ one two : *12
+ one : *1
+ two : *2
+
+ supernovae stars : *sn
+ supernovae : *sn
+ booking tickets : order invitation cards
+ booking ? tickets : order invitation Cards
*** a/src/backend/tsearch/hunspell_sample.affix
--- /dev/null
***************
*** 1,24 ****
- COMPOUNDFLAG Z
- ONLYINCOMPOUND L
-
- PFX B Y 1
- PFX B 0 re .
-
- PFX U N 1
- PFX U 0 un .
-
- SFX J Y 1
- SFX J 0 INGS [^E]
-
- SFX G Y 1
- SFX G 0 ING [^E]
-
- SFX S Y 1
- SFX S 0 S [^SXZHY]
-
- SFX A Y 1
- SFX A Y IES [^AEIOU]Y
-
- SFX \ N 1
- SFX \ 0 Y/L [^Y]
-
--- 0 ----
*** a/src/backend/tsearch/ispell_sample.affix
--- /dev/null
***************
*** 1,26 ****
- compoundwords controlled Z
-
- prefixes
-
- flag *B:
- . > RE # As in enter > reenter
-
- flag U:
- . > UN # As in natural > unnatural
-
- suffixes
-
- flag *J:
- [^E] > INGS # As in cross > crossings
-
- flag *G:
- [^E] > ING # As in cross > crossing
-
- flag *S:
- [^SXZHY] > S # As in bat > bats
-
- flag *A:
- [^AEIOU]Y > -Y,IES # As in imply > implies
-
- flag ~\\:
- [^Y] > Y #~ advarsel > advarsely-
--- 0 ----
*** a/src/backend/tsearch/ispell_sample.dict
--- /dev/null
***************
*** 1,8 ****
- book/GJUS
- booking/SB
- footballklubber
- foot/ZS
- football/Z
- ball/SZ\
- klubber/Z
- sky/A
--- 0 ----
*** a/src/backend/tsearch/spell.c
--- b/src/backend/tsearch/spell.c
***************
*** 5,10 ****
--- 5,58 ----
*
* Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
*
+ * Ispell dictionary
+ * -----------------
+ *
+ * Rules of dictionaries are defined in two files with .affix and .dict
+ * extensions. They are used by spell checker programs Ispell and Hunspell.
+ *
+ * An .affix file declares morphological rules to get a basic form of words.
+ * The format of an .affix file has different structure for Ispell and Hunspell
+ * dictionaries. The Hunspell format is more complicated. But when an .affix
+ * file is imported and compiled, it is stored in the same structure AffixNode.
+ *
+ * A .dict file stores a list of basic forms of words with references to
+ * affix rules. The format of a .dict file has the same structure for Ispell
+ * and Hunspell dictionaries.
+ *
+ * Compilation of a dictionary
+ * ---------------------------
+ *
+ * A compiled dictionary is stored in the IspellDict structure. Compilation of
+ * a dictionary is divided into several steps:
+ * - NIImportDictionary() - stores each word of a .dict file in the
+ * temporary Spell field.
+ * - NIImportAffixes() - stores affix rules of an .affix file in the
+ * Affix field (not temporary) if an .affix file has the Ispell format.
+ * -> NIImportOOAffixes() - stores affix rules if an .affix file has the
+ * Hunspell format. The AffixData field is initialized if AF parameter
+ * is defined.
+ * - NISortDictionary() - builds a prefix tree (Trie) from the words list
+ * and stores it in the Dictionary field. The words list is taken from the
+ * Spell field. The AffixData field is initialized if AF parameter is not
+ * defined.
+ * - NISortAffixes():
+ * - builds a list of compound affixes from the affix list and stores it
+ * in the CompoundAffix field.
+ * - builds prefix trees (Trie) from the affix list for prefixes and suffixes
+ * and stores them in Suffix and Prefix fields.
+ * The affix list is taken from the Affix field.
+ *
+ * Memory management
+ * -----------------
+ *
+ * The IspellDict structure has the Spell field which is used only at compile
+ * time. The Spell field stores a words list. It can take a lot of memory.
+ * Therefore when a dictionary is compiled this field is cleared by
+ * NIFinishBuild().
+ *
+ * All resources which should be cleared by NIFinishBuild() are allocated using
+ * tmpalloc() and tmpalloc0().
*
* IDENTIFICATION
* src/backend/tsearch/spell.c
***************
*** 153,159 **** cmpspell(const void *s1, const void *s2)
static int
cmpspellaffix(const void *s1, const void *s2)
{
! return (strncmp((*(SPELL *const *) s1)->p.flag, (*(SPELL *const *) s2)->p.flag, MAXFLAGLEN));
}
static char *
--- 201,208 ----
static int
cmpspellaffix(const void *s1, const void *s2)
{
! return (strcmp((*(SPELL *const *) s1)->p.flag,
! (*(SPELL *const *) s2)->p.flag));
}
static char *
***************
*** 220,225 **** strbncmp(const unsigned char *s1, const unsigned char *s2, size_t count)
--- 269,279 ----
return 0;
}
+ /*
+ * Compares affixes.
+ * First compares the type of an affix. Prefixes should go before suffixes.
+ * If the types are equal, then compares the replaceable strings.
+ */
static int
cmpaffix(const void *s1, const void *s2)
{
***************
*** 237,242 **** cmpaffix(const void *s1, const void *s2)
--- 291,425 ----
(const unsigned char *) a2->repl);
}
+ /*
+ * Gets an affix flag from string representation (a set of affixes).
+ *
+ * Several flags can be stored in a single string. Flags can be represented by:
+ * - 1 character (FM_CHAR).
+ * - 2 characters (FM_LONG).
+ * - numbers from 1 to 65000 (FM_NUM).
+ *
+ * Depending on the flagMode an affix string can have the following format:
+ * - FM_CHAR: ABCD
+ * Here we have 4 flags: A, B, C and D
+ * - FM_LONG: ABCDE*
+ * Here we have 3 flags: AB, CD and E*
+ * - FM_NUM: 200,205,50
+ * Here we have 3 flags: 200, 205 and 50
+ *
+ * Conf: current dictionary.
+ * sflag: string representation (a set of affixes) of an affix flag.
+ * sflagnext: receives a pointer to the start of the next affix flag in sflag.
+ *
+ * Returns an integer representation of the affix flag.
+ */
+ static unsigned short
+ DecodeFlag(IspellDict *Conf, char *sflag, char **sflagnext)
+ {
+ int64 s;
+ char *next;
+
+ switch (Conf->flagMode)
+ {
+ case FM_LONG:
+ s = (int)(((unsigned char *)sflag)[0]) << 8
+ | (int)(((unsigned char *)sflag)[1]);
+ if (sflagnext)
+ /* Go to start of the next flag */
+ *sflagnext = sflag + pg_mblen(sflag) * 2;
+ break;
+ case FM_NUM:
+ s = strtol(sflag, &next, 10);
+ if (s >= FLAGNUM_MAXSIZE)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONFIG_FILE_ERROR),
+ errmsg("invalid affix flag \"%s\"", sflag)));
+
+ if (sflagnext)
+ {
+ /* Go to start of the next flag */
+ if (next)
+ {
+ *sflagnext = next;
+ while (**sflagnext)
+ {
+ if (**sflagnext == ',')
+ {
+ /* Found start of the next flag */
+ *sflagnext += pg_mblen(*sflagnext);
+ break;
+ }
+ *sflagnext += pg_mblen(*sflagnext);
+ }
+ }
+ else
+ *sflagnext = 0;
+ }
+ break;
+ default:
+ s = (int64) *((unsigned char *)sflag);
+ if (s >= FLAGCHAR_MAXSIZE)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONFIG_FILE_ERROR),
+ errmsg("invalid affix flag \"%s\"", sflag)));
+
+ if (sflagnext)
+ /* Go to start of the next flag */
+ *sflagnext = sflag + pg_mblen(sflag);
+ }
+
+ return s;
+ }
+
+ /*
+ * Checks if the affix set Conf->AffixData[affix] contains affixflag.
+ * Conf->AffixData[affix] is the string representation of a set of affix flags.
+ * Conf->AffixData[affix] does not contain affixflag if this flag is not
+ * actually used by the .dict file.
+ *
+ * Conf: current dictionary.
+ * affix: index of the Conf->AffixData array.
+ * affixflag: integer representation of the affix flag.
+ *
+ * Returns true if the string Conf->AffixData[affix] contains affixflag,
+ * otherwise returns false.
+ */
+ static bool
+ IsAffixFlagInUse(IspellDict *Conf, int affix, unsigned short affixflag)
+ {
+ char *flagcur;
+ char *flagnext = 0;
+
+ if (affixflag == 0)
+ return true;
+
+ flagcur = Conf->AffixData[affix];
+
+ while (*flagcur)
+ {
+ /* Compare first affix flag in flagcur with affixflag */
+ if (DecodeFlag(Conf, flagcur, &flagnext) == affixflag)
+ return true;
+ /* Otherwise go to next flag */
+ if (flagnext)
+ flagcur = flagnext;
/* If there are no more flags then exit */
+ else
+ break;
+ }
+
+ /* Could not find affixflag */
+ return false;
+ }
+
+ /*
+ * Adds the new word into the temporary array Spell.
+ *
+ * Conf: current dictionary.
+ * word: new word.
+ * flag: set of affix flags. An integer representation of a flag can be
+ * obtained with DecodeFlag().
+ */
static void
NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
{
***************
*** 255,268 **** NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
}
Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
strcpy(Conf->Spell[Conf->nspell]->word, word);
! strlcpy(Conf->Spell[Conf->nspell]->p.flag, flag, MAXFLAGLEN);
Conf->nspell++;
}
/*
! * import dictionary
*
! * Note caller must already have applied get_tsearch_config_filename
*/
void
NIImportDictionary(IspellDict *Conf, const char *filename)
--- 438,455 ----
}
Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
strcpy(Conf->Spell[Conf->nspell]->word, word);
! Conf->Spell[Conf->nspell]->p.flag = (*flag != '\0')
! ? cpstrdup(Conf, flag) : VoidString;
Conf->nspell++;
}
/*
! * Imports dictionary into the temporary array Spell.
*
! * Note caller must already have applied get_tsearch_config_filename.
! *
! * Conf: current dictionary.
! * filename: path to the .dict file.
*/
void
NIImportDictionary(IspellDict *Conf, const char *filename)
***************
*** 280,285 **** NIImportDictionary(IspellDict *Conf, const char *filename)
--- 467,473 ----
{
char *s,
*pstr;
+ /* Set of affix flags */
const char *flag;
/* Extract flag from the line */
***************
*** 324,330 **** NIImportDictionary(IspellDict *Conf, const char *filename)
tsearch_readline_end(&trst);
}
!
static int
FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
{
--- 512,541 ----
tsearch_readline_end(&trst);
}
! /*
! * Searches for the base form of a word in the prefix tree. This base form was
! * produced by applying an affix rule. This rule may not be present in the
! * affix set of the base form.
! *
! * For example, we have the entry in the .dict file:
! * meter/GMD
! *
! * The affix rule with the flag S:
! * SFX S y ies [^aeiou]y
! * is not present here.
! *
! * The affix rule with the flag M:
! * SFX M 0 's .
! * is present here.
! *
! * Conf: current dictionary.
! * word: base form of the word.
! * affixflag: integer representation of the affix flag by which the base form
! * of the word was generated.
! * flag: compound flag used to compare with StopMiddle->compoundflag.
! *
! * Returns 1 if the word was found in the prefix tree, else returns 0.
! */
static int
FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
{
***************
*** 349,361 **** FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
{
if (flag == 0)
{
if (StopMiddle->compoundflag & FF_COMPOUNDONLY)
return 0;
}
else if ((flag & StopMiddle->compoundflag) == 0)
return 0;
! if ((affixflag == 0) || (strchr(Conf->AffixData[StopMiddle->affix], affixflag) != NULL))
return 1;
}
node = StopMiddle->node;
--- 560,581 ----
{
if (flag == 0)
{
+ /*
+ * The word can be formed only together with another word,
+ * and the flag parameter does not indicate that we are
+ * searching for compound words.
+ */
if (StopMiddle->compoundflag & FF_COMPOUNDONLY)
return 0;
}
else if ((flag & StopMiddle->compoundflag) == 0)
return 0;
! /*
! * Check if this affix rule is present in the affix set
! * with index StopMiddle->affix.
! */
! if (IsAffixFlagInUse(Conf, StopMiddle->affix, affixflag))
return 1;
}
node = StopMiddle->node;
***************
*** 373,378 **** FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
--- 593,616 ----
return 0;
}
+ /*
+ * Adds a new affix rule to the Affix field.
+ *
+ * Conf: current dictionary.
+ * flag: integer representation of the affix flag ('\' in the below example).
+ * flagflags: set of flags from the flagval field for this affix rule. This set
+ * is listed after '/' character in the added string (repl).
+ *
+ * For example L flag in the hunspell_sample.affix:
+ * SFX \ 0 Y/L [^Y]
+ *
+ * mask: condition for search ('[^Y]' in the above example).
+ * find: characters to strip from the beginning (for a prefix) or the end (for
+ * a suffix) of the word ('0' in the above example; 0 means that nothing
+ * is stripped).
+ * repl: string to add after stripping ('Y' in the above example).
+ * type: FF_SUFFIX or FF_PREFIX.
+ */
static void
NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const char *find, const char *repl, int type)
{
***************
*** 394,411 **** NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
Affix = Conf->Affix + Conf->naffixes;
! if (strcmp(mask, ".") == 0)
{
Affix->issimple = 1;
Affix->isregis = 0;
}
else if (RS_isRegis(mask))
{
Affix->issimple = 0;
Affix->isregis = 1;
! RS_compile(&(Affix->reg.regis), (type == FF_SUFFIX) ? true : false,
*mask ? mask : VoidString);
}
else
{
int masklen;
--- 632,652 ----
Affix = Conf->Affix + Conf->naffixes;
! /* This affix rule can be applied for words with any ending */
! if (strcmp(mask, ".") == 0 || *mask == '\0')
{
Affix->issimple = 1;
Affix->isregis = 0;
}
+ /* This affix rule will use regis to search word ending */
else if (RS_isRegis(mask))
{
Affix->issimple = 0;
Affix->isregis = 1;
! RS_compile(&(Affix->reg.regis), (type == FF_SUFFIX),
*mask ? mask : VoidString);
}
+ /* This affix rule will use regex_t to search word ending */
else
{
int masklen;
***************
*** 457,463 **** NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
Conf->naffixes++;
}
-
/* Parsing states for parse_affentry() and friends */
#define PAE_WAIT_MASK 0
#define PAE_INMASK 1
--- 698,703 ----
***************
*** 712,720 **** parse_affentry(char *str, char *mask, char *find, char *repl)
*pmask = *pfind = *prepl = '\0';
! return (*mask && (*find || *repl)) ? true : false;
}
static void
addFlagValue(IspellDict *Conf, char *s, uint32 val)
{
--- 952,967 ----
*pmask = *pfind = *prepl = '\0';
! return (*mask && (*find || *repl));
}
+ /*
+ * Associates an affix parameter with an affix flag.
+ *
+ * Conf: current dictionary.
+ * s: affix flag as a string.
+ * val: affix parameter.
+ */
static void
addFlagValue(IspellDict *Conf, char *s, uint32 val)
{
***************
*** 731,742 **** addFlagValue(IspellDict *Conf, char *s, uint32 val)
(errcode(ERRCODE_CONFIG_FILE_ERROR),
errmsg("multibyte flag character is not allowed")));
! Conf->flagval[*(unsigned char *) s] = (unsigned char) val;
Conf->usecompound = true;
}
/*
! * Import an affix file that follows MySpell or Hunspell format
*/
static void
NIImportOOAffixes(IspellDict *Conf, const char *filename)
--- 978,1043 ----
(errcode(ERRCODE_CONFIG_FILE_ERROR),
errmsg("multibyte flag character is not allowed")));
! Conf->flagval[DecodeFlag(Conf, s, (char **)NULL)] = (unsigned char) val;
Conf->usecompound = true;
}
/*
! * Returns a set of affix parameters which correspond to the set of affix
! * flags s.
! */
! static int
! getFlagValues(IspellDict *Conf, char *s)
! {
! uint32 flag = 0;
! char *flagcur;
! char *flagnext = 0;
!
! flagcur = s;
! while (*flagcur)
! {
! flag |= Conf->flagval[DecodeFlag(Conf, flagcur, &flagnext)];
! if (flagnext)
! flagcur = flagnext;
! else
! break;
! }
!
! return flag;
! }
!
! /*
! * Returns a flag set using the s parameter.
! *
! * If Conf->useFlagAliases is true then the s parameter is an index into the
! * Conf->AffixData array and the function returns that entry.
! * Otherwise the function returns the s parameter itself.
! */
! static char *
! getFlags(IspellDict *Conf, char *s)
! {
! int curaffix;
! if (Conf->useFlagAliases)
! {
! curaffix = strtol(s, (char **)NULL, 10);
! if (curaffix && curaffix <= Conf->nAffixData)
! /*
! * Do not subtract 1 from curaffix
! * because an empty string was added in NIImportOOAffixes
! */
! return Conf->AffixData[curaffix];
! else
! return VoidString;
! }
! else
! return s;
! }
!
! /*
! * Import an affix file that follows MySpell or Hunspell format.
! *
! * Conf: current dictionary.
! * filename: path to the .affix file.
*/
static void
NIImportOOAffixes(IspellDict *Conf, const char *filename)
***************
*** 751,757 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
char repl[BUFSIZ],
*prepl;
bool isSuffix = false;
! int flag = 0;
char flagflags = 0;
tsearch_readline_state trst;
char *recoded;
--- 1052,1061 ----
char repl[BUFSIZ],
*prepl;
bool isSuffix = false;
! int naffix = 0,
! curaffix = 0;
! int flag = 0,
! sflaglen = 0;
char flagflags = 0;
tsearch_readline_state trst;
char *recoded;
***************
*** 759,764 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
--- 1063,1070 ----
/* read file to find any flag */
memset(Conf->flagval, 0, sizeof(Conf->flagval));
Conf->usecompound = false;
+ Conf->useFlagAliases = false;
+ Conf->flagMode = FM_CHAR;
if (!tsearch_readline_begin(&trst, filename))
ereport(ERROR,
***************
*** 806,815 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
while (*s && t_isspace(s))
s += pg_mblen(s);
! if (*s && STRNCMP(s, "default") != 0)
! ereport(ERROR,
(errcode(ERRCODE_CONFIG_FILE_ERROR),
! errmsg("Ispell dictionary supports only default flag value")));
}
pfree(recoded);
--- 1112,1129 ----
while (*s && t_isspace(s))
s += pg_mblen(s);
! if (*s)
! {
! if (STRNCMP(s, "long") == 0)
! Conf->flagMode = FM_LONG;
! else if (STRNCMP(s, "num") == 0)
! Conf->flagMode = FM_NUM;
! else if (STRNCMP(s, "default") != 0)
! ereport(ERROR,
(errcode(ERRCODE_CONFIG_FILE_ERROR),
! errmsg("Ispell dictionary supports only default, "
! "long and num flag value")));
! }
}
pfree(recoded);
***************
*** 834,860 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
if (ptype)
pfree(ptype);
ptype = lowerstr_ctx(Conf, type);
if (fields_read < 4 ||
(STRNCMP(ptype, "sfx") != 0 && STRNCMP(ptype, "pfx") != 0))
goto nextline;
if (fields_read == 4)
{
! if (strlen(sflag) != 1)
! goto nextline;
! flag = *sflag;
! isSuffix = (STRNCMP(ptype, "sfx") == 0) ? true : false;
if (t_iseq(find, 'y') || t_iseq(find, 'Y'))
flagflags = FF_CROSSPRODUCT;
else
flagflags = 0;
}
else
{
char *ptr;
int aflg = 0;
! if (strlen(sflag) != 1 || flag != *sflag || flag == 0)
goto nextline;
prepl = lowerstr_ctx(Conf, repl);
/* Find position of '/' in lowercased string "prepl" */
--- 1148,1224 ----
if (ptype)
pfree(ptype);
ptype = lowerstr_ctx(Conf, type);
+
+ /* First try to parse AF parameter (alias compression) */
+ if (STRNCMP(ptype, "af") == 0)
+ {
+ /* First line is the number of aliases */
+ if (!Conf->useFlagAliases)
+ {
+ Conf->useFlagAliases = true;
+ naffix = atoi(sflag);
+ if (naffix == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONFIG_FILE_ERROR),
+ errmsg("invalid number of flag vector aliases")));
+
+ /* Also reserve place for empty flag set */
+ naffix++;
+
+ Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
+ Conf->lenAffixData = Conf->nAffixData = naffix;
+
+ /* Add empty flag set into AffixData */
+ Conf->AffixData[curaffix] = VoidString;
+ curaffix++;
+ }
+ /* Other lines are aliases */
+ else
+ {
+ if (curaffix < naffix)
+ {
+ Conf->AffixData[curaffix] = cpstrdup(Conf, sflag);
+ curaffix++;
+ }
+ }
+ goto nextline;
+ }
+ /* Else try to parse prefixes and suffixes */
if (fields_read < 4 ||
(STRNCMP(ptype, "sfx") != 0 && STRNCMP(ptype, "pfx") != 0))
goto nextline;
+ sflaglen = strlen(sflag);
+ if (sflaglen == 0
+ || (sflaglen > 1 && Conf->flagMode == FM_CHAR)
+ || (sflaglen > 2 && Conf->flagMode == FM_LONG))
+ goto nextline;
+
+ /*
+ * Affix header. For example:
+ * SFX \ N 1
+ */
if (fields_read == 4)
{
! /* Convert the affix flag to int */
! flag = DecodeFlag(Conf, sflag, (char **)NULL);
!
! isSuffix = (STRNCMP(ptype, "sfx") == 0);
if (t_iseq(find, 'y') || t_iseq(find, 'Y'))
flagflags = FF_CROSSPRODUCT;
else
flagflags = 0;
}
+ /*
+ * Affix fields. For example:
+ * SFX \ 0 Y/L [^Y]
+ */
else
{
char *ptr;
int aflg = 0;
! if (flag == 0)
goto nextline;
prepl = lowerstr_ctx(Conf, repl);
/* Find position of '/' in lowercased string "prepl" */
***************
*** 866,876 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
*/
*ptr = '\0';
ptr = repl + (ptr - prepl) + 1;
! while (*ptr)
! {
! aflg |= Conf->flagval[*(unsigned char *) ptr];
! ptr++;
! }
}
pfind = lowerstr_ctx(Conf, find);
pmask = lowerstr_ctx(Conf, mask);
--- 1230,1236 ----
*/
*ptr = '\0';
ptr = repl + (ptr - prepl) + 1;
! aflg |= getFlagValues(Conf, getFlags(Conf, ptr));
}
pfind = lowerstr_ctx(Conf, find);
pmask = lowerstr_ctx(Conf, mask);
***************
*** 928,933 **** NIImportAffixes(IspellDict *Conf, const char *filename)
--- 1288,1295 ----
memset(Conf->flagval, 0, sizeof(Conf->flagval));
Conf->usecompound = false;
+ Conf->useFlagAliases = false;
+ Conf->flagMode = FM_CHAR;
while ((recoded = tsearch_readline(&trst)) != NULL)
{
***************
*** 1044,1049 **** isnewformat:
--- 1406,1417 ----
NIImportOOAffixes(Conf, filename);
}
+ /*
+ * Merges two affix flag sets and stores a new affix flag set into
+ * Conf->AffixData.
+ *
+ * Returns index of a new affix flag set.
+ */
static int
MergeAffix(IspellDict *Conf, int a1, int a2)
{
***************
*** 1068,1088 **** MergeAffix(IspellDict *Conf, int a1, int a2)
return Conf->nAffixData - 1;
}
static uint32
makeCompoundFlags(IspellDict *Conf, int affix)
{
! uint32 flag = 0;
! char *str = Conf->AffixData[affix];
!
! while (str && *str)
! {
! flag |= Conf->flagval[*(unsigned char *) str];
! str++;
! }
!
! return (flag & FF_DICTFLAGMASK);
}
static SPNode *
mkSPNode(IspellDict *Conf, int low, int high, int level)
{
--- 1436,1460 ----
return Conf->nAffixData - 1;
}
+ /*
+ * Returns a set of affix parameters which correspond to the set of affix
+ * flags with the given index.
+ */
static uint32
makeCompoundFlags(IspellDict *Conf, int affix)
{
! char *str = Conf->AffixData[affix];
! return (getFlagValues(Conf, str) & FF_DICTFLAGMASK);
}
+ /*
+ * Makes a prefix tree for the given level.
+ *
+ * Conf: current dictionary.
+ * low: lower index of the Conf->Spell array.
+ * high: upper index of the Conf->Spell array.
+ * level: current prefix tree level.
+ */
static SPNode *
mkSPNode(IspellDict *Conf, int low, int high, int level)
{
***************
*** 1115,1120 **** mkSPNode(IspellDict *Conf, int low, int high, int level)
--- 1487,1493 ----
{
if (lastchar)
{
+ /* Next level of the prefix tree */
data->node = mkSPNode(Conf, lownew, i, level + 1);
lownew = i;
data++;
***************
*** 1154,1159 **** mkSPNode(IspellDict *Conf, int low, int high, int level)
--- 1527,1533 ----
}
}
+ /* Next level of the prefix tree */
data->node = mkSPNode(Conf, lownew, high, level + 1);
return rs;
***************
*** 1172,1215 **** NISortDictionary(IspellDict *Conf)
/* compress affixes */
- /* Count the number of different flags used in the dictionary */
-
- qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspellaffix);
-
- naffix = 0;
- for (i = 0; i < Conf->nspell; i++)
- {
- if (i == 0 || strncmp(Conf->Spell[i]->p.flag, Conf->Spell[i - 1]->p.flag, MAXFLAGLEN))
- naffix++;
- }
-
/*
! * Fill in Conf->AffixData with the affixes that were used in the
! * dictionary. Replace textual flag-field of Conf->Spell entries with
! * indexes into Conf->AffixData array.
*/
! Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
!
! curaffix = -1;
! for (i = 0; i < Conf->nspell; i++)
{
! if (i == 0 || strncmp(Conf->Spell[i]->p.flag, Conf->AffixData[curaffix], MAXFLAGLEN))
{
! curaffix++;
! Assert(curaffix < naffix);
! Conf->AffixData[curaffix] = cpstrdup(Conf, Conf->Spell[i]->p.flag);
}
-
- Conf->Spell[i]->p.d.affix = curaffix;
- Conf->Spell[i]->p.d.len = strlen(Conf->Spell[i]->word);
}
! Conf->lenAffixData = Conf->nAffixData = naffix;
qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspell);
Conf->Dictionary = mkSPNode(Conf, 0, Conf->nspell, 0);
}
static AffixNode *
mkANode(IspellDict *Conf, int low, int high, int level, int type)
{
--- 1546,1628 ----
/* compress affixes */
/*
! * If we use flag aliases then we need to use Conf->AffixData filled
! * in the NIImportOOAffixes().
*/
! if (Conf->useFlagAliases)
{
! for (i = 0; i < Conf->nspell; i++)
{
! curaffix = strtol(Conf->Spell[i]->p.flag, (char **)NULL, 10);
! if (curaffix && curaffix <= Conf->nAffixData)
! Conf->Spell[i]->p.d.affix = curaffix;
! else
! /*
! * If Conf->Spell[i]->p.flag is empty, then get empty value of
! * Conf->AffixData (0 index).
! */
! Conf->Spell[i]->p.d.affix = 0;
! Conf->Spell[i]->p.d.len = strlen(Conf->Spell[i]->word);
}
}
+ /* Otherwise fill Conf->AffixData here */
+ else
+ {
+ /* Count the number of different flags used in the dictionary */
+ qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *),
+ cmpspellaffix);
+
+ naffix = 0;
+ for (i = 0; i < Conf->nspell; i++)
+ {
+ if (i == 0
+ || strcmp(Conf->Spell[i]->p.flag, Conf->Spell[i - 1]->p.flag))
+ naffix++;
+ }
! /*
! * Fill in Conf->AffixData with the affixes that were used in the
! * dictionary. Replace textual flag-field of Conf->Spell entries with
! * indexes into Conf->AffixData array.
! */
! Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
!
! curaffix = -1;
! for (i = 0; i < Conf->nspell; i++)
! {
! if (i == 0
! || strcmp(Conf->Spell[i]->p.flag, Conf->AffixData[curaffix]))
! {
! curaffix++;
! Assert(curaffix < naffix);
! Conf->AffixData[curaffix] = cpstrdup(Conf,
! Conf->Spell[i]->p.flag);
! }
!
! Conf->Spell[i]->p.d.affix = curaffix;
! Conf->Spell[i]->p.d.len = strlen(Conf->Spell[i]->word);
! }
!
! Conf->lenAffixData = Conf->nAffixData = naffix;
! }
+ /* Start building the prefix tree */
qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspell);
Conf->Dictionary = mkSPNode(Conf, 0, Conf->nspell, 0);
}
+ /*
+ * Makes a prefix tree for the given level using the repl string of an affix
+ * rule. Affixes with an empty replace string are not included in the prefix
+ * tree here. Those affixes are added by mkVoidAffix().
+ *
+ * Conf: current dictionary.
+ * low: lower index of the Conf->Affix array.
+ * high: upper index of the Conf->Affix array.
+ * level: current prefix tree level.
+ * type: FF_SUFFIX or FF_PREFIX.
+ */
static AffixNode *
mkANode(IspellDict *Conf, int low, int high, int level, int type)
{
***************
*** 1247,1252 **** mkANode(IspellDict *Conf, int low, int high, int level, int type)
--- 1660,1666 ----
{
if (lastchar)
{
+ /* Next level of the prefix tree */
data->node = mkANode(Conf, lownew, i, level + 1, type);
if (naff)
{
***************
*** 1267,1272 **** mkANode(IspellDict *Conf, int low, int high, int level, int type)
--- 1681,1687 ----
}
}
+ /* Next level of the prefix tree */
data->node = mkANode(Conf, lownew, high, level + 1, type);
if (naff)
{
***************
*** 1281,1286 **** mkANode(IspellDict *Conf, int low, int high, int level, int type)
--- 1696,1705 ----
return rs;
}
+ /*
+ * Makes the root void node in the prefix tree. The root void node is created
+ * for affixes which have an empty replace string ("repl" field).
+ */
static void
mkVoidAffix(IspellDict *Conf, bool issuffix, int startsuffix)
{
***************
*** 1304,1314 **** mkVoidAffix(IspellDict *Conf, bool issuffix, int startsuffix)
Conf->Prefix = Affix;
}
!
for (i = start; i < end; i++)
if (Conf->Affix[i].replen == 0)
cnt++;
if (cnt == 0)
return;
--- 1723,1734 ----
Conf->Prefix = Affix;
}
! /* Count affixes with empty replace string */
for (i = start; i < end; i++)
if (Conf->Affix[i].replen == 0)
cnt++;
+ /* There are no affixes with an empty replace string */
if (cnt == 0)
return;
***************
*** 1324,1341 **** mkVoidAffix(IspellDict *Conf, bool issuffix, int startsuffix)
}
}
static bool
! isAffixInUse(IspellDict *Conf, char flag)
{
int i;
for (i = 0; i < Conf->nAffixData; i++)
! if (strchr(Conf->AffixData[i], flag) != NULL)
return true;
return false;
}
void
NISortAffixes(IspellDict *Conf)
{
--- 1744,1774 ----
}
}
+ /*
+ * Checks if the affixflag is used by the dictionary. Conf->AffixData does not
+ * contain affixflag if this flag is not actually used by the .dict file.
+ *
+ * Conf: current dictionary.
+ * affixflag: integer representation of the affix flag.
+ *
+ * Returns true if the Conf->AffixData array contains affixflag, otherwise
+ * returns false.
+ */
static bool
! isAffixInUse(IspellDict *Conf, unsigned short affixflag)
{
int i;
for (i = 0; i < Conf->nAffixData; i++)
! if (IsAffixFlagInUse(Conf, i, affixflag))
return true;
return false;
}
+ /*
+ * Builds Conf->Prefix and Conf->Suffix trees from the imported affixes.
+ */
void
NISortAffixes(IspellDict *Conf)
{
***************
*** 1347,1352 **** NISortAffixes(IspellDict *Conf)
--- 1780,1786 ----
if (Conf->naffixes == 0)
return;
+ /* Store compound affixes in the Conf->CompoundAffix array */
if (Conf->naffixes > 1)
qsort((void *) Conf->Affix, Conf->naffixes, sizeof(AFFIX), cmpaffix);
Conf->CompoundAffix = ptr = (CMPDAffix *) palloc(sizeof(CMPDAffix) * Conf->naffixes);
***************
*** 1359,1365 **** NISortAffixes(IspellDict *Conf)
firstsuffix = i;
if ((Affix->flagflags & FF_COMPOUNDFLAG) && Affix->replen > 0 &&
! isAffixInUse(Conf, (char) Affix->flag))
{
if (ptr == Conf->CompoundAffix ||
ptr->issuffix != (ptr - 1)->issuffix ||
--- 1793,1799 ----
firstsuffix = i;
if ((Affix->flagflags & FF_COMPOUNDFLAG) && Affix->replen > 0 &&
! isAffixInUse(Conf, Affix->flag))
{
if (ptr == Conf->CompoundAffix ||
ptr->issuffix != (ptr - 1)->issuffix ||
***************
*** 1370,1376 **** NISortAffixes(IspellDict *Conf)
/* leave only unique and minimals suffixes */
ptr->affix = Affix->repl;
ptr->len = Affix->replen;
! ptr->issuffix = (Affix->type == FF_SUFFIX) ? true : false;
ptr++;
}
}
--- 1804,1810 ----
/* leave only unique and minimals suffixes */
ptr->affix = Affix->repl;
ptr->len = Affix->replen;
! ptr->issuffix = (Affix->type == FF_SUFFIX);
ptr++;
}
}
***************
*** 1378,1383 **** NISortAffixes(IspellDict *Conf)
--- 1812,1818 ----
ptr->affix = NULL;
Conf->CompoundAffix = (CMPDAffix *) repalloc(Conf->CompoundAffix, sizeof(CMPDAffix) * (ptr - Conf->CompoundAffix + 1));
+ /* Start building the prefix trees */
Conf->Prefix = mkANode(Conf, 0, firstsuffix, 0, FF_PREFIX);
Conf->Suffix = mkANode(Conf, firstsuffix, Conf->naffixes, 0, FF_SUFFIX);
mkVoidAffix(Conf, true, firstsuffix);
***************
*** 1825,1831 **** SplitToVariants(IspellDict *Conf, SPNode *snode, SplitVar *orig, char *word, int
if (StopLow < StopHigh)
{
! if (level == FF_COMPOUNDBEGIN)
compoundflag = FF_COMPOUNDBEGIN;
else if (level == wordlen - 1)
compoundflag = FF_COMPOUNDLAST;
--- 2260,2266 ----
if (StopLow < StopHigh)
{
! if (startpos == 0)
compoundflag = FF_COMPOUNDBEGIN;
else if (level == wordlen - 1)
compoundflag = FF_COMPOUNDLAST;
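
For readers of the spell.c changes above, the three flag encodings that DecodeFlag() distinguishes can be sketched outside the backend roughly as follows. This is a simplified standalone illustration and not the code of the patch: decode_flag() and count_flags() are hypothetical names, multibyte handling via pg_mblen() is left out, and malformed input returns -1 instead of raising an error with ereport().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Mirrors the FlagMode enum added to spell.h by the patch */
typedef enum { FM_CHAR, FM_LONG, FM_NUM } FlagMode;

/*
 * Hypothetical helper: decodes the first flag of the flag set s and stores
 * the start of the following flag in *next.  Returns -1 on malformed input.
 * Assumption: single-byte characters only (no pg_mblen()).
 */
static int
decode_flag(FlagMode mode, const char *s, const char **next)
{
	const unsigned char *u = (const unsigned char *) s;
	char	   *end;
	long		v;

	assert(s != NULL && next != NULL);

	switch (mode)
	{
		case FM_LONG:
			/* A long flag is exactly two bytes: "AB" -> ('A' << 8) | 'B' */
			if (u[0] == '\0' || u[1] == '\0')
				return -1;
			*next = s + 2;
			return (u[0] << 8) | u[1];
		case FM_NUM:
			/* A numeric flag is a decimal number; flags are comma-separated */
			v = strtol(s, &end, 10);
			if (end == s || v <= 0 || v >= 65536)
				return -1;
			if (*end == ',')
				end++;
			*next = end;
			return (int) v;
		default:
			/* FM_CHAR: one byte per flag */
			if (u[0] == '\0')
				return -1;
			*next = s + 1;
			return u[0];
	}
}

/* Hypothetical helper: counts the flags in a flag set, -1 if malformed */
static int
count_flags(FlagMode mode, const char *set)
{
	int			n = 0;

	while (*set)
	{
		if (decode_flag(mode, set, &set) < 0)
			return -1;
		n++;
	}
	return n;
}
```

With the example strings from the DecodeFlag() comment, count_flags(FM_CHAR, "ABCD") gives 4 flags, count_flags(FM_LONG, "ABCDE*") gives 3 and count_flags(FM_NUM, "200,205,50") gives 3. Unlike this sketch, the patch itself reports invalid flags through ereport() rather than a return code.
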
*** a/src/backend/tsearch/synonym_sample.syn
--- /dev/null
***************
*** 1,5 ****
- postgres pgsql
- postgresql pgsql
- postgre pgsql
- gogle googl
- indices index*
--- 0 ----
*** a/src/backend/tsearch/thesaurus_sample.ths
--- /dev/null
***************
*** 1,17 ****
- #
- # Theasurus config file. Character ':' separates string from replacement, eg
- # sample-words : substitute-words
- #
- # Any substitute-word can be marked by preceding '*' character,
- # which means do not lexize this word
- # Docs: http://www.sai.msu.su/~megera/oddmuse/index.cgi/Thesaurus_dictionary
-
- one two three : *123
- one two : *12
- one : *1
- two : *2
-
- supernovae stars : *sn
- supernovae : *sn
- booking tickets : order invitation cards
- booking ? tickets : order invitation Cards
--- 0 ----
*** a/src/include/tsearch/dicts/spell.h
--- b/src/include/tsearch/dicts/spell.h
***************
*** 19,36 ****
#include "tsearch/ts_public.h"
/*
! * Max length of a flag name. Names longer than this will be truncated
! * to the maximum.
*/
- #define MAXFLAGLEN 16
-
struct SPNode;
typedef struct
{
uint32 val:8,
isword:1,
compoundflag:4,
affix:19;
struct SPNode *node;
} SPNodeData;
--- 19,36 ----
#include "tsearch/ts_public.h"
/*
! * SPNode and SPNodeData are used to represent a prefix tree (trie) which
! * stores a word list.
*/
struct SPNode;
typedef struct
{
uint32 val:8,
isword:1,
+ /* Stores compound flags listed below */
compoundflag:4,
+ /* Reference to an entry of the AffixData field */
affix:19;
struct SPNode *node;
} SPNodeData;
***************
*** 43,49 **** typedef struct
#define FF_COMPOUNDBEGIN 0x02
#define FF_COMPOUNDMIDDLE 0x04
#define FF_COMPOUNDLAST 0x08
! #define FF_COMPOUNDFLAG ( FF_COMPOUNDBEGIN | FF_COMPOUNDMIDDLE | FF_COMPOUNDLAST )
#define FF_DICTFLAGMASK 0x0f
typedef struct SPNode
--- 43,50 ----
#define FF_COMPOUNDBEGIN 0x02
#define FF_COMPOUNDMIDDLE 0x04
#define FF_COMPOUNDLAST 0x08
! #define FF_COMPOUNDFLAG ( FF_COMPOUNDBEGIN | FF_COMPOUNDMIDDLE | \
! FF_COMPOUNDLAST )
#define FF_DICTFLAGMASK 0x0f
typedef struct SPNode
***************
*** 54,72 **** typedef struct SPNode
#define SPNHDRSZ (offsetof(SPNode,data))
!
typedef struct spell_struct
{
union
{
/*
! * flag is filled in by NIImportDictionary. After NISortDictionary, d
! * is valid and flag is invalid.
*/
! char flag[MAXFLAGLEN];
struct
{
int affix;
int len;
} d;
} p;
--- 55,78 ----
#define SPNHDRSZ (offsetof(SPNode,data))
! /*
! * Represents an entry in a word list.
! */
typedef struct spell_struct
{
union
{
/*
! * flag is filled in by NIImportDictionary(). After NISortDictionary(),
! * d is used instead of flag.
*/
! char *flag;
! /* d is used in mkSPNode() */
struct
{
+ /* Reference to an entry of the AffixData field */
int affix;
+ /* Length of the word */
int len;
} d;
} p;
***************
*** 75,84 **** typedef struct spell_struct
#define SPELLHDRSZ (offsetof(SPELL, word))
typedef struct aff_struct
{
! uint32 flag:8,
! type:1,
flagflags:7,
issimple:1,
isregis:1,
--- 81,94 ----
#define SPELLHDRSZ (offsetof(SPELL, word))
+ /*
+ * Represents an entry in an affix list.
+ */
typedef struct aff_struct
{
! uint32 flag:16;
! /* FF_SUFFIX or FF_PREFIX */
! uint32 type:1,
flagflags:7,
issimple:1,
isregis:1,
***************
*** 106,111 **** typedef struct aff_struct
--- 116,125 ----
#define FF_SUFFIX 1
#define FF_PREFIX 0
+ /*
+ * AffixNode and AffixNodeData are used to represent a prefix tree (trie) which
+ * stores an affix list.
+ */
struct AffixNode;
typedef struct
***************
*** 132,137 **** typedef struct
--- 146,161 ----
bool issuffix;
} CMPDAffix;
+ typedef enum
+ {
+ FM_CHAR,
+ FM_LONG,
+ FM_NUM
+ } FlagMode;
+
+ #define FLAGCHAR_MAXSIZE (1 << 8)
+ #define FLAGNUM_MAXSIZE (1 << 16)
+
typedef struct
{
int maffixes;
***************
*** 142,155 **** typedef struct
AffixNode *Prefix;
SPNode *Dictionary;
char **AffixData;
int lenAffixData;
int nAffixData;
CMPDAffix *CompoundAffix;
! unsigned char flagval[256];
bool usecompound;
/*
* Remaining fields are only used during dictionary construction; they are
--- 166,182 ----
AffixNode *Prefix;
SPNode *Dictionary;
+ /* Array of sets of affixes */
char **AffixData;
int lenAffixData;
int nAffixData;
+ bool useFlagAliases;
CMPDAffix *CompoundAffix;
! unsigned char flagval[FLAGNUM_MAXSIZE];
bool usecompound;
+ FlagMode flagMode;
/*
* Remaining fields are only used during dictionary construction; they are
*** a/src/test/regress/expected/tsdicts.out
--- b/src/test/regress/expected/tsdicts.out
***************
*** 191,196 **** SELECT ts_lexize('hunspell', 'footballyklubber');
--- 191,388 ----
{foot,ball,klubber}
(1 row)
+ -- Test ISpell dictionary with hunspell affix file with FLAG long parameter
+ CREATE TEXT SEARCH DICTIONARY hunspell_long (
+ Template=ispell,
+ DictFile=hunspell_sample_long,
+ AffFile=hunspell_sample_long
+ );
+ SELECT ts_lexize('hunspell_long', 'skies');
+ ts_lexize
+ -----------
+ {sky}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'bookings');
+ ts_lexize
+ ----------------
+ {booking,book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'booking');
+ ts_lexize
+ ----------------
+ {booking,book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'foot');
+ ts_lexize
+ -----------
+ {foot}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'foots');
+ ts_lexize
+ -----------
+ {foot}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'rebookings');
+ ts_lexize
+ ----------------
+ {booking,book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'rebooking');
+ ts_lexize
+ ----------------
+ {booking,book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'rebook');
+ ts_lexize
+ -----------
+
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'unbookings');
+ ts_lexize
+ -----------
+ {book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'unbooking');
+ ts_lexize
+ -----------
+ {book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'unbook');
+ ts_lexize
+ -----------
+ {book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'footklubber');
+ ts_lexize
+ ----------------
+ {foot,klubber}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'footballklubber');
+ ts_lexize
+ ------------------------------------------------------
+ {footballklubber,foot,ball,klubber,football,klubber}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'ballyklubber');
+ ts_lexize
+ ----------------
+ {ball,klubber}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_long', 'footballyklubber');
+ ts_lexize
+ ---------------------
+ {foot,ball,klubber}
+ (1 row)
+
+ -- Test ISpell dictionary with hunspell affix file with FLAG num parameter
+ CREATE TEXT SEARCH DICTIONARY hunspell_num (
+ Template=ispell,
+ DictFile=hunspell_sample_num,
+ AffFile=hunspell_sample_num
+ );
+ SELECT ts_lexize('hunspell_num', 'skies');
+ ts_lexize
+ -----------
+ {sky}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'bookings');
+ ts_lexize
+ ----------------
+ {booking,book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'booking');
+ ts_lexize
+ ----------------
+ {booking,book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'foot');
+ ts_lexize
+ -----------
+ {foot}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'foots');
+ ts_lexize
+ -----------
+ {foot}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'rebookings');
+ ts_lexize
+ ----------------
+ {booking,book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'rebooking');
+ ts_lexize
+ ----------------
+ {booking,book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'rebook');
+ ts_lexize
+ -----------
+
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'unbookings');
+ ts_lexize
+ -----------
+ {book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'unbooking');
+ ts_lexize
+ -----------
+ {book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'unbook');
+ ts_lexize
+ -----------
+ {book}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'footklubber');
+ ts_lexize
+ ----------------
+ {foot,klubber}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'footballklubber');
+ ts_lexize
+ ------------------------------------------------------
+ {footballklubber,foot,ball,klubber,football,klubber}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'ballyklubber');
+ ts_lexize
+ ----------------
+ {ball,klubber}
+ (1 row)
+
+ SELECT ts_lexize('hunspell_num', 'footballyklubber');
+ ts_lexize
+ ---------------------
+ {foot,ball,klubber}
+ (1 row)
+
-- Synonim dictionary
CREATE TEXT SEARCH DICTIONARY synonym (
Template=synonym,
***************
*** 277,282 **** SELECT to_tsquery('hunspell_tst', 'footballyklubber:b & rebookings:A & sky');
--- 469,516 ----
'foot':B & 'ball':B & 'klubber':B & ( 'booking':A | 'book':A ) & 'sky'
(1 row)
+ -- Test ispell dictionary with hunspell affix with FLAG long in configuration
+ ALTER TEXT SEARCH CONFIGURATION hunspell_tst ALTER MAPPING
+ REPLACE hunspell WITH hunspell_long;
+ SELECT to_tsvector('hunspell_tst', 'Booking the skies after rebookings for footballklubber from a foot');
+ to_tsvector
+ ----------------------------------------------------------------------------------------------------
+ 'ball':7 'book':1,5 'booking':1,5 'foot':7,10 'football':7 'footballklubber':7 'klubber':7 'sky':3
+ (1 row)
+
+ SELECT to_tsquery('hunspell_tst', 'footballklubber');
+ to_tsquery
+ ------------------------------------------------------------------------------
+ ( 'footballklubber' | 'foot' & 'ball' & 'klubber' ) | 'football' & 'klubber'
+ (1 row)
+
+ SELECT to_tsquery('hunspell_tst', 'footballyklubber:b & rebookings:A & sky');
+ to_tsquery
+ ------------------------------------------------------------------------
+ 'foot':B & 'ball':B & 'klubber':B & ( 'booking':A | 'book':A ) & 'sky'
+ (1 row)
+
+ -- Test ispell dictionary with hunspell affix with FLAG num in configuration
+ ALTER TEXT SEARCH CONFIGURATION hunspell_tst ALTER MAPPING
+ REPLACE hunspell_long WITH hunspell_num;
+ SELECT to_tsvector('hunspell_tst', 'Booking the skies after rebookings for footballklubber from a foot');
+ to_tsvector
+ ----------------------------------------------------------------------------------------------------
+ 'ball':7 'book':1,5 'booking':1,5 'foot':7,10 'football':7 'footballklubber':7 'klubber':7 'sky':3
+ (1 row)
+
+ SELECT to_tsquery('hunspell_tst', 'footballklubber');
+ to_tsquery
+ ------------------------------------------------------------------------------
+ ( 'footballklubber' | 'foot' & 'ball' & 'klubber' ) | 'football' & 'klubber'
+ (1 row)
+
+ SELECT to_tsquery('hunspell_tst', 'footballyklubber:b & rebookings:A & sky');
+ to_tsquery
+ ------------------------------------------------------------------------
+ 'foot':B & 'ball':B & 'klubber':B & ( 'booking':A | 'book':A ) & 'sky'
+ (1 row)
+
-- Test synonym dictionary in configuration
CREATE TEXT SEARCH CONFIGURATION synonym_tst (
COPY=english
*** a/src/test/regress/sql/tsdicts.sql
--- b/src/test/regress/sql/tsdicts.sql
***************
*** 48,53 **** SELECT ts_lexize('hunspell', 'footballklubber');
--- 48,101 ----
SELECT ts_lexize('hunspell', 'ballyklubber');
SELECT ts_lexize('hunspell', 'footballyklubber');
+ -- Test ISpell dictionary with hunspell affix file with FLAG long parameter
+ CREATE TEXT SEARCH DICTIONARY hunspell_long (
+ Template=ispell,
+ DictFile=hunspell_sample_long,
+ AffFile=hunspell_sample_long
+ );
+
+ SELECT ts_lexize('hunspell_long', 'skies');
+ SELECT ts_lexize('hunspell_long', 'bookings');
+ SELECT ts_lexize('hunspell_long', 'booking');
+ SELECT ts_lexize('hunspell_long', 'foot');
+ SELECT ts_lexize('hunspell_long', 'foots');
+ SELECT ts_lexize('hunspell_long', 'rebookings');
+ SELECT ts_lexize('hunspell_long', 'rebooking');
+ SELECT ts_lexize('hunspell_long', 'rebook');
+ SELECT ts_lexize('hunspell_long', 'unbookings');
+ SELECT ts_lexize('hunspell_long', 'unbooking');
+ SELECT ts_lexize('hunspell_long', 'unbook');
+
+ SELECT ts_lexize('hunspell_long', 'footklubber');
+ SELECT ts_lexize('hunspell_long', 'footballklubber');
+ SELECT ts_lexize('hunspell_long', 'ballyklubber');
+ SELECT ts_lexize('hunspell_long', 'footballyklubber');
+
+ -- Test ISpell dictionary with hunspell affix file with FLAG num parameter
+ CREATE TEXT SEARCH DICTIONARY hunspell_num (
+ Template=ispell,
+ DictFile=hunspell_sample_num,
+ AffFile=hunspell_sample_num
+ );
+
+ SELECT ts_lexize('hunspell_num', 'skies');
+ SELECT ts_lexize('hunspell_num', 'bookings');
+ SELECT ts_lexize('hunspell_num', 'booking');
+ SELECT ts_lexize('hunspell_num', 'foot');
+ SELECT ts_lexize('hunspell_num', 'foots');
+ SELECT ts_lexize('hunspell_num', 'rebookings');
+ SELECT ts_lexize('hunspell_num', 'rebooking');
+ SELECT ts_lexize('hunspell_num', 'rebook');
+ SELECT ts_lexize('hunspell_num', 'unbookings');
+ SELECT ts_lexize('hunspell_num', 'unbooking');
+ SELECT ts_lexize('hunspell_num', 'unbook');
+
+ SELECT ts_lexize('hunspell_num', 'footklubber');
+ SELECT ts_lexize('hunspell_num', 'footballklubber');
+ SELECT ts_lexize('hunspell_num', 'ballyklubber');
+ SELECT ts_lexize('hunspell_num', 'footballyklubber');
+
-- Synonim dictionary
CREATE TEXT SEARCH DICTIONARY synonym (
Template=synonym,
***************
*** 94,99 **** SELECT to_tsvector('hunspell_tst', 'Booking the skies after rebookings for footb
--- 142,163 ----
SELECT to_tsquery('hunspell_tst', 'footballklubber');
SELECT to_tsquery('hunspell_tst', 'footballyklubber:b & rebookings:A & sky');
+ -- Test ispell dictionary with hunspell affix with FLAG long in configuration
+ ALTER TEXT SEARCH CONFIGURATION hunspell_tst ALTER MAPPING
+ REPLACE hunspell WITH hunspell_long;
+
+ SELECT to_tsvector('hunspell_tst', 'Booking the skies after rebookings for footballklubber from a foot');
+ SELECT to_tsquery('hunspell_tst', 'footballklubber');
+ SELECT to_tsquery('hunspell_tst', 'footballyklubber:b & rebookings:A & sky');
+
+ -- Test ispell dictionary with hunspell affix with FLAG num in configuration
+ ALTER TEXT SEARCH CONFIGURATION hunspell_tst ALTER MAPPING
+ REPLACE hunspell_long WITH hunspell_num;
+
+ SELECT to_tsvector('hunspell_tst', 'Booking the skies after rebookings for footballklubber from a foot');
+ SELECT to_tsquery('hunspell_tst', 'footballklubber');
+ SELECT to_tsquery('hunspell_tst', 'footballyklubber:b & rebookings:A & sky');
+
-- Test synonym dictionary in configuration
CREATE TEXT SEARCH CONFIGURATION synonym_tst (
COPY=english