Sciences

Rating
Date

December 1969
Size

154.8KB
Views

627
Categories

Others

Preview only show first 6 pages with water mark for full document please download

Transcript

Lightweight Morphology: A Methodology for Improving Text Search TAPoR “The Face of Text” conference, Nov. 19-21, 2004, McMaster University, Hamilton, Ontario, Canada Mika¨l Roussillon Bradford G. Nickerson Stephen Green William A. Woods e Faculty of Computer Science University of New Brunswick PO Box 4400 Fredericton, N.B. E3B 5A3, CANADA {mikael.roussillon, bgn}@unb.ca Sun Microsystems Laboratories 1 Network Drive Burlington, MA 01803, USA {stephen.green, william.woods}@sun.com October 29, 2004 Abstract Lightweight Morphology is a new approach to morphological analysis, creating morphological variants from sets of rules. The rules are intuitive to deﬁne and at the same time, oﬀering expressiveness and control. We deﬁned a grammar for Lightweight Morphology. We deﬁned how to generate English and French morphological variants using the grammar. French Language speciﬁcation required 526 rules, 41 rule sets and 16,842 exception table words, while the English language speciﬁcation took 123 rules, 17 rule sets and 2,589 exception table words. English and French Lightweight Morphologies were compared with two other techniques extending queries on a collection of 533 documents from the aligned Hansard of the 36th parliament (1997) of Canada. A diﬀerential recall comparison among the techniques showed that Lightweight Morphology has more queries (average of 3.9 times more) retrieving fewer irrelevant document for both English and French. The French Lightweight Morphology has more queries (average of 2.5 times more) retrieving more relevant documents. 1 Introduction Interpretation of a text can be assisted by knowing the location and content of ‘like’ phrases and words using morphological variants. The challenge of using morphological variants is how to generate them. One can (a) have a complete lexicon for each language that includes all morphological variants, or (b) encode morphotactics as rules, so as to generate the variants of recognized morphological patterns [5, 8]. For example, in French, the ending ’erait’ as in ‘aimerait’ identiﬁes the conditional 3rd person singular of a ﬁrst group verb, and we can produce the morphological variants corresponding to the ﬁrst group verb paradigm, such as ‘aimer’, ‘aimes’, ‘aimions’ and many more. 1 Natural language specifications (rules) en fr ... es Existing words in the collection Indexed collection not in the collection not retrieved retrieved Query word Lightweight Morphology rule processor Lightweight Morphology engine Non−existing words Morphological variants not retrieved Filter Figure 1: Lightweight Morphology variants production with rule speciﬁcations and ﬁltering against the collection (en stands for English, fr for French, es for Spanish). 2 Lightweight Morphology Lightweight Morphology is a methodology that follows the second approach. This approach is space-eﬃcient and avoids the complexity of establishing a complete lexicon of all morphological variants. As a pre-processing step, Lightweight Morphology tries to expand the word into multiple morphological variants, therefore adding information to the word (see Figure 1). Lightweight Morphology consists of three components which are (1) a set of pattern-matching rules enclosed in rule sets that will produce morphological variants, (2) an escape mechanism to write rules in JavaTM programming language and (3) an exception table to handle exceptions that exist in a language. Lightweight Morphology has a modular approach that enables one to deﬁne morphologies for diﬀerent languages or to introduce new approaches on the way the morphological variants are to be produced. 2.1 Pattern-matching rules and rule sets A pattern-matching rule is made of the following elements: 1. On the left hand side of the rule, a pattern the word has to match in order to apply the rule. The pattern is deﬁned with regular expressions, and some mechanism is provided to interact with the right hand side. 2. On the right hand side of the rules, a list of morphological variations that are to be applied in order to obtain the variants. 3. The left hand side is separated from the right hand side with the ‘->’ production symbol. The operators of the pattern-matching rules allow to handle diﬀerent kinds of aﬃxes, like circumﬁxes and inﬁxes, either by removal from the input word or by addition to the produced variants. Diphthongs and spelling modiﬁcations can also be handled. Figure 2 presents a simple example of a pattern-matching rule for English processing words ending with a vowel followed by less. The word is stripped from less and the list of morphological variation from the right 2 hand side of the rule are appended to form the morphological variants (‘ ’ stands for the empty morphological variation). Figure 3 shows a more complex example to create the inﬁnitive forms for French verbs like assi´ger. It is supposed that the input to this rule is a verb stripped from e its ending. .aeiou + l e s s -> _,s,er,ers,est,ed,ing,ings,ly,ness,nesses,ment,ments,ful; Figure 2: Pattern-matching rule example for English words ending with less. .$Letter <´|`> g + ?e -> <´>/_er; e e e Figure 3: Pattern-matching rule example for French inﬁnitive form creation. The pattern-matching rules are themselves ordered within a rule set. The ﬁrst rule that will be successful in the rule set (i.e. the pattern matches the word and the right hand side produces variants) will return the morphological variants obtained and no more rules will be tried. Diﬀerent kind of rule sets can be used. The ﬁrst one is the default rule set, the one that will be ﬁrst tried when performing Lightweight Morphology. The ending rule set oﬀers a diﬀerent behaviour. It encloses rules that will apply on words having the same ending. Therefore, if an input word has an ending deﬁned by an ending rule set, this speciﬁc rule set will be tried instead of the default rule set. Figure 4 presents an ending rule set containing the only rule processing English words ending with -ical. The third kind of rule sets are JavaTM programming language rule sets and will be explained in the next section. The fourth kind is a rule set whose sole purpose is to group a set of pattern-matching rules. RULESET icalRules ENDING ical { .aeiouy + i c a l -> ic,ics,ically; // (e.g., academical, electrical, theatrical) } Figure 4: Example of ending rule set enclosing a rule for words ending in ical. Each rule set can be called from a pattern-matching rule, the word matched and modiﬁed by the pattern-matching rule being the input word to the rule set. The usefulness of rule sets is therefore to group similar processing, either because of their ending which represents a certain morphological feature (e.g. the -s ending for plural), or because of the same problem they solve (e.g. creating from a speciﬁc kind of verb the present forms). 2.2 User-deﬁned rules using JavaTM programming language Sometimes, pattern-matching is not enough to describe morphotactics and create morphological variants. For those cases, Lightweight Morphology oﬀers the possibility to write rules directly in JavaTM programming language to create user-deﬁned rules. These rules behave much like rule 3 sets since pattern-matching rules can call user-deﬁned rules and user-deﬁned rules can call rule sets. Diﬀerent user-deﬁned rules can co-exist by creating diﬀerent JavaTM programming language rule sets. 2.3 Exception table Every language has exceptions or irregularities, and some are so exceptional that they cannot be encoded as rules. Most of the time, those exceptions are intensively used words that earned a very irregular pattern over the centuries. Classical examples that apply to many languages are ‘to be’ and ‘to go’. Lightweight Morphology provides a mechanism to write this kind of exception. The exception table can be seen as a kind of lexicon where morphological inﬂections and derivations of a word a grouped together as an entry. Figure 5 presents a small portion of the exception table for English words. EXCEPTIONS { all; also; find, finds, found, finding, findings, finder, finders; found, founds, founded, founding, foundings, founder, founders; } Figure 5: Exception table for some English words. Before processing an input word with rules, Lightweight Morphology ﬁrst checks if the word appears in the exception table. If so, all entries containing the word are returned and no rules are tried. The exception table can also be used for words where rules produce incorrect morphological variations (morphological variations that are real words but are not related to the input word), or for words that do not require processing by rules, like adverbs. 3 Lightweight Morphology speciﬁcation The essential work of Lightweight Morphology is to deﬁne pattern-matching rules for natural languages. Depending on the language and the person creating the speciﬁcation, various approaches can be taken to construct a Lightweight Morphology speciﬁcation a natural language. We present two approaches, one for English and one for French. Use of grammatical ressources, like Bescherelle [1], Bled [2] or Grevisse [4] for French, is strongly suggested. 3.1 English The English Lightweight Morphology has 123 rules, 17 rule sets and 2,589 exception table words. The rule sets are ending rule sets, except for the default rule set. The processing is therefore performed depending on the ending of the input word. Example of endings processed are -s, 4 -less or -ing, handling noun and verbs inﬂection and some derivations. The default rule set only processes endings that are not deﬁned as an ending rule set. Heavy use of rule set calls is made, for example in some pattern-matching rules, words stripped from -s will go again through the rules as a new input word and create new morphological variants. The exception table is an important part of the English Lightweight Morphology. It encodes irregular verbs (e.g. ﬁnd) or words that fail to be processed by the rules (e.g. ﬁrmament). On the set of 200 English words processed by the English Lighweight Morphology, 4% produced irrelevant variants (e.g. billy from bill) and 16% are did not produce one or more important variant (e.g. legislate is not produced from legislation because of an unhandled derivation). 3.2 French The French Lightweight Morphology has 526 rules, 41 rule sets and 16,842 exception table words. All rule sets are normal rule sets, except for the default rule set. Each normal rule set will perform a speciﬁc grammatical feature. For example, a rule set is used to process masculine adjectives and nouns, while another rule set is used to create the inﬁnitive form for 1st group verbs. The default rule set dispatches the words to the appropriate rule set according to a pattern. Support for derivation for the suﬃxes -age, -ance, -ment, -eur, -ion, and -able is provided creating morphological variants from and to these suﬃxes. The exception table is extensively used to encode irregular verbs (e.g. aller, to go) and irregular form of adjectives (e.g. beau, beautiful) and nouns (e.g. carnaval, carnaval). The pattern-matching rules, along with the exception table, are supposed to handle inﬂections for verbs, nouns and adjectives. On a set of 200 French words, 5% produced incorrect variants (e.g. paye from pays), and 7,5% did not produce one or more important morphological variant (e.g. rapporter is not produced from rapport). 4 Testing We tested the Lightweight Morphology approach with 200 English query words and 200 French query words in the aligned Hansard of the 36th parliament of Canada [3] which consists of 533 documents (19,999,604 tokens for the English version, 22,801,063 tokens for the French version). The English version contains 55,323 terms and the French one 76,031 terms. By term we mean a token that does not contain a cipher. We compared Lightweight Morphology with two other search approaches: (a) stemming and (b) wildcards. For example the word ‘academic’ produces ‘academics’, ‘academical’, ‘academically’ for Lightweight Morphology and ‘academ’ for the Porter stemmer. A good wildcard query would be ‘academ*’ or ‘academic*’. We included the number of documents retrieved with exact query (no pre-processing) as a reference measure, because it is common among search engines. The Porter stemmer [6] algorithm is used for English, a stemmer created by Martin Porter [7] is used for French. For the French Lightweight Morphology, we produced two sets of rules: (1) one that performs inﬂections and derivations and (2) one that only performs inﬂections. For each language, 100 query words are randomly selected from an Ispell dictionary. If a word does not have a hit in the collection with one of the approaches, it is discarded. The remaining 100 query words are frequency selected from the Hansard. We sorted the words of the collection 5 by frequency and selected the ﬁrst 100 meaningful words. We consider that a meaningful word is a word that is a uncommon word. Common words can be an article (e.g. the, a), a common verb (be, go) or any word that would not make much sense in a query as a single word (e.g. however, ﬁrst). To evaluate Lightweight Morphology against the other approaches, we used a qualitative measure known as diﬀerential recall. In diﬀerential recall, we compare two methods A and B and calculate: • A ∩ B — The number of relevant documents found with variants from both A and B; • A − B (resp. B − A) — The number of relevant documents found by variants from A (resp. B) but not from B (resp. A). We decided that our relevance criterion would be the correctness of the variants found (e.g. for the query ‘tear’, we consider ‘torn’ to be relevant but not ‘tearmann’), without consideration for the context (e.g. we consider ‘tear’ to be relevant both in the context of a teardrop and of something that is torn). For the 200 words of each language, we manually classiﬁed the variants produced by the three approaches as either relevant or irrelevant according to our relevance criterion. 5 Results The following abbreviations will be used: LM (Lightweight Morphology), LM’ (Lightweight Morphology without derivation processing), S (Stemmer), W (Wildcard) and EQ (Exact Query). 5.1 Relevance classiﬁcation and diﬀerential recall Some results for three selected English query words are presented in Tables 1, 2 and 31 . The results present a good sample of words retrieved by the approaches. The relevance/irrelevance classiﬁcation is not perfect in part because of the collection that contains typos, spelling errors, neologisms and foreign words (many French words in the English Hansard and vice-versa). Judging a word to be relevant is often subjective, and our classiﬁcation can be contested for some queries. For those few contestable classiﬁcations, we believe the impact on the diﬀerential recall is minor. 5.2 Diﬀerential recall win-lose score In order to compare diﬀerential recall among all the words and all the approaches, we introduced a win-lose score, based on the ‘diﬀerence’ score (A − B and B − A) from diﬀerential recall measures. For the relevance criterion, if A − B > B − A for a query, A is awarded 1 point (A is retrieving more relevant document than B on the query). For the irrelevance criterion, if A − B > B − A for a query, B is awarded 1 point (B is retrieving less irrelevant documents than A). The results of the win-lose scoring is given in Table 4 for the randomly selected words and in Table 5 for the frequency selected words. 6 Table 1: Variants found with Lightweight Morphology, Stemming and Wildcard. Query Approach LM S W LM S W LM tear(85) (tear*) S W Example of English variants found in Hansard (number of documents found with this variant, relevance) artist(85, rel), artists(136, rel) artist(85, rel), artists(136, rel), artistes(2, rel), artistic(46, rel) artist(85, rel), artists(136, rel), artistes(2, rel), artistic(46, rel), artistically(2, rel), artistique(1, rel), artistry(4, rel) leaf(99, rel), leafs(21, rel), leaﬁng(2, rel), leaves(299, rel) leaf(99, rel), leafs(21, rel), leaﬁng(2, rel) leaf(99, rel), leafs(21, rel), leaﬁng(2, rel), leaﬂet(3, irrel), leaﬂets(5, irrel) tear(85, rel), tears(100, rel), tore(17, rel), torn(97, rel), tearing(40, rel) tear(85, rel), tears(100, rel), teared(1, rel), tearful(2, rel), tearing(40, rel) tear(85, rel), tears(100, rel), teared(1, rel), tearful(2, rel), tearfully(1, rel), tearing(40, rel), tearmann(1, irrel) artist(85) (artist*) leaf(99) (leaf*) Table 2: Diﬀerential recall of Lightweight Morphology vs. Stemming for English. Query Word artist leaf tear LM − S 0 223 55 LM ∩ S 153 110 182 S − LM 16 0 1 Table 3: Diﬀerential recall of Lightweight Morphology vs. Wildcard for English. Query Word artist (artist*) leaf (leaf*) tear (tear*) LM − W 0 223 55 LM ∩ W 153 110 182 W − LM 19 0 1 7 Table 4: Diﬀerential recall Win-Lose scores for 100 randomly selected words. Language English English French French Criterion Relevant Irrelevant Relevant Irrelevant LM - S 24 4 35 5 24 1 13 2 LM’ - S N N 25 5 A A 36 1 LM - W 8 39 11 26 44 0 33 1 LM - EQ 82 0 90 0 0 1 0 2 S - EQ 81 0 93 0 0 4 0 5 Table 5: Diﬀerential recall Win-Lose scores for 100 frequency selected words. Language English English French French Criterion Relevant Irrelevant Relevant Irrelevant LM - S 11 21 17 20 17 4 7 5 LM’ - S N N 12 19 A A 13 3 LM - W 8 63 23 50 33 2 14 7 LM - EQ 71 0 73 0 0 9 0 11 S - EQ 73 0 70 0 0 25 0 23 First, we can see the usefulness of morphological analysis by comparing LM to EQ and to W queries. Compared to EQ, LM provides more relevant documents on 70 to 90 queries but introduces irrelevant documents on 1 to 11 queries. Most of the time, performing morphological analysis on terms will introduce relevant documents, and in a few cases, irrelevant documents. When comparing to W, we can see the advantage of performing morphological analysis: the user does not need to think about the query. Wildcard queries generate comparatively more irrelevant words. For example, pity creates the query pit* so it will retrieve unrelated words like pitbull. It is easy to often retrieve more relevant documents with a straightforward wildcard query; it is also easy to retrieve more irrelevant documents. Morphological analysis has a strong advantage since the user does not need to think about putting wildcard operators at the right place to ensure retrieval of many related and few unrelated documents. Compared to S, English LM is better, not always by retrieving more relevant documents but by always retrieving fewer irrelevant documents. We can see that for English, LM has a small advantage over S on the ﬁrst query set by retrieving fewer irrelevant documents in 4 queries and more in only 1. On the second query set, S retrieves more relevant documents on 17 queries (11 for LM). LM retrieves fewer irrelevant documents: in 21 queries, S retrieves more irrelevant documents and retrieves less irrelevant documents in only 4 queries. The Porter Stemmer for example, found the stem intern from international allowing variants such as internalize or interned. This illustrates that English LM can improve information retrieval systems by decreasing the number of irrelevant documents retrieved on a per query basis. French LM performs better than S in every case, getting more queries retrieving more relevant documents and more queries retrieving fewer irrelevant documents. French Lightweight As documents can be retrieved by more than one variant, results from Table 1 do not necessarily translate to the numbers in Tables 2 and 3 1 8 Morphology does not ﬁnd verbs from nouns ending with a consonant (e.g. accorder from accord or travailler from travail) reducing its relevant retrieval eﬃciency. Processing derivations in French is important. Ignoring the suﬃx derivations decreases the relevant documents retrieved, making LM’ less eﬃcient than S for returning relevant documents. The number of queries retrieving fewer irrelevant documents is almost unchanged, illustrating that processing derivation does not appear to include spurious variants. 6 Conclusion Lightweight Morphology is a new approach on morphological analysis oﬀering more control than stemming over the variants retrieved. The way Lightweight Morphology is deﬁned is intuitive compared to the way stemming algorithms are deﬁned. The consequence is that English Lightweight Morphology will outperform stemming by retrieving less irrelevant documents on a per query basis. Stemming has a small advantage at retrieving more relevant documents on a per query basis, essentially because the English Lightweight Morphology does not process some derivations. The French Lightweight Morphology was superior in both having more queries retrieving more relevant documents and fewer irrelevant documents. Future research for Lightweight Morphology involves creating speciﬁcations for other natural languages and seeing where to improve the limits of pattern-matching rules. JavaTM programming language rules should be reworked and replaced by a speciﬁc language performing advanced string manipulation. Finally, other tests should be performed on other collections and with other query words. 7 Acknowledgements Sun Microsystems Inc. is gratefully acknowledged for providing funding for this research. Thanks are also due to the University of New Brunswick Faculty of Computer Science for their support. Experiments for this research were carried out on equipment established with funding from the Canada Foundation for Innovation (TAPoR project). References [1] Bescherelle : La Conjugaison pour tous. Hatier, Paris, 1997. [2] E. Bled and O. Bled. Bled : orthographe-grammaire. Hachette, Paris, 2003. [3] Ulrich Germann. Aligned Hansards of the 36th parliament of Canada release 2001-1a (proceedings from September 25, 1997). Resources available at http://www.isi.edu/ natural-language/download/hansard/. [4] M. Grevisse. Le Bon Usage: Grammaire fran¸aise avec des remarques sur la langue fran¸aise c c d’aujourd’hui. Duculot, Paris, 11`me edition, 1980. e 9 [5] R. Krovetz. Viewing morphology as an inference process. In Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Linguistic Analysis, pages 191–202, 1993. [6] M. F. Porter. An algorithm for suﬃx stripping. Program, 14(3):130–137, 1980. [7] M.F. Porter. French stemmer. http://snowball.tartarus.org/french/stemmer.html. [8] W. A. Woods. Aggressive morphology for robust lexical coverage. Technical Report TR-99-82, Sun Microsystems, 1999. 10

Sciences

Share

Rating

Date

Size

Views

Categories

Transcript

Select your language