Related InformationRelated Information

Downloadable DocumentsDownloadable Documents

Contact UsContact Us

Keep up-do-date with all things Grapeshot, Click Here to register your details and we'll keep you informed.

Tel:
+44 (0)1223 311319
Email:
info@grapeshot.co.uk

BenchmarksPorter Stemmers

Grapeshot includes twelve Porter stemmers as a language option in the release of the Grapeshot SDK. The ability to process words or tokens in other languages as part of the Summarisation, Personalisation, Categorisation and Conceptual Search techniques within the iSA (Intelligence Search Architecture) Toolkit is improved by the use of language specific stemming.

Grapeshot ships with the following Porter stemmers:

  • English, French, Spanish, Portuguese, Italian, German, Dutch, Swedish, Norwegian, Danish, Russian and Finnish

Other languages such as Chinese, Arabic and Korean are supplied via third parties, and are not part of the Porter Stemmer heritage.

Stemming Algorithms

Grapeshot makes, or is able to make, full use of the Snowball stemming algorithms (Porter, 2001), and the internal scripting language of Grapeshot is a superset of the Snowball language, invented for defining stemming algorithms. But Grapeshot is not merely a vehicle for putting stemming algorithms into practice. It is worth looking at stemming algorithms as a whole, and seeing how they fit into the IR model on which Grapeshot is built.

As an IR technique, automatic stemming is an old idea (going back to at least the 1960s), but is nevertheless still controversial. The effectiveness of stemming has been challenged experimentally; the big internet search engines do not, as a rule, use stemming; stemming is perhaps inferior to lemmatization. Let us look at these three issues in turn.

Stemming Evaluation

The snowball stemmers include an English stemmer, which is the Porter stemmer (Porter, 1980) plus a small set of improvements. Lennon (1981) did tests on a range of term conflation techniques that were, for the Porter stemmer at least, encouraging. Eight techniques were compared. The Porter stemmer was out-performed by one other technique at the high-precision end, and performed best (equal with another technique) at the high-recall end. The differences in performance were, however, very small, and usually not statistically significant. A similar result is reported in Porter (1980), where the new Porter algorithm was shown to perform slightly better than an earlier and more complex algorithm, but not to a degree which was statistically significant.

But later work by Harman (1980), working on different test collections, reported that stemming yielded no improvement. She merely concluded that ‘the number of queries with improved performance tended to equal the number with poorer performance, resulting in little overall change for the entire test collection.’

Later still, work by Krovetz (1995) reverted more to Lennon’s position. He found that stemming yielded improvements across a range of test collections, with the degree of improvement varying considerably between different collections. The improvements were small, but he judged them as significant.

What is one to make of all this?

It is important to recognise the limits of what can be discovered through a traditional IR test. An IR test collection comprises a set of queries, a set of documents, and, for each query, the set of relevant documents for that query, as judged by some panel of experts. An IR test measures how well the ranking of documents got by running the queries bring back the documents pre-judged as relevant. Test collections are very important in IR research, but it is very difficult to use them to measure the effectiveness of an interactive process between user and system designed to improve retrieval performance. And it is in such processes that stemming becomes important. Stemming regularises the vocabulary that is presented to the user. This gives the user confidence in the indexing process and makes word choice for query expansion easier. Stemming amalgamates batches of words for running in large queries, queries which are best built up through an interactive process.

It is important therefore not to make a final judgement on stemming based only the results of IR experimentation.

Internet Search Engines

At the time of writing (2004), the search engine space is dominated by Google, although that may not be the case in the future, any more than it was in the past. The total number of search engines (Lycos, Altavista, Hotbot, msn, the Euroferret ...) has been very large, but with a few exceptions, they have all conformed to a common functional pattern. The remarks below, it must be emphasised, are not intended as a disparagement of these systems.

Search engine queries tend to be very simple — usually less than four words. As a rule, only documents indexed by all the query terms are returned, which itself discourages the formulation of queries with many terms. Any IR system must find a balance between precision and recall. Generally speaking, high recall corresponds to low precision, and low recall to high precision. In the fiercely competitive world of internet search engines, the emphasis is on precision. This is because users see precision at once in the first page of retrieved items, but recall can only be judged by what is missing, and this is invisible. Besides, the internet is so large, and the quality of material so mixed, that high recall searches are not expected. These systems will not use stemming, which can interfere with precision, especially in short queries.

One can give an example here. The single query titanic might be expected to give results about (a) the Titanic disaster of 1912, or (b) one of the films made about the event. The single query titan might be expected to give results about (c) the Titans of Greek mythology, or (d) the moon of Saturn. The Porter stemmer equates titan and titanic, and then a single query titanic would pull up all categories, (a), (b), (c) and (d) together, thereby decreasing precision. But if longer queries are used in a fully probabilistic system there is no problem,

  • (a) titanic disaster
  • (b) titanic, with leonardo dicaprio and kate winslet
  • (c) greek myth titans
  • (d) titan saturn's moon

The functionality of internet search engines therefore discourages stemming. Long queries run slower than short ones, and match algorithms that only deliver documents indexed by all the query terms run faster than algorithms that rank all the documents indexed by any query term. Internet search engines need to answer as many incoming queries per second as possible, and the technology adopted therefore encourages queries to be kept short.

Furthermore, users become so familiar with these systems, they expect any encounter with a search capability to work in the same way. This further discourages use of stemming.

Lemmatization

Algorithmic stemmers make errors in their attempt to remove endings — such is the nature of natural languages. The errors can be eliminated, or at least greatly reduced, if the words are instead looked up in a computerised dictionary, a process that can also give the exact part of speech of the normalised form of the word looked up. This alternative approach may be called lexical stemming, or lemmatization.

For a thorough investigation into algorithmic versus lexical stemming, see Krovetz (1995), plus Krovetz (1993 and 1994). Here we confine ourselves to a few points.

Lemmatization may include something like a stemming process, since, unless the dictionary contains all inflected forms of words, a word will have to undergo some sort of normalisation process before it can be looked up in the dictionary. This is especially true for languages in which proper names, which will not be found in the dictionary, also inflect — Russian and Finnish for example, and classical Latin.

A dictionary can also be used to tackle compound word division, a problem that arises in German, Dutch and the Scandinavian languages. For example, in German, Pfarrerweiterbildungskommission = Pfarrer + weiter + bildungs + kommission, (‘Commission for the further education of priests’)

With the rise of commercially available lemmatization systems, and the availability of algorithmic stemmers, a good deal of work is currently being done comparing the two. A final verdict is still in the future, but meanwhile algorithmic stemmers do perform creditably. For example, Tomlinson (2003) compares the Snowball stemmers with a commercial lexical stemming (lemmatization) system. Of the nine languages tested, six gave differences that were not statistically signifant, two did better under the lemmatization system, and one better under Snowball.

One would expect lemmatization systems, ultimately, to perform better than algorithmic stemmers, but algorithmic stemmers have the obvious advantage of not requiring use of a large dictionary, with all the problems of distribution, maintenance and updating (with language change). Certainly the small Snowball stemmers fit in very well with the Grapeshot idea of lightweight solutions for the problems of IR.

Approaches to Stemming

Stemming processes may be described as a set of rules, but with no account of the software for driving the rules, or as a piece of software, with just a sketch of the rules the software is intended to implement. The Schinke stemmer for Latin is an example of the former (Schinke, 1996 and 1998); the Kraaij-Pohlmann Dutch stemmer is an example of the latter (Kraaij, 1994 and 1995). Often, the use of a stemmer is announced in a publication, but no description given. This can be very tantalising.

Authors of stemmers often follow a rule structure as an aid in developing the software, but problems can arise when language features are not readily describable in the rule structure. For example, a common model is the finite state machine (Honrado, 2000, Fox 2001, Da Silva 2001), but it is hard to represent in this model the idea of, say, removing a suffix if the residual stem length exceeds a certain number of syllables.

As far as English is concerned, the main stemmers have been the Lovins stemmer (Lovins 1968), the Porter stemmer (Porter 1980) and the Paice stemmer (Paice 1990). Lovins follows a rule structure which in its main outline is rather rigid, but had a lot of internal flexibility. Paice has a set of rules that are simple drivers for the stemmer program itself. Porter allows the rules to develop from the suffix structure of English. The original program of the Porter stemmer was in BCPL, and was distributed on rquest, but since then it has been rewritten many times in many different languages.

Other approaches to stemming are possible, for example, finding word similarity through n-grams (Robertson, 1998), or looking at the similarity of adjacent words in a sorted word list (Makagonov, 2002).

The Snowball Approach

Snowball is a scheme for developing and defining stemmers. Instead of using a rigid rule structure for defining the stemmers, a language has been invented, Snowball, in which to represent the rules that naturally emerge as the stemmer is developed. A stemmer for a language is then defined in three ways: by (a) written description, (b) a Snowball program, and (c) a representative vocabulary of words together with their stemmed forms. Snowball has been very successful in exactly defining, and widely distributing, stemming algorithms.

Unfortunately the Snowball project has not led to the submission of language stemming algorithms from other IR researchers. This may be because they are not that easy to devise, requiring as they do software and linguistic skills, as well as a central interest in IR.

Certainly stemmers for east European languages and for Arabic would be very useful. On the other hand Snowball has become the main, if not the only, repository of readily available stemming algorithms in use in the world today.

References

Brian Fox and Christopher J. Fox (2001) Efficient stemmer generation. Information Processing and Management, 38 :547?558.

Donna Harman (1991) How effective is suffixing? Journal of the American Society for Information Science, 42 :7?15.

Asunción Honrado, Ruben Leon, Ruairi O’Donnel and Donald Sinclair (2000) A word stemming algorithm for the Spanish language. IEEE :139?145.

W. Kraaij W and R. Pohlmann (1994) Porter’s stemming algorithm for Dutch. In L.G.M. Noordman and W.A.M. de Vroomen WAM, editors, Informatiewetenschap 1994: Wetenschappelijke bijdragen aan de derde STINFON Conferentie, Tilburg, 1994. :167?180.

W. Kraaij and R. Pohlmann (1995) Evaluation of a Dutch stemming algorithm. In J. Rowley, editor, The New Review of Document and Text Management, volume 1, London: Taylor Graham, :25?43.

Bob Krovetz (1993) Viewing morphology as an inference process. In Proceedings of the ACM-SIGIR Conference on Research and Development in Information Retrieval, :191?202.

Bob Krovetz (1994) Learning to augment a machine readable dictionary. In Proceedings of Eurolex, :107?116.

Bob Krovetz (1995) Word sense disambiguation for large text databases. PhD Thesis. Department of Computer Science, University of Massachusetts Amherst.

Martin Lennon, David S. Peirce, Brian D. Tarry and Peter Willett (1981) An evaluation of some conflation algorithms for information retrieval. Journal of Information Science, 3 :177?183.

Julie Beth Lovins (1968) Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11 :22?31.

Pavel Makagonov and Mikhael Alexandrov (2002) Empirical formula for testing word similarity and its application for constructing a word frequency list. In A. Gelbukh, editor, Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics Springer :425?432.

Pavel Makagonov and Mikhael Alexandrov (2002) Constructing empirical formulas for testing word similarity by the inductive method of model self-organisation. In Elisabete Ranchhod and Nuno Mamede, editors, Lecture Notes in Artificial Intelligence Springer :239?247.

C.D. Paice (1990) Another stemmer. SIGIR Forum, 24(3) :56?61.

Martin Porter (1980) An algorithm for suffix stripping. Program, 14 :130?137.

Martin Porter (2001) Snowball: A language for stemming algorithms.

Alexander M. Robertson and Peter Willett (1998) Applications of n-grams in textual information systems. Journal of Documentation, 54(1) :48?69.

Robyn Schinke, Mark Greengrass, Alexander M. Robertson and Peter Willett (1996) A stemming algorithm for Latin text databases. Journal of Documentation, 52 :172?187.

Robyn Schinke, Mark Greengrass, Alexander M. Robertson and Peter Willett (1996) Retrieval of morphological variants in searches of Latin text databases. Computers and the Humanities, 31 :409?432.

Gilberto Ferreira da Silva (2001) Representação do léxico para reconhecimento da similaridade de palavras no português. Master’s Dissertation. Instituto Militar de Engenharia, Ministerio da Defesa, Rio de Janeiro.

Stephen Tomlinson (2003) Lexical and algorithmic stemming compared for 9 European languages with Hummingbird SearchServer(TM) at CLEF 2003. In Carol Peters, editor, Working notes for the CLEF 2003 Workshop 21?22 August, Trondheim, Norway.