Related Information

Contact Us

Keep up to date with all things Grapeshot: register your details and we'll keep you informed.

Tel:
+44 (0)1223 311319
Email:
info@grapeshot.co.uk

Language Solutions

Grapeshot talks many languages

Grapeshot has a particular strength in languages.

Grapeshot is Unicode (UTF-8) compliant throughout. Its specialist routines not only detect the languages in use automatically, by reference to character sets, but also apply the appropriate word-splitting, character-separation and stemming routines to optimise both indexing and search.
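As a rough illustration of detection "by reference to character sets", the sketch below guesses the dominant script of a text from the Unicode names of its letters. It is not Grapeshot's routine: the function name `dominant_script` and the `SCRIPT_PREFIXES` table are assumptions made for this example, and identifying a script is only a first step towards identifying a language.

```python
import unicodedata
from collections import Counter

# Hypothetical lookup table: first word of a Unicode character name -> script label.
SCRIPT_PREFIXES = {
    "LATIN": "Latin",
    "ARABIC": "Arabic",
    "CJK": "Han",
    "HIRAGANA": "Japanese kana",
    "KATAKANA": "Japanese kana",
    "HANGUL": "Hangul",
    "CYRILLIC": "Cyrillic",
    "GREEK": "Greek",
}


def dominant_script(text):
    """Return the most frequent script among the letters of `text`, or None."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation and whitespace
        # unicodedata.name() returns e.g. "ARABIC LETTER MEEM"; "" if unnamed.
        prefix = unicodedata.name(ch, "").split(" ", 1)[0]
        script = SCRIPT_PREFIXES.get(prefix)
        if script:
            counts[script] += 1
    return counts.most_common(1)[0][0] if counts else None
```

Once the script is known, a pipeline could choose a matching tokeniser, for instance character-based splitting for Han text and whitespace splitting with stemming for Latin text.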

Grapeshot has been hand-crafted to work with a wide range of European languages, as well as Arabic, Chinese, Korean and Japanese.

Cleaning Up The Data

Many publishers pride themselves on the quality of their content, yet the electronic versions of their files are often littered with incorrect character notation. In electronic texts the same printed symbol can be represented in many different ways, for example the Ñ in Spanish. The Grapeshot indexing scripts are ideal for accepting all varieties of input and massaging them into a standard form. Publishers have used Grapeshot to help clean up their existing electronic assets.
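To make the Ñ example concrete, here is a minimal sketch of how several encodings of the same symbol can be reduced to one standard form. It uses only the Python standard library and is an illustration of the idea, not the Grapeshot indexing scripts; the helper name `normalise_text` is made up for this example.

```python
import html
import unicodedata


def normalise_text(raw):
    """Decode HTML entities, then compose characters into Unicode NFC form."""
    decoded = html.unescape(raw)  # "&Ntilde;" and "&#209;" become real characters
    # NFC merges a base letter plus combining mark into one precomposed code point.
    return unicodedata.normalize("NFC", decoded)


# Five ways the same printed symbol can appear in an electronic text.
variants = [
    "\u00d1",     # precomposed capital N with tilde
    "N\u0303",    # capital N followed by a combining tilde
    "&Ntilde;",   # named HTML entity
    "&#209;",     # decimal character reference
    "&#xD1;",     # hexadecimal character reference
]
normalised = {normalise_text(v) for v in variants}
```

After normalisation all five variants collapse to the single precomposed character, so an index built on the cleaned text treats them as the same symbol.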

Crawling For Language

The Grapeshot Epicentre Crawler is a product that helps online aggregators crawl remote webpages and slice up the source HTML. HTML pages often include not just distracting advertisements but also navigation bars and links along the top, bottom, left and right of the page. Epicentre decides where the meaningful content sits on a page, which we call the "epicentre" of a document, and uses this to select the section of the HTML page that holds the useful content.
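The general idea of separating core text from page noise can be sketched with a simple link-density heuristic: score each container by how much plain text it holds, penalise text that sits inside links, and keep the best-scoring container. This is a deliberately naive stand-in, not the Epicentre algorithm, which is described here only as relying on WordRank's probability maths. The names `BlockScorer`, `extract_core_text` and the `CONTAINERS` set are assumptions for this example.

```python
from html.parser import HTMLParser

CONTAINERS = {"div", "td", "article", "section", "main", "body"}  # candidate blocks
SKIPPED = {"script", "style"}  # their contents are never visible text


class BlockScorer(HTMLParser):
    """Collect plain text and link text for each container element."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open containers, innermost last
        self.blocks = []   # every container seen, in document order
        self.link_depth = 0
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            block = {"text": [], "link_chars": 0}
            self.stack.append(block)
            self.blocks.append(block)
        elif tag == "a":
            self.link_depth += 1
        elif tag in SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in CONTAINERS and self.stack:
            self.stack.pop()
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1
        elif tag in SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth or not self.stack:
            return
        block = self.stack[-1]  # attribute text to the innermost container
        if self.link_depth:
            block["link_chars"] += len(text)  # navigation-style text counts against
        else:
            block["text"].append(text)


def extract_core_text(html_source):
    """Return the text of the container with most non-link text, or ''."""
    scorer = BlockScorer()
    scorer.feed(html_source)
    scorer.close()

    def score(block):
        return sum(len(t) for t in block["text"]) - block["link_chars"]

    best = max(scorer.blocks, key=score, default=None)
    return " ".join(best["text"]) if best else ""
```

Because the score depends only on character counts and markup, not on vocabulary, a heuristic of this kind makes no assumption about the language of the page.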

Epicentre customers have needed this method of automatically extracting the "valuable bits" of remote webpages to scale to other languages. Rather than build taxonomies or rules for each language, Grapeshot simply switched on its foreign-language routines, and the probability maths of Grapeshot's WordRank went to work on foreign character-set combinations just as easily as it had originally done in English.

Arabic Source Document

BEFORE: Arabic source HTML document with navigation bar and "page noise"

Epicentre extracts core text from page noise

AFTER: Grapeshot has extracted the core text from the page noise using the Epicentre Crawler

Grapeshot's strength is that an information solution can be migrated easily to all the major languages of the world, without the need to invest in language-specific thesauri, taxonomies or semantic webs. It can scale in an instant!