Related Information

Contact Us

Keep up to date with all things Grapeshot: register your details and we'll keep you informed.

Tel:
+44 (0)1223 311319
Email:
info@grapeshot.co.uk

FAQs

How many documents can you index?

The technical limit is "2 to the power 47" documents in a single Grapeshot index, with the ability to search 10, 100 or 1000 indexes simultaneously in a distributed fashion. In practice the limiting factor is the speed at which data can be read off a hard-drive spindle. On today's cheap hardware (less than $1,000), any index file larger than 20GB or 30GB is noticeably slower to read, depending on the amount of RAM available. Some Grapeshot indexes store only 1.5KB per document, so it is reasonable to hold 10 or 20 million documents in a single Grapeshot index, and to run larger systems of 100 million documents by using several indexes on different servers as a combined federated search system. The Grapeshot algorithms are fast: disc read speed is the limiting factor in any hardware architecture design.
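
The figures above can be checked with a little arithmetic. The sketch below is only a back-of-envelope calculation, not a Grapeshot tool: it assumes 1.5KB means 1,500 bytes and 1GB means 10**9 bytes, and the function names are made up for illustration.

```python
# Back-of-envelope sizing using only the figures quoted in the answer above.
# Assumptions: 1.5KB = 1,500 bytes, 1GB = 10**9 bytes.

MAX_DOCS_PER_INDEX = 2 ** 47   # stated technical limit: 140,737,488,355,328
BYTES_PER_DOC = 1_500          # "1.5kb per document"
GB = 10 ** 9


def index_size_gb(num_docs: int) -> float:
    """Approximate index file size in GB for a given document count."""
    return num_docs * BYTES_PER_DOC / GB


def indexes_needed(total_docs: int, docs_per_index: int) -> int:
    """Number of indexes required to hold total_docs, rounding up."""
    return -(-total_docs // docs_per_index)


# 10 million documents -> 15.0 GB, 20 million -> 30.0 GB,
# which matches the 20GB to 30GB comfort zone quoted above.
size_10m = index_size_gb(10_000_000)
size_20m = index_size_gb(20_000_000)

# 100 million documents at 20 million per index -> 5 indexes.
servers = indexes_needed(100_000_000, 20_000_000)
```

So the practical ceiling of 10 or 20 million documents per index follows directly from the disc-read comfort zone, many orders of magnitude below the technical limit.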

What operating systems do you support?

Grapeshot can be compiled for any operating system that supports ANSI C. We currently build new versions of the code for Windows, Linux and FreeBSD. We can supply MacOS and Sun Solaris versions (and all other UNIX variants) upon request.

How fast can it index?

Grapeshot can index a document in 4 or 5 milliseconds, where the document is often several hundred words long and has 20 or more fields of information. Grapeshot has a unified indexing method for all languages, with appropriate stemming algorithms and word segmentation techniques. Any text encoding scheme can be handled, but UTF-8 encoded Unicode is standard.
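
As a rough guide to what 4 or 5 milliseconds per document means in volume terms, here is a small calculation. It assumes a single indexing process working back to back with no time spent fetching documents or waiting on disc, so real throughput would be lower; the function names are illustrative only.

```python
MS_PER_SECOND = 1_000
MS_PER_DAY = 86_400_000  # 24 * 60 * 60 * 1000


def docs_per_second(ms_per_doc: float) -> float:
    """Documents indexed per second at a fixed cost per document."""
    return MS_PER_SECOND / ms_per_doc


def docs_per_day(ms_per_doc: float) -> int:
    """Documents indexed in 24 hours of uninterrupted indexing."""
    return int(MS_PER_DAY // ms_per_doc)


# 4 ms per document -> 250 documents/second, 21,600,000 per day
# 5 ms per document -> 200 documents/second, 17,280,000 per day
fast = (docs_per_second(4), docs_per_day(4))
slow = (docs_per_second(5), docs_per_day(5))
```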

Do you support multi-threading?

Grapeshot does not use multi-threading in the core code, but its API lets callers apply multi-threading techniques to good effect. For example, in Java each of several JNI threads can invoke a separate Grapeshot session. Grapeshot has a small memory footprint (<300k), which often allows many sessions to run on a single CPU.
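
The one-session-per-thread pattern can be sketched as follows. This is an illustration in Python rather than Java/JNI, and the Session class is a hypothetical stand-in for a single-threaded engine session, not the Grapeshot API. The point it shows is that each thread creates and owns its own session, so the single-threaded core is never shared between threads.

```python
import threading


class Session:
    """Hypothetical stand-in for one single-threaded engine session."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.indexed = 0  # state private to this session

    def index(self, text: str) -> None:
        self.indexed += 1  # placeholder for real indexing work


def worker(name: str, docs: list[str], results: dict[str, int]) -> None:
    session = Session(name)  # one session per thread, never shared
    for doc in docs:
        session.index(doc)
    results[name] = session.indexed  # each thread writes its own key


def run(batches: dict[str, list[str]]) -> dict[str, int]:
    """Start one thread per batch and wait for all of them to finish."""
    results: dict[str, int] = {}
    threads = [
        threading.Thread(target=worker, args=(name, docs, results))
        for name, docs in batches.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


counts = run({"a": ["doc one", "doc two"], "b": ["doc three"]})
# counts == {"a": 2, "b": 1}
```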

How do you handle UTF-8?

Grapeshot handles a multitude of character sets. Grapeshot indexing routines identify the character set in use within a document and apply appropriate stemming routines as part of tokenising the words or phrases within the incoming text. Tokenisation includes word splitting or character separation, as well as dealing with the idiosyncrasies of punctuation within each language. Grapeshot's UTF-8 compliance works for the complete Unicode character set - but specialist tokenisation and stemming routines are available for English, Arabic, Chinese, Japanese, Korean, French, Spanish, Portuguese, Italian, German, Dutch, Swedish, Norwegian, Danish, Russian and Finnish.
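
To make the difference between word splitting and character separation concrete, here is a minimal sketch. It is not Grapeshot's tokeniser: it assumes a single code-point range check stands in for proper language detection, it does no stemming, and the function names are invented for this example.

```python
import re

WORD = re.compile(r"\w+")  # runs of Unicode letters, digits and underscores


def is_cjk(ch: str) -> bool:
    """Assumption: the CJK Unified Ideographs block stands in for real detection."""
    return "\u4e00" <= ch <= "\u9fff"


def tokenise(raw: bytes) -> list[str]:
    """Decode UTF-8, drop punctuation, split words, separate CJK characters."""
    text = raw.decode("utf-8")
    tokens: list[str] = []
    for run in WORD.findall(text):
        buf = ""
        for ch in run:
            if is_cjk(ch):
                if buf:  # flush any pending non-CJK word
                    tokens.append(buf.lower())
                    buf = ""
                tokens.append(ch)  # each ideograph becomes its own token
            else:
                buf += ch
        if buf:
            tokens.append(buf.lower())
    return tokens


tokens = tokenise("Hello, world! 中文".encode("utf-8"))
# tokens == ["hello", "world", "中", "文"]
```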

Do you do Boolean?

Grapeshot supports Boolean searching across all fields, as well as field-specific queries, with AND, OR, NOT and other familiar Boolean operators. Grapeshot indexes can hold complete positional information for the words inside a source document, thereby aiding proximity and phrase searching. Grapeshot provides a combination of advanced search techniques with a commitment to support traditional approaches to IR.
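
The way positional information supports both Boolean and phrase searching can be shown with a toy inverted index. This is a generic textbook sketch, not Grapeshot's data structures or query syntax; all names and the sample documents are invented for the example.

```python
from collections import defaultdict

Postings = dict[str, dict[int, list[int]]]  # term -> {doc_id: [positions]}


def build_index(docs: dict[int, str]) -> Postings:
    """Record, for every term, the word positions at which it occurs per document."""
    index: Postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def docs_with(index: Postings, term: str) -> set[int]:
    """Set of documents containing the term."""
    return set(index.get(term, {}))


def phrase(index: Postings, first: str, second: str) -> set[int]:
    """Documents where `second` occurs immediately after `first`."""
    hits = set()
    for doc_id in docs_with(index, first) & docs_with(index, second):
        following = set(index[second][doc_id])
        if any(p + 1 in following for p in index[first][doc_id]):
            hits.add(doc_id)
    return hits


index = build_index({
    1: "red wine from france",
    2: "white wine from spain",
    3: "france exports red apples and wine",
})

red_and_wine = docs_with(index, "red") & docs_with(index, "wine")      # {1, 3}
red_or_white = docs_with(index, "red") | docs_with(index, "white")     # {1, 2, 3}
wine_not_france = docs_with(index, "wine") - docs_with(index, "france")  # {2}
red_wine_phrase = phrase(index, "red", "wine")                         # {1}
```

Documents 1 and 3 both match "red AND wine", but only document 1 matches the phrase, because the stored positions show the two words are adjacent there.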

What is the largest index size you can build?

The largest index built by the Grapeshot team was the 500 million document index of WebTop.com back in 1999/2000. This system spidered 20 million documents per day and constantly refreshed a large index of the internet - at the time WebTop.com was almost twice the size of AltaVista and Excite, popular search engines of that era. WebTop.com used Dr. Martin Porter's algorithms embedded within the older Muscat software to create a distributed index across four, then eight, servers.

The Grapeshot code, written more recently, is much more efficient in its indexing routines, smarter in how data is stored inside the inverted B-tree files, and uses more recently devised best-match algorithms to power the search. The result is a step change in performance, as well as a significant reduction in the size of the code footprint, and hence more efficient use of memory too. Grapeshot has regularly been used for indexes of between 10 and 100 million documents, but there is no reason not to consider building index collections of many billions of documents with Grapeshot.

When are the indexes updated?

Grapeshot can update an index file simultaneously with people searching the same file. The indexing process can write to an index file at the same moment that query processes are reading the same index file. So index updating can happen at any time.

Can you search across many index files?

Grapeshot can search across 10, 20, 50, 100 or more Grapeshot indexes as one search - using an IRlist function where you can choose the indexes to run the query against. There is a powerful toolset for building a unified search capability across distributed indexing processes. For example, some indexes might update each second, based on rapid news-flow or changing parameters inside an SQL database. Other indexes might be created from daily crawls of certain document repositories or intranet resources. With Grapeshot, separate indexes (each with their own format) can be created under different administrative and technical conditions, yet unified as a single search, if required. Users could optionally search one or more indexes of their choice, as part of a multi-index information architecture.
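
The idea of choosing a list of indexes and getting back one merged result set can be sketched as below. This is a toy model only: it does not use the IRlist function or any Grapeshot call, the index contents are invented, and scoring by raw term count is an assumption made to keep the example short (a real federated system has to make scores from different indexes comparable).

```python
import heapq

Index = dict[str, str]  # doc_id -> document text


def search_one(index: Index, term: str) -> list[tuple[int, str]]:
    """Return (score, doc_id) hits for one index; score is a raw term count."""
    hits = []
    for doc_id, text in index.items():
        score = text.lower().split().count(term.lower())
        if score:
            hits.append((score, doc_id))
    return hits


def federated_search(
    indexes: dict[str, Index], chosen: list[str], term: str, k: int = 10
) -> list[tuple[int, str, str]]:
    """Query only the chosen indexes and merge the hits into one ranked list."""
    merged = []
    for name in chosen:
        for score, doc_id in search_one(indexes[name], term):
            merged.append((score, name, doc_id))
    return heapq.nlargest(k, merged)  # highest score first


indexes = {
    "news": {"n1": "grape harvest grape", "n2": "weather report"},
    "intranet": {"i1": "grape policy"},
    "archive": {"a1": "grape grape grape"},
}

results = federated_search(indexes, ["news", "intranet"], "grape")
# results == [(2, "news", "n1"), (1, "intranet", "i1")]
```

The "archive" index is left out of the result because it was not in the chosen list, even though it contains the best match.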

Are there performance hits for distributed searching?

Searching across multiple indexes does not incur any significant performance hit in terms of the speed of the Grapeshot algorithms. As outlined above, Grapeshot algorithms are very fast, such that the disc read times of the computer hardware are often the more limiting factor on the performance of a large index system. Grapeshot considers its distributed search methods to be extremely efficient, and powerfully useful when combined with Grapeshot's ability to write new data to an index while users are searching the very same index.

How do you change an index file?

There are simple API calls that allow you to add a document to the B-tree. When updating a document, it is actually faster to collect the document from the Grapeshot index, delete it and add it back as a fresh document than it is to recall it, edit it and update it. However, both methods are available. Likewise, it is easy to purge individual documents from any index.
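
The delete-and-add-back route can be illustrated with a toy index. The class and method names below are hypothetical, the structure is a plain dictionary rather than a B-tree, and the sketch says nothing about relative speed; it only shows that replacing a document this way leaves no stale entries behind.

```python
class ToyIndex:
    """Minimal in-memory inverted index (illustrative, not Grapeshot's API)."""

    def __init__(self) -> None:
        self.docs: dict[int, str] = {}           # doc_id -> text
        self.postings: dict[str, set[int]] = {}  # term -> doc_ids

    def add(self, doc_id: int, text: str) -> None:
        self.docs[doc_id] = text
        for term in set(text.lower().split()):
            self.postings.setdefault(term, set()).add(doc_id)

    def delete(self, doc_id: int) -> None:
        text = self.docs.pop(doc_id)
        for term in set(text.lower().split()):
            ids = self.postings[term]
            ids.discard(doc_id)
            if not ids:  # drop terms that no longer occur anywhere
                del self.postings[term]

    def replace(self, doc_id: int, new_text: str) -> None:
        """Update by deleting the old document and adding a fresh one."""
        self.delete(doc_id)
        self.add(doc_id, new_text)

    def search(self, term: str) -> set[int]:
        return set(self.postings.get(term.lower(), set()))


idx = ToyIndex()
idx.add(1, "old grape")
idx.add(2, "grape juice")
idx.replace(1, "new vine")
# idx.search("grape") == {2}; idx.search("vine") == {1}; idx.search("old") == set()
```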

What skills do I need to use Grapeshot?

Grapeshot has three levels of API. The C code can be wrapped inside Java or C++, so ideally a programmer needs some awareness of C and of the programming environment in which they expect to embed Grapeshot. A second API method is XML scripting that instructs Grapeshot to perform Grape scripts - XML skills are useful. A third method is via the Grapeshot command line, which offers over 60 modules as commands with parameters. The command line interface has the feel of a Unix prompt, but the commands and parameters are specific to Grapeshot - so some patience is required to read the API Reference Documentation and become familiar with those commands and how they work.

What happens if there is a crash mid-process?

Following a crash, each Grapeshot IR system is left in a state of integrity. That is to say, it is left in the state it was in just before the interrupted update began. This is useful for any user of Grapeshot and is essential for any installation demanding 24/7 mission critical performance.
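
The text above does not say how Grapeshot achieves this guarantee. One common, general-purpose way to get the same all-or-nothing behaviour is to write the new state to a temporary file and swap it into place with an atomic rename. The sketch below shows that technique on a small JSON file; it is an assumption-laden illustration, not Grapeshot's mechanism, and atomic_update is a made-up name.

```python
import json
import os
import tempfile
from typing import Callable


def atomic_update(path: str, update: Callable[[dict], dict]) -> None:
    """Apply `update` to the JSON state stored at `path`, all or nothing."""
    with open(path, "r", encoding="utf-8") as f:
        state = json.load(f)
    new_state = update(state)  # if this raises, the old file is untouched

    # Write the new state next to the old file, then swap it into place.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(new_state, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes are on disc first
        os.replace(tmp, path)     # the rename is atomic on POSIX systems
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)        # never leave a half-written file behind
        raise
```

A reader that opens the file at any moment sees either the complete old state or the complete new state, which is the property described above.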