Capable
IBM powering WebSphere with Grapeshot

IBM turns to Grapeshot
IBM had spent many years on the WebFountain project, where taxonomies and semantic linguistic relationships were hand-crafted to help knowledge workers navigate large collections of documents. Senior IBM executives saw Grapeshot's WordRank and realised that the new algorithms could create relationships between words with no upfront human cost. With Grapeshot, probability algorithms can identify that "Katrina" is related to "Hurricane". Applying the 80:20 rule, it seemed that vast swathes of documents could be intelligently organised using Grapeshot, without the need for any human intervention.
- IBM using Grapeshot to personalise the content experience
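To make the idea of relating "Katrina" to "Hurricane" without a hand-built taxonomy more concrete, here is a minimal sketch. It is not WordRank: it uses pointwise mutual information over document co-occurrence as a stand-in, and the function name `related_words`, the `min_count` threshold and the sample documents are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def related_words(docs, target, min_count=2):
    """Rank words by pointwise mutual information (PMI) with `target`.

    PMI is an illustrative stand-in here, not Grapeshot's WordRank.
    """
    # Treat each document as a set of lowercase words.
    doc_words = [set(re.findall(r"[a-z]+", d.lower())) for d in docs]
    n_docs = len(doc_words)

    # Document frequency of every word, and of every word seen with the target.
    word_df = Counter(w for words in doc_words for w in words)
    joint_df = Counter(
        w for words in doc_words if target in words for w in words if w != target
    )

    scores = {}
    for word, joint in joint_df.items():
        if joint < min_count:
            continue  # ignore words that co-occur too rarely to judge
        # PMI = log( P(word, target) / (P(word) * P(target)) )
        scores[word] = math.log((joint * n_docs) / (word_df[word] * word_df[target]))
    return sorted(scores, key=lambda w: (-scores[w], w))


sample_docs = [
    "Hurricane Katrina hits the coast",
    "Katrina was a major hurricane in the Gulf",
    "The election results were announced in the capital",
    "The team won the final match",
    "Relief teams respond to hurricane Katrina",
]
related = related_words(sample_docs, "katrina")
```

On this toy collection "hurricane" ranks first, because it appears in every document that mentions "katrina" and in no others, while a common word such as "the" scores lower.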
Intelligent crawling of HTML pages to create packaged XML RSS feeds

Grapeshot snips out the valuable content
Anyone can write a crawler to fetch HTML pages, but it is far harder to parse out the quality content on a page and leave behind the dross. Sorting the "wheat from the chaff" is a particular skill of Grapeshot. Assessing the significance of each word on a page (whether it is a navigational link, a piece of advertising copy, a main headline or a byline) is not done with HTML parsing logic, because different webmasters use such different style sheet methods. Instead, Grapeshot assesses the significance of each word in probability terms and identifies the core sentences on the page. This highlights what we call the "epicentre" of the page: the core content that the Grapeshot crawlers can extract and package into an XML record for content dissemination. So Grapeshot can do more than crawl; it can intelligently assess the correct snippets to cut from a page, which is all-important when new Web 2.0 interfaces show such a myriad of frames and dynamically created content. Grapeshot parses content based on the value of words, not just the simpler HTML parsing of stylesheet codes!
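As a rough illustration of choosing content by word value rather than by markup, the sketch below scores each text block of a page by the average page-level probability of its words and keeps the best-scoring blocks. This is not Grapeshot's algorithm: the `epicentre` function, the tiny stopword list and the scoring rule are assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not Grapeshot's).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it",
             "for", "on", "at", "by", "as", "was"}


def tokenize(text):
    """Lowercase words with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def epicentre(blocks, keep=2):
    """Return the `keep` highest-scoring text blocks, in page order."""
    page_counts = Counter(w for block in blocks for w in tokenize(block))
    total = sum(page_counts.values())

    def score(block):
        words = tokenize(block)
        if not words:
            return 0.0
        # Mean probability of the block's words across the whole page:
        # blocks built from the page's recurring topic words score highest.
        return sum(page_counts[w] / total for w in words) / len(words)

    ranked = sorted(range(len(blocks)), key=lambda i: score(blocks[i]), reverse=True)
    return [blocks[i] for i in sorted(ranked[:keep])]


page_blocks = [
    "Home News Sport Weather",                      # navigation
    "Hurricane Katrina strengthens over the Gulf",  # headline
    "Forecasters said the hurricane could reach the coast by Monday "
    "as Katrina gathers strength over warm Gulf waters",  # body text
    "Buy cheap flights today",                      # advertising
    "Copyright 2005 All rights reserved",           # footer
]
core = epicentre(page_blocks)
```

Here the headline and body text are kept, while the navigation, advertising and footer blocks are dropped, even though no HTML tags or style sheet classes were inspected.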
Classifying the Internet
Some customers are using Grapeshot not just to crawl, but also to classify content. One project is indexing 50 million websites every 11 days, collecting over 300 million documents. Each one is analysed by Grapeshot and given a category based on Grapeshot's analysis of the words in each document. Grapeshot's WordRank assigns weights to words, so each document can have a set of 30 "top terms" that act as the primary fingerprint, or DNA, of what that document is about. These term profiles are matched against a category database, also created in Grapeshot. Each category is defined by an automatic WordRank analysis of ten training documents that exemplify that category, so that each category has approximately 400 "top terms". Every new document is used like a search query: its DNA of "top terms" is matched as a search against the database of categories, and the most relevant categories are identified. This is a fast way to classify 300 million documents, with very little human overhead.
- Make Grapeshot categorise any URL of your choice - in real time - in the Grapeshot Showreel
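The classification flow described above (a fingerprint of 30 top terms per document, matched like a search query against category profiles of about 400 top terms) can be sketched as follows. WordRank itself is not public, so a simple TF-IDF weighting stands in for it. The names `term_weights` and `classify`, the two toy categories and the two training documents per category (rather than ten) are assumptions for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def term_weights(text, doc_freq, n_docs, n_terms):
    """Keep the n_terms heaviest words of `text` (TF-IDF stands in for WordRank)."""
    counts = Counter(tokenize(text))
    weights = {
        word: count * math.log((1 + n_docs) / (1 + doc_freq.get(word, 0)))
        for word, count in counts.items()
    }
    best = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:n_terms]
    return dict(best)


def classify(doc_terms, category_profiles):
    """Use the document's top terms as a query; rank categories by overlap score."""
    scores = {
        name: sum(wt * profile.get(term, 0.0) for term, wt in doc_terms.items())
        for name, profile in category_profiles.items()
    }
    return sorted(scores, key=lambda name: (-scores[name], name))


# Hypothetical training documents that exemplify each category.
training = {
    "weather": [
        "Hurricane Katrina brings storm winds and heavy rain to the coast",
        "The storm forecast warns of rain and flooding",
    ],
    "finance": [
        "Shares fell as the bank reported lower profits",
        "Investors sold shares after the bank cut its profits forecast",
    ],
}
all_docs = [doc for docs in training.values() for doc in docs]
doc_freq = Counter(w for doc in all_docs for w in set(tokenize(doc)))

# One profile of up to 400 top terms per category.
profiles = {
    name: term_weights(" ".join(docs), doc_freq, len(all_docs), 400)
    for name, docs in training.items()
}

# A new document is reduced to a 30-term fingerprint and matched like a query.
new_doc = "Heavy rain and storm winds are expected as the hurricane nears"
fingerprint = term_weights(new_doc, doc_freq, len(all_docs), 30)
ranked = classify(fingerprint, profiles)
```

Because each document is reduced to a small fingerprint before matching, the cost per document stays low, which is what makes this style of classification plausible at the scale of hundreds of millions of documents.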
XML Databases
It is puzzling why so many people use SQL to store standard text data. The relational query model of SQL is ideal for rapid manipulation of financial data or relationships that have many overlapping linkages. Yet most people use SQL to store names and addresses, news headlines or simple ASCII data that hardly changes. In the Web 2.0 world of webservices, XML is an ever more important data format, with nested data structures rather than just simple flat files. Yet people still use the familiar SQL, parsing data out of XML and into SQL, then out of SQL and back into XML. Why not keep your data in an XML store and focus on fast textual processing? Grapeshot was built especially as an XML database system: so try it out and do something more useful, and more scalable, than SQL for your modern webservices architecture!
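The point about avoiding the XML-to-SQL-to-XML round trip can be illustrated with a small sketch that keeps records in their nested XML form and searches them directly. It uses only Python's standard `xml.etree.ElementTree` module, not Grapeshot's own interface, and the feed contents, element names and `search` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: nested records kept in XML, never flattened into tables.
FEED = """<feed>
  <item id="1">
    <headline>Hurricane Katrina nears the Gulf coast</headline>
    <source><name>Example Wire</name><country>US</country></source>
  </item>
  <item id="2">
    <headline>Bank shares fall on profit warning</headline>
    <source><name>Example Markets</name><country>UK</country></source>
  </item>
</feed>"""


def search(root, word):
    """Return ids of items whose text, at any nesting depth, contains `word`."""
    word = word.lower()
    return [
        item.get("id")
        for item in root.iter("item")
        if word in " ".join(item.itertext()).lower()
    ]


root = ET.fromstring(FEED)

# Full-text search across the nested record.
hits = search(root, "katrina")

# Structured lookup on a nested field, with no schema mapping in between.
uk_items = [
    item.get("id")
    for item in root.iter("item")
    if item.findtext("source/country") == "UK"
]
```

Both the text search and the nested-field lookup work on the records as they arrive, and a matching `item` element can be serialised straight back out as XML.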
Arabic and Chinese languages
These days we all have to speak many languages. The world is changing, and Arabic and Chinese are significant language barriers that must be broken down. Grapeshot delivers intelligent search algorithms for 12 major European languages as well as Chinese, Arabic, Korean and Japanese. Classification, personalisation and automatic recommendation of related words can all be done across these languages without the need to prepare thesauri or costly, language-specific semantic webs.
- See Grapeshot underpinning a multilingual Dooodle search at http://www.dooodle.com/



