WordRank Algorithms

Grapeshot continuously calculates the value of words
WordRank deciphers concepts and builds user profiles. Every word has a weight - and it changes for each user.
The use of probabilistic models in the Grapeshot algorithms is vital to provide an adaptive learning capacity.
WordRank
Word weights help to establish the N most significant terms per document, and serve as a primary indicator of the concepts in a document, which favours high-precision search over merely high recall. More significantly, when a user looks at one document, the weights of all words in that document change, but only for that one user. For a given audience of, say, 1 million users, there is therefore a spectrum of word weights attached to any one word across the whole population of users. The same word "ipod", for example, can have different weights for different users, which starts to provide a personal sense of relevance.
Concept Clouds and User Profiles
Grapeshot can log the word weights for each user. These act as a "concept cloud" around the user and help to favour certain new documents that arrive in a publisher's content management system. If I am interested in "iTunes" and "ipods", and those words are already weighted highly in my personal profile, then any new data that includes "ipods" (as a text word, or audio/video sequence token) will be routed to me rather than to other users whose profiles do not carry such high personalized weights. Note that profiles can be hidden (implicit) or visible to the user (explicit).
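The routing idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `profile_score` and `route`, the profile layout (a dictionary of word weights per user) and the additive scoring rule are all hypothetical, not Grapeshot's actual implementation.

```python
from typing import Dict, List


def profile_score(doc_terms: List[str], profile: Dict[str, float]) -> float:
    """Sum the user's personal weights for the distinct terms in a document.

    Terms missing from the profile contribute nothing (assumed default of 0.0).
    """
    return sum(profile.get(term, 0.0) for term in set(doc_terms))


def route(doc_terms: List[str], profiles: Dict[str, Dict[str, float]]) -> List[str]:
    """Return user ids ordered from best to worst match for the document."""
    scores = {user: profile_score(doc_terms, p) for user, p in profiles.items()}
    return sorted(scores, key=lambda user: scores[user], reverse=True)


# Hypothetical concept clouds: alice has consumed iPod/iTunes content, bob has not.
profiles = {
    "alice": {"ipod": 2.5, "itunes": 1.75},
    "bob": {"sailing": 3.0, "ipod": 0.25},
}
new_doc = ["new", "ipod", "released", "with", "itunes", "update"]
ranking = route(new_doc, profiles)  # alice is listed ahead of bob
```

A real system would also need to decide how far down the ranking a document is delivered; that threshold is left out of this sketch.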
Grapeshot's unique "WordRank" algorithms can attribute variable word weights to each and every user.
With a spectrum of per-user word weights for any given word, Grapeshot can direct advertisements or content to the users in the top 10% or 25% of that spectrum. This offers enhanced targeting (for adserving and business intelligence) or a personalized content experience (for the publisher's "know your customer and audience" agenda).
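As a rough illustration of targeting the top slice of the spectrum, the sketch below selects the users whose weight for one word falls in the top fraction of the audience. The function name `top_fraction`, the sample weights and the rounding rule (round up, at least one user) are assumptions made for the example.

```python
import math
from typing import Dict, List


def top_fraction(weights: Dict[str, float], fraction: float) -> List[str]:
    """Return the users whose weight for a word is in the top `fraction` of the audience."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the range (0, 1]")
    ranked = sorted(weights, key=lambda user: weights[user], reverse=True)
    # Round up so that a small audience still yields at least one user.
    k = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:k]


# Hypothetical per-user weights for the single word "ipod".
ipod_weights = {
    "u1": 0.9, "u2": 0.1, "u3": 0.4, "u4": 2.3,
    "u5": 0.0, "u6": 1.2, "u7": 0.3, "u8": 0.2,
}
top_quarter = top_fraction(ipod_weights, 0.25)  # the two highest-weighted users
top_tenth = top_fraction(ipod_weights, 0.10)    # the single highest-weighted user
```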
Significance of WordRank
Grapeshot ranks words, not documents - and this paradigm means Grapeshot can unlock a host of features:
- Rank the words in a document at indexing time (creates Term Profile)
- Rank words for an individual user based on navigation (inputs to User Profile)
- Absolute summarization - cut a document down to any size
- Use a document as a query - for categorization, real-time alerting and typing-free Dooodle Pad functionality
- Query expansion - suggest other "folksonomy-like" terms to the user
- Document clustering - aids exploration of a corpus of results (see Dooodle Joystick)
- Meta-Tagging - use word ranking to suggest keywords
- Document classification - calculate the best categories based on best terms correlation
There are two fundamental Grapeshot USPs:
- 1) Grapeshot ranks words, not documents (variable weights isolate significant content and users)
- 2) Grapeshot has a small code footprint (speed and performance software engineering)
Getting Into The Algorithms
To understand search, and how it works, one needs to get behind the scenes and into the algorithms themselves.
Two Cambridge University academics who have had a major impact on probabilistic information retrieval are Karen Spärck Jones and Stephen Robertson.
Both were part of the cadre of researchers working together with Dr. Martin Porter at the Cambridge Computing Lab in the 80s.
Karen taught Information Retrieval at Cambridge for many years and contributed hugely to the TREC benchmark, whilst Stephen now leads on search at the Microsoft Research Labs.
Grapeshot draws on the latest BM25 algorithms developed in the last five years, and has created its own WordRank algorithms for fast, robust indexing and search systems.
Probabilistic Information Retrieval and the Heritage of WordRank
Using probability mathematics to improve search precision and recall is not a new idea. Cambridge University academics have worked on this subject since the 1970s, applying Bayesian techniques to relevance feedback to improve the quality of search results for a user, long before the rise of the internet.

Most search technology uses word frequencies to rank documents

Grapeshot ranks each word, and changes the word's weight throughout user interactions
At one time probabilistic techniques were used to counter the impracticalities of Boolean retrieval, where people use an AND or an OR in their search request. Although SQL and other relational databases still have a rigid query syntax, the more familiar search these days uses relevance ranking to determine which documents in the search result (obtained via an AND or an OR expression) are in fact more relevant.
Verity and other American search engines, funded through early CIA or US Government sources during the 1980s, used the academic ideas of Gerry Salton and his SMART model. In essence a relevant document was one where the search words appeared more frequently. Of course some documents are longer than others, so a formula had to embrace document length to get a sense of overall word density. But essentially the pattern of words inside a document led to the document itself getting a relevance score.
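The word-density idea can be shown with a deliberately simple score: the number of query-word occurrences divided by the document length. This is a minimal sketch of the principle only, not the SMART weighting scheme or Verity's formula; the name `density_score` and the sample documents are invented for the example.

```python
from collections import Counter
from typing import List


def density_score(query_terms: List[str], doc_tokens: List[str]) -> float:
    """Occurrences of the query terms divided by document length."""
    if not doc_tokens:
        return 0.0
    counts = Counter(doc_tokens)
    return sum(counts[term] for term in query_terms) / len(doc_tokens)


short_doc = "apollo mission to the moon".split()            # 5 tokens, 1 hit
long_doc = ("apollo " + "filler " * 18 + "apollo").split()  # 20 tokens, 2 hits

short_score = density_score(["apollo"], short_doc)
long_score = density_score(["apollo"], long_doc)
# The short document scores higher even though the long one has more raw hits.
```

Without the division by length, the longer document would win simply by containing more words.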
Verity then extended the model with Topics - the idea being that the one word Apollo could have Neil Armstrong, moon or mission as related words. This makes sense if you are building a thesaurus or topic tree of related words: in essence an early version of the value-add information science work that goes on today, building predefined relationships between words as a thesaurus or semantic web.
The probabilistic model allows words to be weighted, not just documents. The probability of a word being significant is based initially on its distribution through the corpus. It means the word "lawyer" in a legal dataset is not as significant as the word "lawyer" in a boating magazine - purely based on the skew from a "normal" distribution of occurrence.
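One common way to turn corpus distribution into a word weight is inverse document frequency (IDF), used here as a stand-in because the text does not give the exact WordRank formula. The sketch below uses a smoothed IDF on two tiny invented corpora; the function name `idf` and the add-one smoothing are assumptions.

```python
import math
from typing import List


def idf(term: str, corpus: List[List[str]]) -> float:
    """Smoothed inverse document frequency of a term in a corpus.

    The add-one smoothing keeps the weight finite when the term is absent.
    """
    n_docs = len(corpus)
    n_with_term = sum(1 for doc in corpus if term in doc)
    return math.log((n_docs + 1) / (n_with_term + 1))


# Invented corpora: "lawyer" is common in one and rare in the other.
legal = [["lawyer", "contract"], ["lawyer", "court"],
         ["lawyer", "appeal"], ["statute", "court"]]
boating = [["sail", "keel"], ["lawyer", "salvage"],
           ["mast", "rigging"], ["hull", "keel"]]

weight_in_legal = idf("lawyer", legal)      # low: the word is everywhere
weight_in_boating = idf("lawyer", boating)  # higher: the word is unusual here
```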
However, things get exciting when an individual user navigates through a set of documents. This smaller corpus of documents visited, read or marked as interesting by the user will have a different distribution for the word "lawyer" compared with either the legal database or the boating magazine. If we model probability across the user's own document footprint and the corpus being queried at large, then a word can have a unique weight for that one user.
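A well-known formula that combines a user's document footprint with corpus-wide statistics is the Robertson/Spärck Jones relevance weight. It is shown here as one plausible illustration of a per-user word weight; the text does not say that WordRank uses this exact formula, and the counts below are invented.

```python
import math


def rsj_weight(N: int, n: int, R: int, r: int) -> float:
    """Robertson/Sparck Jones relevance weight with the usual 0.5 smoothing.

    N: documents in the corpus         n: corpus documents containing the term
    R: documents in the user footprint r: footprint documents containing the term
    """
    if not (0 <= r <= R and r <= n <= N and R - r <= N - n):
        raise ValueError("inconsistent document counts")
    numerator = (r + 0.5) * (N - n - R + r + 0.5)
    denominator = (n - r + 0.5) * (R - r + 0.5)
    return math.log(numerator / denominator)


# Same corpus and the same word "lawyer" (50 of 1000 documents), three users.
w_no_history = rsj_weight(N=1000, n=50, R=0, r=0)   # no footprint: IDF-like baseline
w_legal_fan = rsj_weight(N=1000, n=50, R=10, r=8)   # 8 of 10 viewed docs mention it
w_uninterested = rsj_weight(N=1000, n=50, R=10, r=0)  # none of 10 viewed docs mention it
```

With no footprint the weight reduces to a corpus-only value; as the user reads documents, the weight for that user rises or falls away from that baseline.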
Grapeshot uses the probability mathematics to get a weight behind each search term. The benefits are:
- Ability to suggest words related to the one word used by the searcher - with no need to pre-create topics or thesauri
- Ability to accept search queries with 200 words or more, as the software can determine easily which are the more important words
- Ability to adjust the word weights for each user, based on their own pattern of document consumption
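To illustrate how a long query can be reduced to its more important words, the sketch below keeps the k highest-weighted distinct terms. The function name `top_query_terms`, the sample weights and the tie-breaking rule (alphabetical) are assumptions; the weights could come from the corpus, from a user profile, or both.

```python
from typing import Dict, List


def top_query_terms(query_tokens: List[str], weights: Dict[str, float], k: int) -> List[str]:
    """Return the k distinct query terms with the highest weights.

    Unknown terms are given a weight of 0.0; ties are broken alphabetically.
    """
    distinct = set(query_tokens)
    ranked = sorted(distinct, key=lambda term: (-weights.get(term, 0.0), term))
    return ranked[:k]


# Hypothetical word weights; function words carry almost no weight.
weights = {"the": 0.01, "of": 0.02, "lawyer": 2.1, "salvage": 3.4, "yacht": 2.8}
query = "the lawyer of the yacht discussed the salvage of the yacht".split()
key_terms = top_query_terms(query, weights, k=3)
```

The same selection step applies when a whole document is used as the query.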
Herein lies the distinction from the more usual search engine systems that rank documents. Grapeshot can uniquely rank words and modify those weights in real time based on user interactions.




