Pattern | CLiPS

Pattern is a web mining module for the Python programming language. It bundles tools for data retrieval (Google + Twitter + Wikipedia API, web spider, HTML DOM parser), text analysis (rule-based shallow parser, WordNet interface, syntactical + semantical n-gram search algorithm, tf-idf + cosine similarity + LSA metrics) and data visualization (graph networks).

Shelfari

Shelfari is a community-powered encyclopedia for book lovers. Create a virtual bookshelf, discover new books, connect with friends and learn more about your favorite books – all for free.

LibraryThing

A home for your books. Enter what you’re reading or your whole library. It’s an easy, library-quality catalog.

Goodreads: Book Reviews and Recommendations

Have you ever wanted a better way to: Get great book recommendations from people you know. Keep track of what you've read and what you'd like to read. Form a book club, answer book trivia, collect your favorite quotes.

PDFMiner

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner lets you obtain the exact location of text on a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.

tf–idf - Wikipedia, the free encyclopedia

The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. One of the simplest ranking functions is computed by summing the tf-idf for each query term; many more sophisticated ranking functions are variants of this simple model.
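The simple ranking function described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`tf_idf`, `score`, `rank`) are hypothetical, documents are assumed to be pre-tokenized lists of words, term frequency is the raw count divided by document length, and idf is the natural log of (number of documents / number of documents containing the term). Other tf-idf variants use different normalizations.

```python
import math
from collections import Counter


def tf_idf(term, doc, corpus):
    """tf-idf of `term` in the tokenized `doc`, relative to `corpus`.

    Assumed variant: tf = count / document length,
    idf = ln(N / number of documents containing the term).
    """
    if not doc:
        return 0.0
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        # A term that appears nowhere in the corpus contributes nothing.
        return 0.0
    idf = math.log(len(corpus) / df)
    return tf * idf


def score(query, doc, corpus):
    """Relevance of `doc` to `query`: the sum of tf-idf over the query terms."""
    return sum(tf_idf(term, doc, corpus) for term in query)


def rank(query, corpus):
    """Indices of the documents in `corpus`, most relevant first."""
    return sorted(
        range(len(corpus)),
        key=lambda i: score(query, corpus[i], corpus),
        reverse=True,
    )


corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "and", "the", "cat"],
]
ranking = rank(["cat"], corpus)
```

Note how a word that occurs in every document ("the" in this toy corpus) gets an idf of zero, which is the offsetting effect the description mentions.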

Markov chain - Wikipedia, the free encyclopedia

A Markov chain, named for Andrey Markov, is a mathematical system that undergoes transitions from one state to another (from a finite or countable number of possible states) in a chainlike manner. It is a random process endowed with the Markov property: the next state depends only on the current state and not on the past. Markov chains have many applications as statistical models of real-world processes.
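The Markov property can be made concrete with a small simulation. The sketch below is illustrative only: the two-state weather table `TRANSITIONS`, its probabilities, and the helper names `next_state` and `simulate` are invented for the example. The point is that `next_state` looks only at the current state, never at the earlier path.

```python
import random

# Hypothetical transition table: TRANSITIONS[current][next] = probability.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def next_state(current, transitions, rng):
    """Draw the next state using only the current state (Markov property)."""
    states = list(transitions[current])
    weights = [transitions[current][s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]


def simulate(start, steps, transitions, seed=None):
    """Return a path of `steps` transitions beginning at `start`."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        path.append(next_state(path[-1], transitions, rng))
    return path


path = simulate("sunny", 10, TRANSITIONS, seed=0)
```

Passing a fixed `seed` makes the run reproducible; leaving it as `None` gives a different path on each call.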

LingPipe: Significant Phrases Tutorial

LingPipe provides a simple way to find statistically significant phrases in a document collection. There are two types of significance of interest.

Cinemetrics

This project is about visualizing movie data to reveal the characteristics of movies and to make them formally comparable.

HALLO PIET