Pattern | CLiPS
Pattern is a web mining module for the Python programming language. It bundles tools for data retrieval (Google + Twitter + Wikipedia API, web spider, HTML DOM parser), text analysis (rule-based shallow parser, WordNet interface, syntactical + semantical n-gram search algorithm, tf-idf + cosine similarity + LSA metrics) and data visualization (graph networks).
Shelfari
Shelfari is a community-powered encyclopedia for book lovers. Create a virtual bookshelf, discover new books, connect with friends and learn more about your favorite books – all for free.
LibraryThing
A home for your books. Enter what you’re reading or your whole library. It’s an easy, library-quality catalog.
BookRabbit: a social network for books. Bringing your ...
Another book community.
Goodreads: Book Reviews and Recommendations
Have you ever wanted a better way to: Get great book recommendations from people you know. Keep track of what you've read and what you'd like to read. Form a book club, answer book trivia, collect your favorite quotes.
PDFMiner
PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner lets you obtain the exact location of text on a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.
tf–idf - Wikipedia, the free encyclopedia
The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. One of the simplest ranking functions is computed by summing the tf-idf for each query term; many more sophisticated ranking functions are variants of this simple model.
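A minimal sketch of the weighting and the simple ranking function described above, in plain Python. It assumes one common variant (term frequency as count divided by document length, inverse document frequency as the natural log of N over document frequency); the function names `tf_idf` and `score` and the three toy documents are illustrative and do not come from the bookmarked page.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # tf (relative frequency in this document) times idf.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def score(query, doc_weights):
    """Sum the tf-idf weights of the query terms found in one document."""
    return sum(doc_weights.get(term, 0.0) for term in query)


# Toy corpus, made up for illustration.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
weights = tf_idf(docs)
query = ["cat", "mat"]
# Document indices ordered from most to least relevant to the query.
ranking = sorted(range(len(docs)),
                 key=lambda i: score(query, weights[i]),
                 reverse=True)
```

In this toy corpus "the" appears in every document, so its weight is zero everywhere, while "mat" appears in only one document and dominates that document's score for the query.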
Markov chain - Wikipedia, the free encyclopedia
A Markov chain, named for Andrey Markov, is a mathematical system that undergoes transitions from one state to another (from a finite or countable number of possible states) in a chainlike manner. It is a random process endowed with the Markov property: the next state depends only on the current state and not on the past. Markov chains have many applications as statistical models of real-world processes.
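The Markov property can be illustrated with a small sketch in Python. The two-state weather model, its probabilities, and the names `TRANSITIONS`, `simulate`, and `step_distribution` are hypothetical and chosen only for illustration; the point is that the next state is drawn using the current state alone.

```python
import random

# Hypothetical transition probabilities: TRANSITIONS[a][b] is the
# probability of moving from state a to state b. Each row sums to 1.
TRANSITIONS = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}


def next_state(current, rng):
    """Draw the next state; only the current state is consulted."""
    states = list(TRANSITIONS[current])
    probs = [TRANSITIONS[current][s] for s in states]
    return rng.choices(states, weights=probs, k=1)[0]


def simulate(start, steps, seed=None):
    """Return a random path of steps + 1 states beginning at start."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        path.append(next_state(path[-1], rng))
    return path


def step_distribution(dist):
    """Advance a probability distribution over states by one transition."""
    new = {state: 0.0 for state in TRANSITIONS}
    for state, p in dist.items():
        for target, q in TRANSITIONS[state].items():
            new[target] += p * q
    return new


path = simulate("sunny", 10, seed=0)
after_one = step_distribution({"sunny": 1.0, "rainy": 0.0})
```

`simulate` samples one possible trajectory, while `step_distribution` tracks the probability of being in each state after a transition, which is the view usually taken when a chain is used as a statistical model.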
LingPipe: Significant Phrases Tutorial
LingPipe provides a simple way to find statistically significant phrases in a document collection. There are two types of significance of interest.
Cinemetrics
This project is about visualizing movie data to reveal the characteristics of movies and to make them formally comparable.