05-06-2011
0
3 words
01-06-2011
0
54 words

Fancy shape word clouds

Tagxedo turns words — famous speeches, news articles, slogans and themes, even your love letters — into a visually stunning word cloud, words individually sized appropriately to highlight the frequencies of occurrence within the body of text.

Yeah, visually stunning. Who wouldn’t want to turn an Obama speech into a banana-shaped word cloud?

03-05-2011
0
13 words

The tablet newspaper

I have no idea if it’s genuine. But the idea is spot on!

19-04-2011
0
84 words

Book repository

Google may be digitizing every book, but the University of Chicago has decided to keep the physical copies too. They’re building an enormous underground robotized book repository that will store all of their 3.5 million books.

The Joe and Rika Mansueto Library’s ASRS will shelve materials underground by size rather than library classification, in racks 50 feet high, with a capacity to hold 3.5 million volumes in one-seventh of the space of conventional shelves.

Just take a look at the pictures to be amazed by this project.




Read more about the project here.

07-04-2011
0
390 words

Content level

My previous post featured a layered setup of a book. The most general level, within these layers, I referred to as the content level.

The content level holds a lot of information, much of which is irrelevant or too detailed to put into a summary. To choose the right information to display, I will approach the content level from a user experience angle. What does the user want to know? A quick survey among readers led me to generalize the most important cornerstones into three categories: context, critics and identity.

Context
The context is about the book’s contents and its immediate surroundings (in the public information bubble). It should feature information about related/similar books, the key subject(s) of the book and (last but not least) the salient subject(s) of the book. Salient subjects define why this book is special within its context (added value).

Critics
Apart from wanting to know what the book is about, future readers would also like to know what other people think of it. Critical acclaim is an important factor, but decisions are not based on professional opinions alone (hence the popularity of reviews at amazon.com & bol.com). News, reviews and time intertwine within this block.

Identity
The two previous categories still lack one thing: the connection with the user, the reader. Users want to read reviews that fit their mindset, and they’d like to see books within their field of interest (and not just a book that someone unknown has read too). Adding specific user information to the model helps users relate to the book (and hopefully makes the eventual buying decision a little bit easier). In short, the identity category bridges the context and critics of the book to the user’s personal liking.

The three essential blocks described above relate to a book-level perspective. When descending one story, entering the chapter level, I noticed the three blocks still persist. A chapter shows features similar to those of a book. It is likely to be more focused concerning content, but the content itself is (still) too large to do without generalization. That is why I clustered both layers in the content level. My approach to the content of the book and chapter layers will therefore be similar (context, critics and identity).

04-04-2011
0
379 words

Layering a book

To get a grip on the amount of information in a book, I defined the build-up layers of a book. The top down approach would be: book, chapter, paragraph, sentence, phrase, word, syllable, character.

These layers all have their own specification, but some are also very similar. To generalize these clusters of layers I defined four levels:

  • Constructor level
    The constructor level consists of characters and syllables. These are the blocks that are used to create words and phrases. Taken literally, they do not have much to tell a reader, but they are very important for the automatic analysis of a text.
  • Emotion level
    Within the emotion level, things start to get meaning. A word or phrase can have a semantic value. Some words change their semantic orientation based on the words that precede or follow them. Still, this level is more important for automatic analysis than it is to the reader.
  • Quote level
    For a reader, this is the most interesting level. It is the most detailed level a reader still cares about. This is where phrases come together in sentences and paragraphs which, in turn, can be interpreted by the reader. It is called the quote level since this is often the level readers refer to when they’re explaining their thoughts on the text.
  • Content level
    The content level is the most general level as it combines all paragraphs into a coherent picture. It uses several steps of interpretation (or for a computer, aggregation) to come to this generalized image of text.
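To make the layering concrete, here is a minimal sketch in Python. The mapping follows the four levels listed above, with book and chapter both assigned to the content level (an assumption on my side for this sketch). The function `split_layers` and its naive splitting rules are hypothetical helpers for illustration, not part of any existing tool.

```python
import re

# Layers, top down, mapped to the four levels described above.
# Assumption: book and chapter both belong to the content level.
LAYER_TO_LEVEL = {
    "book": "content",
    "chapter": "content",
    "paragraph": "quote",
    "sentence": "quote",
    "phrase": "emotion",
    "word": "emotion",
    "syllable": "constructor",
    "character": "constructor",
}


def split_layers(text):
    """Split plain text into paragraphs > sentences > words (naive rules)."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    result = []
    for paragraph in paragraphs:
        # A sentence ends at ., ! or ? followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        result.append([re.findall(r"[A-Za-z']+", s) for s in sentences])
    return result


layers = split_layers("One two. Three!\n\nFour five six.")
```

The nested lists mirror the quote and emotion levels: each paragraph holds sentences, each sentence holds words. Syllables and phrases would need a real linguistic tool and are left out here.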

The user interest barrier is a division I based upon the deepest level readers are interested in. In other words, this is the level where stuff starts to make sense to a reader.

Creating these levels helps me define a scope within the vast amount of information in a book. Within every level, I’m experimenting with several visual outputs that represent the amount of information it contains. These experiments are the building blocks for the eventual visual language that will hold information from all the defined levels within a book.

These levels, in turn, can be divided into several groups that define what each level should represent (from a user experience perspective). More on this later (enough theory for now, or at least for today).

25-03-2011
0
109 words

Ideological extremity on Twitter

One of the resources I’d like to use for determining user preference is Twitter. Twitter users post a lot of subjective data that can be used to create an identity that represents a user’s principles.

David B. Sparks had a similar thought. He used people’s Twitter behavior to estimate the political ideology of Senators, Challengers, Media and others.

There are no details yet on how exactly he estimated this ideology based on Twitter behavior, but he’ll be posting a paper about it soon. In the meantime, I advise you to take a look at his results and decide what you think about this approach. (To me, things look promising.)

24-03-2011
0
159 words

Think quarterly

It has been silent around these parts of The Internet. I want to apologize for that. During the last two weeks I moved from research to the design part of my project. This means visual output is being created but as you may know, it’s always a difficult phase to go through. Slowly the boundaries of my model are shaping up. Coming weeks will be sweet for those of you who like to see (fancy and less fancy but nevertheless) pictures.

In the meantime we’re reading Think Quarterly. Google regularly updates their partners on what they’re up to lately. On this occasion they published a book/magazine on the subject of data.

Our first issue is dedicated to Data – amongst a morass of information, how can you find the magic metrics that will help transform your business? We hope that you find inspiration, insights, and more, in Think Quarterly.

You can read the book/magazine here (thinkquarterly.co.uk).

08-03-2011
0
128 words

Movie barcodes

The idea behind movie barcodes is pretty simple. Take every frame of a movie, squeeze it into a thin slice and put these slices next to each other, et c’est ça. Simple as it is, the results capture the mood of each film quite nicely.
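The squeezing step can be sketched in a few lines of Python. This is only an illustration under my own assumptions: frames are given as rows of (r, g, b) tuples, and each frame is squeezed to a slice one pixel wide by averaging every row. The function names are hypothetical, and I do not know how the original barcodes were actually produced.

```python
def squeeze_frame(frame):
    """Squeeze one frame to a one-pixel-wide slice by averaging each row."""
    squeezed = []
    for row in frame:
        count = len(row)
        squeezed.append(
            tuple(sum(pixel[channel] for pixel in row) // count for channel in range(3))
        )
    return squeezed


def movie_barcode(frames):
    """Put the slices next to each other: barcode[y][x] is row y of frame x."""
    slices = [squeeze_frame(frame) for frame in frames]
    height = len(slices[0])
    return [[frame_slice[y] for frame_slice in slices] for y in range(height)]


# Two tiny 2x2 "frames" as stand-ins for real movie frames.
frame_a = [[(255, 0, 0), (255, 0, 0)], [(0, 0, 0), (0, 0, 0)]]
frame_b = [[(0, 0, 255), (0, 0, 1)], [(10, 10, 10), (30, 30, 30)]]
barcode = movie_barcode([frame_a, frame_b])
```

Averaging per row keeps the vertical color distribution of a frame, which is why a barcode still hints at the composition of a scene and not just its overall tint.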

Figure 1. Kung Fu Panda

Figure 2. A Single Man

Figure 3. Slumdog Millionaire

Within my own model I want to try something similar: use statistical book data (word count etc.) to define the main shape of the visualization. This main shape will eventually be filled in with semantic data, context data and user data. I’ll be describing this model in the coming days.

First, let’s enjoy the nice collection of movie barcodes.

28-02-2011
0
147 words

Future of the book

IDEO NY made a short film about their vision of the future of the (tablet) book. It’s from September 2010. I’ve seen it before but somehow it didn’t end up on this blog (mea maxima culpa). It does now!

Meet Nelson, Coupland, and Alice — the faces of tomorrow’s book. Watch global design and innovation consultancy IDEO’s vision for the future of the book. What new experiences might be created by linking diverse discussions, what additional value could be created by connected readers to one another, and what innovative ways we might use to tell our favorite stories and build community around books?

The Nelson concept is especially interesting. It is however fully dependent on user input data, which, in my vision of the future, will be enriched with automatically generated semantic data from the book itself.

Editorial note: I changed the vertical scrolling lay-out. Yes, I surrender.

21-02-2011
0
134 words

Florarium Temporum

Dimitri was kind enough to introduce me to Nicolaas Clopper jr. Nicolaas’s dad, Nicolaas Clopper sr., assigned him the task of writing a chronicle that would give insight into all the world’s information. In 1472 Clopper jr. finished his chronicle of the world, called Florarium Temporum (in Dutch: Bloemhof der Tijden).

Instead of just writing it down, Clopper jr. found a way to rasterize and index the information. The book is equipped with a horizontal timeline that runs through the complete book. Every subgenre has its own timeline. This made the 300+ page book easy to navigate. Besides the timelines, he added a register at the back to look things up quickly.

What I did not know is that the complete chronicle is available online at the Digitale Bibliothek München. Definitely worth a look!

18-02-2011
0
392 words

Narrative identity

Yesterday I attended the Wireless Stories – New Media in Public Space – conference at the Stadsschouwburg Amsterdam.

The day kicked off with a theoretical talk by Michiel de Lange about wireless storytelling and publicness in the mobile city. I was especially interested in his part about narratives and the importance of storytelling.

Michiel’s vision on storytelling was based on the writings of French philosopher Paul Ricœur. Ricœur is best known for his work on humans having a narrative identity.

Figure 1. Paul Ricœur on the left.

The narrative identity is a construction of the ‘self’ which gives people direction, meaning and value to their lives. This identity helps people relate to the world, other people and themselves. Ricœur argues that this narrative identity is divided into an idem-identity and an ipse-identity. Idem is the physical constancy of the self over time (being one and the same person). Ipse refers to the perception of ourselves as a personal entity.

Ricœur uses this narrative identity as metaphor and as a means of how we create our identity. How? Ricœur defined three stages:

  1. Narrative pre-understanding: implicit knowledge of life; experience.
  2. Narrative emplotment: configuring a story in a plot that arranges all influencing elements into one coherent unity; expression.
  3. Narrative self-interpretation: reconfiguration of our identity due to interpretation of our own stories and those of others; reflection.

Ricœur states that our life will have coherence and meaning if we are able to discern and disentangle its stories. We are destined to interpret everything we encounter during life. Literature is the perfect ground to train and perfect these skills.

Put this in the perspective of this assignment and it opens a whole new world. What if we give every book its own narrative identity? A book has its own pre-understanding, namely its contents and the principles it is written on. A book should be conscious of its own existence. If so, it could find a way to interpret, plot and reflect on what happens to it (reviews, comments et cetera). Reflecting its own identity on these events leads to a unique, dynamic narrative identity for every book.

Time to read more on Ricœur.

More on Ricœur at plato.stanford.edu
Lecture slides Michiel de Lange
Article on Paul Ricœur’s Temps et récit (I-III) in NRC Handelsblad 10-2-1995 (in Dutch)

11-02-2011
0
494 words

Twitter readability

Back to readability again! In previous posts I’ve tested readability on my own blog. It shines a light on my terrible English writing, but does not compare it to, let’s say, Walter Aprile’s pochedespès. I wouldn’t even dare to challenge his English writing, that’s a lost battle anyway, so I tried comparing Twitter user streams. My own Twitter stream isn’t of much use since:

A. it’s in Dutch and;
B. I don’t use it that often (not at all actually).

That rules out my account. Mister Aprile is a regular Twitter user though, so why not use his account as input? I made two little tests. The first one extracts a specified number of tweets from the queried user’s account and parses them, together with a graph that shows the readability per tweet. This can be done side-by-side with a second user account to see the results simultaneously. For this test, I chose John Maeda as Walter Aprile’s opponent.

The resulting graphs look quite spiky. FORCAST performs pretty stably, as it is made for short text analysis (150-word samples). The other formulas are made for more thorough analysis. They might need more text to be really useful (more on this in a bit). Another thing is that the Coleman-Liau Index seems to differ a lot from all the other advanced formulas (which cluster together nice and cosy).

On to step two. The real comparison. What if we accumulate all the scores from the previously shown graph? This still doesn’t really solve the small text sample problem (I’ll come to that later) but in a battle, in the end, all we want to know is who’s winning.

This puts things into perspective, and so far things look pretty good for Walter. Maeda’s and Aprile’s Twitter streams are, concerning readability, almost identical. This is however based on the accumulation of scores, ergo still working with small text samples. Experiment three joins all the tweets into one long text before calculating a readability index. My hypothesis was that the scores should be more realistic and that the differences in scores (between Twitter users) might change radically.

As can be seen from the above image, the radical change did not occur. Scores did change a little and the difference between the two users is a little bit more significant (a small disadvantage for Mr. Baffo), but all in all, the readability scores of both users fluctuate around the same grade levels. It seems the more complex readability indexes (compared to FORCAST) work pretty well on both short and long text samples.
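The difference between experiment two and experiment three can be sketched as follows. This is a hedged illustration, not the code behind the graphs: `words_per_sentence` is a deliberately simple stand-in score (not one of the real readability indexes), "accumulating" is interpreted here as averaging, and all function names are hypothetical.

```python
import re


def words_per_sentence(text):
    """Stand-in score: average sentence length in words (not a real index)."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(words) / len(sentences)


def mean_per_tweet(tweets, score):
    """Experiment two: score every tweet on its own, then average the scores."""
    return sum(score(tweet) for tweet in tweets) / len(tweets)


def score_concatenated(tweets, score):
    """Experiment three: join all tweets into one long text, then score once."""
    return score(" ".join(tweets))


tweets = ["One two three. Four five.", "Six."]
per_tweet = mean_per_tweet(tweets, words_per_sentence)
joined = score_concatenated(tweets, words_per_sentence)
```

With these two toy tweets the averaged score is 1.75 and the joined score is 2.0: short tweets weigh as much as long ones when averaging, while joining weighs every sentence equally. One caveat of joining is that tweets without closing punctuation run into the next tweet.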

Note 1: Dear Walter, if you are reading this, I thought John Maeda would be a worthy opponent. Justin Bieber came across my mind first, but reading his tweets somehow made me change my mind.

Note 2: All examples are available upon request. I keep the URLs hidden to avoid abuse (too many Twitter API calls result in a ban, which we do not want!).

09-02-2011
0
105 words

Strata conference

Last week (1-3 February) O’Reilly hosted a conference on data. It featured talks about collecting, storing, organizing, analyzing and publishing data. There were a lot of interesting speakers. Some of the talks are published on YouTube (check out the playlist here). I liked the talk by Werner Vogels of amazon.com (on how he thinks about big data, and that you can actually FedEx your data to the Amazon cloud) and a general talk by Hilary Mason from bit.ly on the importance of data. The latter featured some interesting and funny examples of large datasets.

The playlist with all the talks on YouTube.

08-02-2011
0
101 words

Readability continued(ed)

As one of the respected readers of this blog pointed out, all the previously mentioned formulas start making sense within a context.

I set up a page that calculates and lists all readability scores and text statistics of this blog. This resulted in a neatly organized table with a lot of numbers. To make things more understandable (and to fully meet Koen’s wishes) I graphed them.

FORCAST, which uses only the single-syllable words in its formula, differs radically from its colleagues. The other formulas create a somewhat similar hilly landscape (although some margins are quite large).

See it for yourself!

03-02-2011
4
150 words

Readability continued

The previous post listed a lot of formulas. They sure look nice, but things start to get interesting when you can see the actual results of the described formulas.

The test version I created does not include LIX, FORCAST, Dale-Chall and Spache yet. They need a little more tweaking before I can use them as a proper comparison tool for the other tests.

Figure 1. Readability test with (again) the column of Don Norman.

The indexed grades vary from 11.4 up to 15.8, meaning a difference of four grades. That is quite a big gap if I want to use this information as input for my visual model. I might need to think of a way to average the results realistically or use only one method instead of all.

When the tool is fully finished I will embed it in this blog too. Although maybe I won’t if all the grades come out low.

01-02-2011
0
498 words

Readability

Apart from the semantic text data I am trying to extract from text, there is the normal statistical layer of information, providing the total number of words, sentences, length of sentences et cetera. These text statistics can be used to define the grade of a text’s readability.

read•a•ble
adjective
(of a text, script, or code) able to be read or deciphered; legible.

Source: New Oxford American Dictionary

Until halfway through the last century, educators and publishers defined text difficulty manually. Grading text that way is far from objective and difficult even for a skilled reader. More objective ways to define readability were more than welcome. This resulted in several formulas that measure the readability of a text, mostly expressed as the US grade level needed to comprehend a piece of text. These formulas can be split into three categories: word & sentence, complex words & syllables and graph formulas.

1. Word & sentence formulas

  • Automated Readability Index (ARI)
    ARI = 4.71*(characters/words)+0.5*(words/sentences)-21.43
  • Coleman-Liau Index (CLI)
    CLI = 0.0588*(average number of letters per 100 words)-0.296*(average number of sentences per 100 words)-15.8
    Based on a 100-word sample.
  • LIX

    LIX = (total words/total periods)+(long words*100)/total words
    Long words are words with more than 6 letters.

    LIX : Weight
    < 25 : Very easy
    25-34 : Easy
    35-44 : Moderate
    45-54 : Severe
    ≥ 55 : Very difficult
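The three word & sentence formulas are easy to try out. Below is a minimal Python sketch under stated assumptions: the tokenization is naive (letters only, sentences split on ., ! and ?), the sentence count stands in for the number of periods in LIX, and the Coleman-Liau averages are taken per 100 words. All function names are hypothetical.

```python
import re


def text_stats(text):
    """Collect the counts the three formulas need (naive tokenization)."""
    words = re.findall(r"[A-Za-z]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "words": len(words),
        "sentences": len(sentences),
        "letters": sum(len(word) for word in words),
        "long_words": sum(1 for word in words if len(word) > 6),
    }


def ari(stats):
    """Automated Readability Index."""
    return (4.71 * stats["letters"] / stats["words"]
            + 0.5 * stats["words"] / stats["sentences"] - 21.43)


def coleman_liau(stats):
    """Coleman-Liau Index, with letters and sentences per 100 words."""
    letters_per_100 = 100 * stats["letters"] / stats["words"]
    sentences_per_100 = 100 * stats["sentences"] / stats["words"]
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


def lix(stats):
    """LIX: average sentence length plus the percentage of long words."""
    return stats["words"] / stats["sentences"] + 100 * stats["long_words"] / stats["words"]


stats = text_stats("The cat sat. The dog barked loudly.")
scores = {"ARI": ari(stats), "CLI": coleman_liau(stats), "LIX": lix(stats)}
```

Very short, simple samples like this one give negative grade levels for ARI and CLI, which shows why these formulas want a decent amount of text before their output means anything.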

2. Complex words & syllable formulas

  • Dale-Chall readability index
    Score = 0.1579*(percentage of complex words)+0.0496*(average words per sentence)+3.6365
    Dale and Chall provided a list of 3000 simple words. A word that is not in that list is considered complex. The constant 3.6365 is only added when more than 5% of the words are complex.

    Score : Grade
    < 4.9 : < 4
    5.0-5.9 : 5-6
    6.0-6.9 : 7-8
    7.0-7.9 : 9-10
    8.0-8.9 : 11-12
    9.0-9.9 : 13-15 (college)
    > 10 : > 16 (college graduate)

  • Flesch reading ease
    FRES = 206.835-1.015*(total words/total sentences)-84.6*(total syllables/total words)
    Max is around 120 (which means, very easy to read).
  • Flesch-Kincaid grade level
    FKGL = 0.39*(total words/total sentences)+11.8*(total syllables/total words)-15.59
    Uses the same inputs as the Flesch reading ease, but expresses the result as a US grade level.
  • FORCAST
    FLG = 20-((number of single-syllable words)/10)
    The FORCAST formula uses the number of single-syllable words in a 150 words sample text.
  • Gunning Fog Index (GFI)
    GFI = 0.4*((words/sentences)+100*(complex words/words))
    Complex words are words with 3 or more syllables.
  • Simple Measure Of Gobbledygook (SMOG)
    SMOG = 1.043*sqrt(30*(total polysyllables/total sentences))+3.1291
    Polysyllables are words with 3 or more syllables.
  • Spache readability formula
    Spache original = (0.141*average sentence length)+(0.086*percentage of unique unfamiliar words)+0.839
    Spache refined = (0.121*average sentence length)+(0.082*percentage of unique unfamiliar words)+0.659

    Spache’s formula is intended for grading texts up to 4th grade and includes, just like Dale-Chall’s formula, a list of familiar words.
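The syllable-based formulas need a syllable counter, which is the hard part. The Python sketch below uses a rough heuristic (counting groups of vowels), so treat its output as an approximation. SMOG is written in its standard published form, with the constant 3.1291 added after the square root. Dale-Chall and Spache are left out because they need their word lists. All function names are hypothetical.

```python
import math
import re


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def syllable_stats(text):
    """Collect word, sentence and syllable counts (naive tokenization)."""
    words = re.findall(r"[A-Za-z]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    syllables = [count_syllables(word) for word in words]
    return {
        "words": len(words),
        "sentences": len(sentences),
        "syllables": sum(syllables),
        "polysyllables": sum(1 for n in syllables if n >= 3),
        "single_syllable": sum(1 for n in syllables if n == 1),
    }


def flesch_reading_ease(s):
    return 206.835 - 1.015 * s["words"] / s["sentences"] - 84.6 * s["syllables"] / s["words"]


def flesch_kincaid_grade(s):
    return 0.39 * s["words"] / s["sentences"] + 11.8 * s["syllables"] / s["words"] - 15.59


def gunning_fog(s):
    return 0.4 * (s["words"] / s["sentences"] + 100 * s["polysyllables"] / s["words"])


def smog(s):
    return 1.043 * math.sqrt(30 * s["polysyllables"] / s["sentences"]) + 3.1291


def forcast(s):
    # Only meaningful when the text is a 150-word sample, as the formula assumes.
    return 20 - s["single_syllable"] / 10


sample = syllable_stats("The cat sat on the mat. It was a beautiful morning.")
```

The vowel-group heuristic miscounts words with silent vowels (for example a trailing e), so a dictionary-based counter would be the better choice for serious use.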

3. Graphs

  • Fry readability graph

    1. Select 100 words
    2. Total sentences
    3. Total syllables
    4. Map it on the graph

  • Raygor Estimate Graph

    1. Select 100 words
    2. Total sentences (half sentences as 0.5)
    3. Total words containing more than 5 letters
    4. Map it on the graph

The outcomes of formulas like these might be interesting to include as an input stream for my visualization.

Note: There are two other, more advanced readability tools: Accelerated Reader ATOS and Lexile. Their formulas are secret and therefore excluded from the list above.

27-01-2011
0
259 words

POSSWN

As I mentioned earlier, I’ve received a license to use SentiWordNet (SWN) for my project. SWN is a lexical resource for opinion mining purposes. It contains around 200,000 words with positive, negative and objective sentiment scores. Scores range between 0 and 1 in steps of 0.125. The objectivity score is calculated from the positive and negative scores (obj = 1-(pos+neg)). Every word is linked with its WordNet key and a Part of Speech (POS) tag. To start playing with text and SWN, a POS tagger is needed to get the corresponding words out of SWN. It would be a shame if the adverb scores for good were used instead of the noun ones. I used a ported (C to PHP) Brill Tagger (Eric Brill 1993) that I found on GitHub to POS tag the input text. See the image for the result.

Figure 1. From left to right: Input, POS, SWN. Red is negative, green positive and blue objective. Input text by Don Norman.
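The lookup itself can be sketched like this (in Python here, purely for illustration). The mini lexicon is hypothetical: its entries and scores are made up and are not real SentiWordNet values, and the real resource is keyed by WordNet synsets. The sketch only shows why the POS tag matters and how the objective score follows from the other two.

```python
# Hypothetical mini lexicon in the spirit of SWN: (word, POS) -> (positive, negative).
# The scores are made up for illustration; they are NOT real SentiWordNet values.
LEXICON = {
    ("good", "adjective"): (0.75, 0.0),
    ("good", "noun"): (0.5, 0.0),
    ("bad", "adjective"): (0.0, 0.625),
    ("cold", "adjective"): (0.125, 0.25),
}


def sentiment(word, pos_tag):
    """Return (positive, negative, objective) for a POS-tagged word."""
    # Assumption of this sketch: words missing from the lexicon are fully objective.
    positive, negative = LEXICON.get((word.lower(), pos_tag), (0.0, 0.0))
    objective = 1.0 - (positive + negative)
    return positive, negative, objective


def dominant_label(word, pos_tag):
    """Pick the color for a word: its highest-scoring category."""
    positive, negative, objective = sentiment(word, pos_tag)
    scores = {"positive": positive, "negative": negative, "objective": objective}
    return max(scores, key=scores.get)


adjective_scores = sentiment("good", "adjective")
noun_scores = sentiment("good", "noun")
```

The same word gets different scores depending on its tag, so a wrong POS tag silently pulls the wrong row out of the lexicon.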

First thing you notice after seeing the output? Yes, it’s pretty blue.

What does that mean? – I need to know more about the context. Words can change in opinion radically when occupying a different context. Direct classification could work for some words (like good or bad) but for others (like warm or cold) it levels out the results.

Let’s turn it the other way around though. What if I scan through a text using positive and negative scores to evaluate its sentiment value instead of just adding up the hits in the SWN database?
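A minimal Python sketch of that difference, under my own assumptions: the lexicon is a plain dictionary from (word, POS) to (positive, negative) scores with made-up values, and the sentiment value is simply the summed positive scores minus the summed negative scores. Both function names are hypothetical.

```python
def count_hits(tagged_words, lexicon):
    """The naive approach: count how many tagged words occur in the lexicon."""
    return sum(1 for token in tagged_words if token in lexicon)


def sentiment_value(tagged_words, lexicon):
    """Sum positive and negative scores over the text and return the balance."""
    positive = sum(lexicon[token][0] for token in tagged_words if token in lexicon)
    negative = sum(lexicon[token][1] for token in tagged_words if token in lexicon)
    return positive - negative


# Made-up scores for illustration only, not real SentiWordNet values.
lexicon = {
    ("good", "adjective"): (0.75, 0.0),
    ("bad", "adjective"): (0.0, 0.625),
}
tagged = [("good", "adjective"), ("movie", "noun"), ("bad", "adjective"), ("good", "adjective")]
hits = count_hits(tagged, lexicon)
value = sentiment_value(tagged, lexicon)
```

Counting hits only says that three opinion words were found; the summed scores say the balance is positive (0.875 in this toy case).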

Onto the next experiment!

26-01-2011
0
115 words

Semantic web, linked data & open data

datavisualization.ch published a nice little article explaining The Semantic Web, linked data and open data and pointing out the importance of linked open data for future designers.

“At last, to describe data that is open and linked, there’s the combination of the two, Linked Open Data. This is the data we, as visualization creators, want, because it has clear license terms and is easily linkable with other data sets. To put these terms in relation to each other, I created the following graphic; in the world of all data, only the blue areas are open to the public, with the dark blue being open and linked.”

Read the full article at datavisualization.ch.

25-01-2011
0
437 words

Directions of data-mining

Within the field of data-mining, there are four main distinguishable approaches.

1. Machine Learning
Machine Learning (ML) is a branch of Artificial Intelligence (AI). Machine Learning algorithms evolve their behavior based upon training data. The trained algorithm, in turn, can be used to classify the input data.

Pro

  • Very accurate if trained properly
  • Adapts to the trained corpus [1]

Con

  • Requires a large annotated quality corpus
  • Requires training, ergo time consuming
  • Prone to overfitting (the algorithm adjusts to specifics of the training data that hardly ever occur in unseen data)
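As an illustration of the train-then-classify idea, here is a tiny Naive Bayes text classifier in Python. Naive Bayes is my choice for the sketch, not a method chosen for this project, and the four-sentence training corpus is made up. A real classifier would need a far larger annotated corpus.

```python
import math
from collections import Counter


def train(corpus):
    """corpus: list of (text, label). Returns word counts per label and label counts."""
    word_counts = {}
    label_counts = Counter()
    for text, label in corpus:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts


def classify(text, model):
    """Return the label with the highest (log) probability for the text."""
    word_counts, label_counts = model
    vocabulary = {word for counts in word_counts.values() for word in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(label_counts[label] / total_docs)
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the probability.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up annotated corpus, for illustration only.
corpus = [
    ("great book loved it", "positive"),
    ("wonderful great story", "positive"),
    ("boring bad book", "negative"),
    ("bad awful story", "negative"),
]
model = train(corpus)
```

The pros and cons listed above show up even here: the model only knows the words it was trained on, and everything depends on the quality of the annotated examples.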

2. Dictionary
Dictionary algorithms use, as the name might suggest, a dictionary that is equipped with opinion classification. These algorithms are commonly used to identify opinionated words. For classification it is less suitable since the semantic orientation of a word is highly dependent on its context. The orientation of the word cold might change dramatically when it is associated with feet instead of beer.

Pro

  • Very good at identification of opinion words

Con

  • Not feasible within particular datasets (especially expert texts)
  • Adaptive orientation to local context hard to accomplish

3. Statistics
The statistical approach uses the field of statistics to calculate semantic word orientation. It is based on the theory that similar opinion words occur together more frequently. If the orientation of a word is unknown, it is likely to be positive if it occurs in positive text more frequently (and vice versa). A very large corpus is required to calculate plausible orientation values. To solve this problem, researchers used AltaVista (Turney 2002) and Google (Chaovalit et al. 2005) to query The Internet as corpus.

Pro

  • Unsupervised
  • Using Internet as corpus
  • Possible to provide orientation of unknown words

Con

  • Needs a large corpus
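The statistical approach of Turney (2002) boils down to one formula: the semantic orientation of a phrase is the log ratio of how often it occurs near "excellent" versus near "poor", corrected for how common those two words are. A minimal Python sketch follows; the hit counts are hypothetical numbers, not real search engine results, and the function name is my own.

```python
import math


def so_pmi(near_excellent, near_poor, hits_excellent, hits_poor, smoothing=0.01):
    """Semantic orientation from hit counts, following Turney (2002).

    near_excellent / near_poor: hits for the phrase NEAR "excellent" / "poor".
    hits_excellent / hits_poor: hits for the reference words on their own.
    A small smoothing value avoids division by zero for rare phrases.
    """
    numerator = (near_excellent + smoothing) * hits_poor
    denominator = (near_poor + smoothing) * hits_excellent
    return math.log2(numerator / denominator)


# Hypothetical hit counts for illustration.
positive_phrase = so_pmi(80, 10, 1000, 1000, smoothing=0)
negative_phrase = so_pmi(10, 80, 1000, 1000, smoothing=0)
```

A positive value means the phrase keeps company with "excellent" more often than with "poor", a negative value the opposite. The quality of the result depends entirely on the size of the corpus behind the hit counts.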

4. Semantics
Some dictionaries, like WordNet, are equipped with semantic information. This information can be used to calculate semantic relationships between words. Examples are following the shortest path of synonyms/antonyms between a known and an unknown word to define a word’s semantic orientation (e.g. technical > expert > good), or counting the number of positive and negative synonyms of an unknown word.

Pro

  • Using semantic data directly

Con

  • Discussion about how to define the amount of change when comparing words on synonyms/antonyms or calculation of the shortest path
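The shortest-path idea can be sketched with a breadth-first search in Python. The synonym graph below is a hypothetical toy graph built around the technical > expert > good example, not data taken from WordNet, and the seed words good and bad are an assumption of the sketch.

```python
from collections import deque

# Hypothetical synonym graph for illustration, not real WordNet data.
SYNONYMS = {
    "technical": ["expert"],
    "expert": ["technical", "good"],
    "good": ["expert"],
    "bad": ["poor"],
    "poor": ["bad"],
}


def shortest_path(start, goal, graph):
    """Breadth-first search: the shortest chain of synonyms from start to goal."""
    if start == goal:
        return [start]
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        for neighbour in graph.get(path[-1], []):
            if neighbour in visited:
                continue
            if neighbour == goal:
                return path + [neighbour]
            visited.add(neighbour)
            queue.append(path + [neighbour])
    return None


def orientation(word, graph, positive_seed="good", negative_seed="bad"):
    """Label a word by whichever seed word is closer in the synonym graph."""
    to_good = shortest_path(word, positive_seed, graph)
    to_bad = shortest_path(word, negative_seed, graph)
    if to_good is None and to_bad is None:
        return "unknown"
    if to_bad is None:
        return "positive"
    if to_good is None:
        return "negative"
    if len(to_good) == len(to_bad):
        return "unknown"
    return "positive" if len(to_good) < len(to_bad) else "negative"


path = shortest_path("technical", "good", SYNONYMS)
```

How much weight each step in the path should carry is exactly the open question listed under the cons; this sketch sidesteps it by only comparing path lengths.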

It should be noted that the four described approaches are often used together to enhance results. There is not just one superior direction for an identification/classification problem. For this project, no particular direction has been chosen yet. For the time being a combination of the semantic and statistical approaches looks promising: the so-called SO-PMI-IR approach (Semantic Orientation, Pointwise Mutual Information, Information Retrieval, AWSM!).

[1] Corpus: A large structured (often annotated) set of texts.

18-01-2011
0
439 words

Subjectivity analysis

Before we can start making visual models of semantic text data, we need semantically indexed and labeled data. Within the scope of the assignment, we are not interested in the purely statistical facts about text. To get this extra layer of information out of a document the field of text analysis will be used. Even more specifically, we will be looking into the field of Opinion Mining or Sentiment Analysis.

Opinion Mining and Sentiment Analysis are often used within the same context. The problem they try to solve is similar. The only difference is that Opinion Mining originated from the field of Information Retrieval, whereas Sentiment Analysis arose from the field of Natural Language Processing. To avoid confusion, Subjectivity Analysis will be used within this report when we refer to all the fields mentioned above.

Subjectivity Analysis is a specialized research field that is related to Information Retrieval (IR), Artificial Intelligence (AI) and Natural Language Processing (NLP).

  • Information Retrieval is the field of science that searches for data and meta data in documents, World Wide Web and other sources of information;
  • Artificial Intelligence is the field of science that tries to imitate/create intelligence in computers;
  • Natural Language Processing is the field of science, part of Information Science, that deals with the human-machine language interaction.

When analyzing a document or text, subjectivity analysis can be divided into three steps:

  1. Identification (of topics and subjective sentences)
  2. Classification (of sentences and documents)
  3. Aggregation (of classified sentences and documents)

Figure 1. The three steps of Sentiment Analysis. I will be extending this diagram in the forthcoming posts this week. Collection is an essential pre-analysis step (see: Collection of data) and is therefore included.
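The three steps can be sketched as a tiny pipeline in Python. Everything in it is a toy assumption: the two opinion word lists are made up, identification just keeps sentences that contain an opinion word, and classification counts positive against negative words. It only shows how the steps hand their output to each other.

```python
import re

# Hypothetical opinion word lists, for illustration only.
POSITIVE = {"good", "great", "enjoyable"}
NEGATIVE = {"bad", "boring", "dull"}


def identify(text):
    """Step 1: keep only the sentences that contain an opinion word."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    opinion_words = POSITIVE | NEGATIVE
    return [s for s in sentences if set(re.findall(r"[a-z]+", s.lower())) & opinion_words]


def classify(sentence):
    """Step 2: label a sentence by the balance of its opinion words."""
    words = re.findall(r"[a-z]+", sentence.lower())
    balance = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if balance > 0:
        return "positive"
    if balance < 0:
        return "negative"
    return "neutral"


def aggregate(labels):
    """Step 3: tally the labels into one summary for the whole document."""
    summary = {"positive": 0, "negative": 0, "neutral": 0}
    for label in labels:
        summary[label] += 1
    return summary


text = "The plot is great. It was published in 2010. The ending is boring and dull. Good but bad."
subjective = identify(text)
summary = aggregate(classify(sentence) for sentence in subjective)
```

The factual sentence about the publication year is dropped in step one, and the remaining three sentences end up as one positive, one negative and one neutral.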

The first two steps (described above) are often referred to as the Data Mining process. Data Mining is a technique to search for patterns and correlations in large datasets.

Within this project I will be looking for suitable methods to perform step one and two. Data Mining (and more specifically Sentiment Analysis) is a very active research field. This is partly due to the demand to create smart search and indexing solutions for the ever growing Internet. I will be using a lot of this research to select a suitable Data Mining method for this project.

The third step, aggregation, is different though. This is where the project will really start. How to make sense of the data that has been identified and labeled within the Data Mining steps?

This will be the main thinking direction the forthcoming days. Meanwhile, I will be posting about the directions of data mining I distinguished, later this week.

13-01-2011
2
164 words

Collection of data

As papers accumulate in my Mendeley, I thought it might be a good idea to start collecting some data. It might save me from scraping the web in about a month. To do this I made myself a little database and a bookmarklet that automatically saves a selected text into my database. I’ve been doing 1-2 columns a day for a week now, so my collection is starting to take shape.

Talking about databases, 2011 came with two nice surprises. Just before the end of 2010 I requested usage licenses for SentiWordNet and the WordNet Affect domain. I got both! SentiWordNet is an extension of WordNet with semantic polarity applied to it. WordNet Affect does something similar by defining all synsets with affective states whose valence depends on the semantic context. Both lexicons might be very useful during the next phase, when they can be used as corpus or dictionary for the algorithms I’d like to test. I’ll be telling more about this later this week.

30-12-2010
0
107 words

The joy of stats

I think it’s two weeks ago that I watched Hans Rosling speaking (with his funny little accent) about the power and joy of statistics in some kind of fancy pantsy augmented reality setting. What I didn’t know is that the BBC made a documentary about it, including David McCandless (Information is Beautiful) and Hans Rosling himself.

Hans Rosling says there’s nothing boring about stats, and then goes on to prove it. Only with statistics can we make sense of the world and harness the data deluge to serve us rather than drown in its confusion.
A one-hour long documentary produced by Wingspan Productions and broadcast by BBC, 2010.


23-12-2010
0
97 words

IDKWIGBIWTBT

Last weekend I visited I Don’t Know Where I’m Going But I Want To Be There, a symposium about the expanding field of graphic design. Graphic Design Museum Breda compiled a small recap with highlights of the event.

You can check all the other videos of the symposium here.

As you may have noticed from my stream of bookmarks I’ve been reading up on Natural Language Processing to check on the possibilities of semantic text labeling. I’ll be posting my findings next week when I’ve spent some time sorting and describing all possibilities.

Until then, Merry Christmas!

17-12-2010
0
106 words

Books Ngram Viewer by Google Labs

Google’s pursuit of collecting information led them to digitizing books. They have scanned around 15 million books (roughly 12 percent of every book ever published). Jean-Baptiste Michel and some of his colleagues used this as a source to investigate cultural trends. The good thing: the database is available online (and in raw form).

So what does it do?

When you enter phrases into the Google Books Ngram Viewer, it displays a graph showing how those phrases have occurred in a corpus of books (e.g., “British English”, “English Fiction”, “French”) over the selected years.

Below is a small experiment with sideburns versus sideboards.

Seems the roaring seventies liked their sideburns.
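
The kind of counting behind such a graph can be sketched in a few lines. This is only a rough sketch under my own assumptions: the corpus (`TOY_CORPUS`) is invented, the function name `relative_frequency` is hypothetical, and I assume the viewer divides a phrase's count by the total number of n-grams of the same length in that year.

```python
import re
from collections import Counter

# Hypothetical toy corpus of (publication year, text); invented for illustration only.
TOY_CORPUS = [
    (1970, "He grew long sideburns. Sideburns were everywhere."),
    (1970, "The sideboards held the good china."),
    (1980, "His sideburns were gone."),
]


def relative_frequency(phrase, corpus=TOY_CORPUS):
    """Map each year to (occurrences of phrase) / (all n-grams of the same length)."""
    target = tuple(phrase.lower().split())
    n = len(target)
    hits, totals = Counter(), Counter()
    for year, text in corpus:
        tokens = re.findall(r"[a-z]+", text.lower())
        # Slide a window of n tokens over the text to get its n-grams.
        grams = list(zip(*(tokens[i:] for i in range(n))))
        totals[year] += len(grams)
        hits[year] += grams.count(target)
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


print(relative_frequency("sideburns"))
print(relative_frequency("sideboards"))
```

Plotting the resulting numbers per year for both words would give a (very small) version of the sideburns versus sideboards graph.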

16-12-2010
0
110 words

Analysing data is the future for journalists

Tim Berners-Lee, the man behind the World Wide Web, thinks data journalism is the future.

“The responsibility needs to be with the press,” Berners-Lee responded firmly. “Journalists need to be data-savvy. It used to be that you would get stories by chatting to people in bars, and it still might be that you’ll do it that way some times.”

“Data-driven journalism is the future,” Berners-Lee insisted. To which his colleague Nigel Shadbolt, who with Berners-Lee has been working to get the civil service and local government to open up their data, added succinctly: “Well, part of the future.”

Seems non-fiction journalism literature isn’t a bad choice at all.

14-12-2010
1
75 words

Mendeley

As this assignment will involve a lot of literature, I thought it would be a splendid idea to save all my files using some kind of online syncing service. Currently I am using Mendeley. Give me a nudge if you are out there too. I am new to programs like this, but its features look promising so far. Let us hope it will save me a lot of cursing in five to six months.

12-12-2010
0
465 words

A semantic visual text abstract

So what will this assignment be about?

A semantic visual text abstract is a graduation assignment by Servé Custers for LUSTlab. LUSTlab was set up by The Hague-based design agency LUST in 2010. Their focus is research on the edge of society where communication, science and technology meet design, interaction and technique.

Within the domain of information, books are still our main source. Before people start reading a book, whatever its genre, they tend to read the blurb on the back to see if it meets their expectations and if they might enjoy reading the complete book. An abstract, however, is only a small (biased) reflection of the book’s content. Users get only a limited view of the book’s content.

The Internet provides a giant bubble of public information that, together with the use of new technology, can help improve experiences like this, especially with developments like the Semantic Web. The Semantic Web is a framework of methods and techniques that allows machines to manipulate and process the meaning of information. Within this context, computers can be used to generate an interpretation of text. Together with other public information (related articles, location, et cetera) this will provide an interactive view on a book’s spot in the information bubble (and its relationship with other items).

Problem definition

How to create a visual text abstract on a semantic level, capturing the emotional value of text in a (static or dynamic) visualization within the domain of non-fiction, journalism literature? The result should be related to the semantics of the text, enable quick visual comparison between different texts through their visual summaries, be related to the content and be intuitive to use.

Assignment

Define a system to create a visual text abstract, using visual elements only. The aim of this assignment is not to explore new ways to visualize the text itself. It seeks to create a visual language that can express the text’s semantics. This means the eventual product will be more than just an analytical visualization of the source text.

First, an analysis of different text mining and text analysis techniques is necessary to define the best data set for visualization purposes with regard to the predefined domain of non-fiction and journalism literature (e.g. Joris Luyendijk, Jan Mulder, Prem Radhakishun (in Dutch)).

Next, a system has to be defined for visualizing the semantic data. What elements are needed to create a connection between content and visualization while conserving the semantic data (both data & metadata)?

Eventually, this will lead to a visualization that expresses the text semantics, makes it universally comparable with other text sources and is understandable by its users. This will allow users to understand the book’s content and its spot within the public information bubble (without reading the blurb on the back).

02-12-2010
0
126 words

First things first

Over the coming six months, Rant will provide information (or sometimes figments of my imagination) about my (Servé Custers) Master Thesis project at LUSTlab. Eventually this will create a chronological timeline of my process (even on your screen, thanks to horizontal scrolling). Of course, you are encouraged to join the debate and give me as much input/critique as you want.

Best way to keep in touch is to plug this blog in your preferred RSS reader.

Special note for all Internet Explorer users: I used some fancy jQuery stuff that, apparently, Internet Explorer does not like that much. If you want cornered buttons, scrolling navigation and other spiffy features, use Chrome, Firefox, Opera, Safari or any other recently updated Web browser.

That’s all for now, enjoy!
