Geolinguistic contrasts in Wikipedia

Context: Master's thesis

We live in an interconnected world defined by a continuously shifting interplay of knowledge paradigms driven by sciences, laws and religions, amongst others. Across personal and professional spheres, everybody has to deal with distinct, changing discourses and truths in order to form a personal perception, which ends up being the reality they live in.

Knowledge results from a “complex process that is social, goal-driven, contextual, and culturally-bound”¹. As digital networks are supplanting the book as the main metaphor of knowledge, the previously separate ecosystems that originated in the book era are evolving into a global network, whose new intricacies imply disruptive changes: the meaning of learning, memorizing, sharing, communicating or even dealing with one’s own opinion is shifting.

¹ Weinberger, D. (2010, February 2). The Problem with the Data-Information-Knowledge-Wisdom Hierarchy. Harvard Business Review. Retrieved April 29, 2013, from http://blogs.hbr.org/cs/2010/02/data_is_to_info_as_info_is_not.html

About Wikipedia

Over the last ten years, Wikipedia, a knowledge network, has irrevocably supplanted traditional encyclopedias and has become a ubiquitous source across cultures. Because of its multilingualism, its huge volume and its highly democratic structure, Wikipedia is not curated and reviewed as much as “classical” scientific publications, and accordingly opens up ways to display information reflecting very different world views or angles. Every Wikipedia article is meant to present a truth which relies on factual information. However, because of its composite nature, which relies on constant editing by a multitude of users who sometimes focus on very different aspects, it depicts a far more complex reality, which is sometimes hard to grasp. Wikipedia articles are a great way to get to know a topic, but if we, as (informed) users, want to go further into it, it is also very important to question the scope and look for alternative sources so that the newly acquired knowledge can be reliably consolidated.

Consolidating Knowledge from Wikipedia

One possible strategy to consolidate knowledge acquired from Wikipedia is to explore the different language versions of an article, depending of course on which languages we personally master. This strategy is a great way to discover and question discrepancies as well as commonalities across languages. Reflecting on those really helps in weighting the important facts, discerning differing opinions or cultural backgrounds and tracking down the largest possible range of external references (which back any well-reviewed article).

Looking at what has been done so far

At the beginning of the project, I looked at how cultural and knowledge diversity in Wikipedia has been visualized and came across a number of interesting projects. They highlight, for example, that people who edit articles about places don’t necessarily live nearby ([Who edits Wikipedia? A map of edits to articles about Egypt](http://www.zerogeography.net/2013/03/who-edits-wikipedia-map-of-edits-to.html)) or that geo-tagged articles about American places are on average longer than those about European places ([Article Quality in English Wikipedia](http://www.zerogeography.net/2011/12/article-quality-in-english-wikipedia.html)).

Sketching

At the beginning of the creation process, I explored two different directions:

Coupling articles with maps, so that e.g. every paragraph can be geographically explored on a map.
Comparing the content of articles geographically.

Considering a Wikipedia article within its network

So far, we have reflected on the Wikipedia article as an independent information entity. However, this is a narrow perspective which needs to be extended in order to reflect its networked nature. Because of this, but also because of the complexity inherent to our world, no article can cover a topic on its own without being linked to further articles which contain complementary content. From this point of view, any Wikipedia article should be considered as a network node with an immediate network and could be renamed “networked article”. This echoes David Weinberger, who described networked facts as something which “exist within a web of links that make them useful and understandable”¹.

¹ Weinberger, D. (2012). Too Big to Know: Rethinking Knowledge Now That the Facts Aren’t the Facts, Experts Are Everywhere, and the Smartest Person in the Room Is the Room. New York: Basic Books.

An interface to map and compare “networked articles”

Network visualizations remain hard to decipher: they are intricate and (mostly) give readers very few points of reference. I attempted to address these two points by plotting networked articles on maps. Maps follow a universal visualization metaphor, which offers very clear reading rules, and their geographical dimension helps to “untangle” network complexity. As this approach filters out articles without any geographical information, it only displays a partial view of the content of Wikipedia articles, but one that enables comparisons.

Comparing language versions within a dedicated interface

I decided to create an interface which helps to explore the geographical topologies of networked articles and also enables comparison between languages. I intended to design an explorative tool in which visual contrasts highlight differences or knowledge diversity. As examples for my project, I processed Wikipedia articles about towns.

Parsing the data

A Wikipedia article consists of different parts: abstract, table of contents, infobox, body of the article and [navigation boxes](http://en.wikipedia.org/wiki/Wikipedia:Navigation_templates) (for cross-navigation). All the parts are readable as one object, except the navigation boxes, which are not always fully displayed by default, are densely packed with links and are sometimes hard to read. For this reason, I decided to exclude them from my definition of a networked article, as I wanted to process only the content that is “discernible” to the user. This decision had heavy consequences, as the Wikipedia API call which provides the outgoing links of an article includes the links from the navigation boxes. As a result, I had to parse the article on my own to look for links.
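For reference, the API call in question can be sketched as follows. This is only an illustration of the standard `prop=links` query of the MediaWiki API, not code from the project; the helper names and the sample response (including its page id and titles) are made up for the example, and no network request is made.

```python
from urllib.parse import urlencode

WIKIPEDIA_API = "https://en.wikipedia.org/w/api.php"


def links_query_url(title, continue_from=None):
    """Build the URL of a prop=links query for one article (main namespace only)."""
    params = {
        "action": "query",
        "prop": "links",
        "titles": title,
        "plnamespace": 0,   # only links to regular articles
        "pllimit": "max",   # as many links per request as allowed
        "format": "json",
    }
    if continue_from:
        # Long link lists are paginated; the API returns a plcontinue token.
        params["plcontinue"] = continue_from
    return WIKIPEDIA_API + "?" + urlencode(params)


def linked_titles(response):
    """Collect the linked titles from a decoded JSON response."""
    titles = []
    for page in response.get("query", {}).get("pages", {}).values():
        titles.extend(link["title"] for link in page.get("links", []))
    return titles


# Hypothetical response in the shape returned by the API.
sample_response = {
    "query": {
        "pages": {
            "12345": {
                "title": "Annecy",
                "links": [
                    {"ns": 0, "title": "Haute-Savoie"},
                    {"ns": 0, "title": "Lake Annecy"},
                ],
            }
        }
    }
}

print(links_query_url("Annecy"))
print(linked_titles(sample_response))
```

The list returned this way does not say where in the page a link appears, so links from navigation boxes cannot be told apart from links in the body.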

A Wikipedia article can be parsed from two different sources: the rendered text in HTML or the markup text (which reflects the way users write articles). I went for the second one, as it looked cleaner to process, but it proved to be a poor decision as I ran into consistency problems, amongst others. At first glance, internal links in Wikipedia seemed to be all delimited by [[double square brackets]]. After a while, however, I noticed that the town twinning markup in the French version does not follow this principle. For the two articles displayed in the prototype, I manually edited the data I had generated to reflect this situation.
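A minimal sketch of extracting double-square-bracket links from the markup text could look like this. It is not the parser used in the project; the regular expression, the list of skipped prefixes and the sample markup are assumptions made for illustration. Links produced by templates, such as the town twinning markup mentioned above, are not matched by it.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]];
# group 1 is the target title without section or label.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

# Prefixes of links that do not point to articles (illustrative, not exhaustive).
NON_ARTICLE_PREFIXES = (
    "file:", "image:", "category:",
    "fichier:", "catégorie:", "datei:", "kategorie:",
)


def extract_internal_links(wikitext):
    """Return distinct linked article titles in order of first appearance."""
    titles = []
    for match in WIKILINK.finditer(wikitext):
        title = match.group(1).strip().replace("_", " ")
        if not title or title.lower().startswith(NON_ARTICLE_PREFIXES):
            continue
        if title not in titles:
            titles.append(title)
    return titles


# Made-up markup sample.
sample = (
    "'''Annecy''' lies on [[Lake Annecy]] in "
    "[[Haute-Savoie|the Haute-Savoie department]], "
    "near [[Geneva#History|Geneva]]. [[File:Annecy.jpg|thumb]]"
)

print(extract_internal_links(sample))
# ['Lake Annecy', 'Haute-Savoie', 'Geneva']
```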

Linking language versions together

I assumed for a while that I could use [DBpedia](http://de.dbpedia.org) to get a unique identifier linking all the language versions of a single article together. I spent a lot of time experimenting with it but unfortunately I couldn’t get any tangible results. DBpedia proved very slow and very complex to use (for me), so I had to look for an alternative, which I found in [Wikidata](http://en.wikipedia.org/wiki/Wikipedia:Wikidata), which is run directly by the Wikimedia Foundation and went online only recently.

Finding parameters to compare language versions

In the next step, I tried to find parameters which go beyond the topological aspect and characterize the quality of the relationships between an article and its linked articles. Reflecting on the data I had and what could be computed from it, I processed the following parameters:
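As a rough sketch of how Wikidata can tie language versions together: its `wbgetentities` API action accepts a site and a title and can return the item’s sitelinks, i.e. the titles of the same article in other language editions. The helper names, the choice of three languages and the sample response (with a placeholder item id) are assumptions for illustration; no request is sent here.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def sitelinks_url(site, title):
    """Build a wbgetentities URL that looks up an item by site and title."""
    params = {
        "action": "wbgetentities",
        "sites": site,         # e.g. "enwiki" for the English Wikipedia
        "titles": title,
        "props": "sitelinks",  # only return the links to language versions
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


def titles_by_language(response, wanted=("enwiki", "frwiki", "dewiki")):
    """Map each wanted site id to the article title found in the response."""
    result = {}
    for entity in response.get("entities", {}).values():
        for site, link in entity.get("sitelinks", {}).items():
            if site in wanted:
                result[site] = link["title"]
    return result


# Hypothetical response in the shape returned by the API;
# "Q_EXAMPLE" is a placeholder, not a real item id.
sample_response = {
    "entities": {
        "Q_EXAMPLE": {
            "id": "Q_EXAMPLE",
            "sitelinks": {
                "enwiki": {"site": "enwiki", "title": "Annecy"},
                "frwiki": {"site": "frwiki", "title": "Annecy"},
                "itwiki": {"site": "itwiki", "title": "Annecy"},
            },
        }
    }
}

print(sitelinks_url("enwiki", "Annecy"))
print(titles_by_language(sample_response))
```

The item id then serves as the language-independent identifier: two articles in different languages describe the same thing if they resolve to the same item.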

size of the article: the volume of text serves as an indicator of the richness of the content
frequency of linking: whether the “main” article links more than once to a specific article (not used in the interface)
back-linking: whether the linked article links back to the “main” article. Inspired by the theory of interpersonal ties originating from sociology, I distinguished strong links (the linked article links back to the main article) from weak links (the linked article does not link back to the main article)
language version uniqueness: depending on the language version, an article may or may not be linked to a specific further article. In my endeavor to compare language versions, I distinguished the common core (articles linked in both language versions) from the language orphans (articles linked in only one of the language versions)
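The back-linking and uniqueness parameters boil down to set operations. The sketch below assumes that linked articles have already been mapped to language-independent identifiers (for instance Wikidata items), so that links from two language versions can be compared directly; the function names and the toy data are made up and do not reproduce results from the project.

```python
def split_core_and_orphans(links_a, links_b):
    """Split two link sets into (common core, orphans of A, orphans of B)."""
    common = links_a & links_b
    return common, links_a - common, links_b - common


def split_strong_and_weak(main, linked, outgoing):
    """Split linked articles into (strong, weak) links.

    A link is strong when the linked article links back to the main article.
    `outgoing` maps an article id to the set of ids it links to.
    """
    strong = {article for article in linked if main in outgoing.get(article, set())}
    return strong, set(linked) - strong


# Made-up toy data: links of a "town" article in two language versions.
links_en = {"lake", "castle", "museum"}
links_fr = {"lake", "castle", "town-hall"}
# Outgoing links of the linked articles (English version only, for brevity).
outgoing = {"lake": {"town"}, "castle": {"town", "lake"}, "museum": {"lake"}}

core, orphans_en, orphans_fr = split_core_and_orphans(links_en, links_fr)
strong, weak = split_strong_and_weak("town", links_en, outgoing)

print(sorted(core), sorted(orphans_en), sorted(orphans_fr))
# ['castle', 'lake'] ['museum'] ['town-hall']
print(sorted(strong), sorted(weak))
# ['castle', 'lake'] ['museum']
```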

Final status

In order to compare language versions, I created an interface which presents two parallel, synchronized maps, each displaying one isolated topology. I decided against a single map displaying two overlapping languages, as it weakens the readability and the perception of the respective topologies. To compensate for the drawback of displaying information on two maps (no comparison by overlapping is possible), I offered a visual way to highlight the links between a main article and its linked articles with the “show network” function. This function also allows users to keep track of all connections, even if they lie far outside the current view.

As the quantity of text available remains an important factor but is quite challenging to display directly on the map, I implemented two pie charts reflecting the text quantities corresponding to the settings chosen by the user.

Some results

I implemented two articles, Berlin and Annecy in France (my hometown), in three languages and tried to find interesting facts and discrepancies.

Annecy

I found that an [article about a shooting](http://en.wikipedia.org/wiki/Annecy_shootings) is prominently placed in the English version. There is no mention of it in the French version, and this event happens to be barely remembered (I asked relatives and friends).
In the English version, I discovered the name of an [old language which used to be spoken in the region](http://en.wikipedia.org/wiki/Arpitania). I was vaguely aware that a language had disappeared, but nothing more, and it is not mentioned in the French version.
There are a lot of strong links within the French article.

Berlin

In the German version, the place which is widely known as Haus der Kulturen der Welt in Berlin is to be found as Kongress Halle.
The English language orphans are mostly related to tourist sites like museums or monuments.
There are almost strong links within the different language versions

Afterward, it is always interesting to try to figure out reasons why contents are developed in such different ways.

Technical details

I tried to use off-the-shelf solutions as much as possible:

The map tiles from Stamen (Toner Lite)
Leaflet.js (to embed maps in a web page) and various plug-ins like Leaflet.Sync (to display two synchronized maps) or Leaflet.markercluster (to cluster markers on the map depending on the zoom level; not really used, but its implementation is useful for potential further steps) and Arc.js (to draw the networks)
D3.js (to draw the pie charts)
Underscore.js (to work with arrays and strings)
jQuery (for the rest...)

Learnings

This project has been very intensive. It was a great starting point for my Master’s thesis, and it also allowed me to pick up a little programming and to experience a comprehensive creation process on my own (the conception, the compilation of a data set and the creation of an interface). I tried a lot, failed a lot and “wasted” a lot of time, but I enjoyed the learning curve and the result overall. I would be happy to go back to this topic after my Master’s thesis.

Many thanks to Sebastian Meier for the support.

Useful resources

http://www.codecademy.com/courses/web-beginner-en-vj9nh/0/1
https://en.wikipedia.org/w/api.php
https://www.wikidata.org/w/api.php
http://stackoverflow.com

Incom is the communication platform of Fachhochschule Potsdam
