The overarching goal of my internship project is to make Wikipedia data easier to access and use for researchers and anyone else interested in it. One of the first steps is therefore to understand how Wikipedia is used for research in the wild. And since I need to keep at least a semi-structured record of the things I learn along the way, I figured I might as well do it here on this very blog. What follows is an informal review of some of the scientific literature from the last five years in which Wikipedia data has been used.
Maybe one of the most obvious use cases of Wikipedia data is Natural Language Processing (NLP). In a paper from 2017, a group of Stanford researchers (Chen et al.) used Wikipedia articles as the sole knowledge source for the task of question answering (QA). The image below gives a high-level overview of how this works.
The Document Retriever is described as
a module using bigram hashing and TF-IDF matching designed to, given a question, efficiently return a subset of relevant articles.
As an interesting side note, the authors claim that the Document Retriever outperforms Wikipedia's built-in search engine.
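To get a feel for what "bigram hashing and TF-IDF matching" means in practice, here is a minimal sketch of the idea, not the paper's actual implementation: unigrams and bigrams are hashed into a fixed number of buckets, documents are represented as TF-IDF vectors over those buckets, and candidates are ranked by dot product with the question vector. All function names and the bucket count are my own choices for illustration.

```python
import math
import re
from collections import Counter

N_BUCKETS = 2 ** 20  # the paper hashes into a larger space; this suffices for a demo

def features(text):
    """Map a text to hashed unigram and bigram feature ids."""
    tokens = re.findall(r"\w+", text.lower())
    grams = tokens + [" ".join(bg) for bg in zip(tokens, tokens[1:])]
    return [hash(g) % N_BUCKETS for g in grams]

def build_index(docs):
    """Per-document TF-IDF vectors over the hashed features."""
    counts = [Counter(features(d)) for d in docs]
    df = Counter(f for c in counts for f in c)
    n = len(docs)
    idf = {f: math.log((n + 1) / (d + 1)) for f, d in df.items()}
    vectors = [{f: (1 + math.log(tf)) * idf[f] for f, tf in c.items()}
               for c in counts]
    return vectors, idf

def retrieve(question, vectors, idf, k=1):
    """Return the indices of the k best-matching documents."""
    q = Counter(features(question))
    qvec = {f: (1 + math.log(tf)) * idf.get(f, 0.0) for f, tf in q.items()}
    scores = [sum(qvec.get(f, 0.0) * w for f, w in v.items()) for v in vectors]
    return sorted(range(len(vectors)), key=scores.__getitem__, reverse=True)[:k]
```

Hashing the n-grams instead of keeping a vocabulary is what keeps the memory footprint fixed, at the cost of occasional bucket collisions.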
How did the researchers access the data?
Footnote 3 in the paper refers to the WikiExtractor script, a tool for extracting plain text from Wikipedia dumps, created and actively maintained by Giuseppe Attardi, an NLP researcher at Università di Pisa, Italy. The tool currently has over 2.7k stars and 776 forks on GitHub, so there clearly is a "market" for utilities like this.
On to the next research paper. This one too is written by researchers from Stanford (Sheehan et al.) and the title is pretty descriptive: Predicting Economic Development using Geolocated Wikipedia Articles. From the introduction:
While this paper covers a very different topic and seems mostly unrelated to the previous one, it also parses a Wikipedia data dump to obtain the raw text of the articles and then applies NLP techniques to it. The main difference is that only the geolocated articles were extracted. Unfortunately, the paper doesn't go into detail about exactly how the parsing was done.
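Since the paper doesn't say how the geolocated articles were identified, here is one plausible (and much simplified) approach: scan the wikitext for a coordinate template in its decimal form, {{coord|lat|lon|...}}. Real articles use many more variants (degrees-minutes-seconds, named parameters), so treat this purely as a toy sketch; the function name and regex are mine.

```python
import re

# Matches the decimal form {{coord|48.8566|2.3522|...}} only; DMS-style
# and named-parameter variants would need additional handling.
COORD_RE = re.compile(r"\{\{\s*[Cc]oord\s*\|([^|}]+)\|([^|}]+)")

def extract_coord(wikitext):
    """Return (lat, lon) if a decimal {{coord}} template is found, else None."""
    m = COORD_RE.search(wikitext)
    if not m:
        return None
    try:
        return float(m.group(1)), float(m.group(2))
    except ValueError:
        return None  # non-decimal coordinate formats fall through here
```

Filtering a dump down to geolocated articles would then amount to keeping only pages where this returns a value.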
The two research papers discussed so far have both used article text content as the source for constructing their corpora. The next one takes advantage of Wikipedia's edit history. Learning To Split and Rephrase From Wikipedia Edit History (Botha, Faruqui et al.) is a paper from 2018 by a group at Google AI Language.
So this group mined Wikipedia's edit history to create the WikiSplit dataset. How? The process is described only superficially, but it still illustrates an interesting use case.
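As I understand it, the core idea is to compare sentences across adjacent revisions of the same article: if a sentence in one revision lines up with two adjacent sentences in the next, that's a candidate split-and-rephrase pair. A heavily simplified sketch of such a matching heuristic (the paper's actual extraction involves more filtering; the trigram-anchoring below and all names are my own illustration):

```python
def edge_trigrams(sentence):
    """Return the first and last three tokens of a sentence."""
    tokens = sentence.split()
    return tokens[:3], tokens[-3:]

def find_splits(old_sents, new_sents):
    """Candidate (full_sentence, (first_half, second_half)) pairs: a
    sentence C counts as split into adjacent (A, B) when C and A start
    with the same trigram and C and B end with the same trigram."""
    pairs = []
    for c in old_sents:
        c_start, c_end = edge_trigrams(c)
        for a, b in zip(new_sents, new_sents[1:]):
            if edge_trigrams(a)[0] == c_start and edge_trigrams(b)[1] == c_end:
                pairs.append((c, (a, b)))
    return pairs
```

Run over millions of revision pairs, even a crude heuristic like this yields a large training set, which is exactly what made the edit history attractive here.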
Before moving on from NLP-related research, let's briefly zoom out a bit to take a look at the wider picture. Although somewhat dated, Wikipedia research and tools: review and comments (2012) by F.A. Nielsen provides many valuable insights into Wikipedia research, among which is this piece of information:
In July 2010 Google Scholar claimed to return 196,000 articles when queried about Wikipedia
Let's see what this number looks like in the present day, May 2021.
Over 2 million results. Wow, in about ten years this number has grown by a factor of ten. Not bad! What else besides NLP can Wikipedia data be used for?
Going back to our informal literature review, our next stop is a paper from 2017 entitled What Makes a Link Successful on Wikipedia? by Dimitrov, Singer et al. from GESIS - Leibniz Institute for the Social Sciences. Now we're talking network analysis. From the abstract:
While a plethora of hypertext links exist on the Web, only a small amount of them are regularly clicked. Starting from this observation, we set out to study large-scale click data from Wikipedia in order to understand what makes a link successful. We systematically analyze effects of link properties on the popularity of links.
The authors further state:
Even though links are omnipresent on the Web, only a minority of them get regularly clicked by humans. For example, on Wikipedia only around 4% of all existing links are clicked by visitors more frequently than 10 times within a month.
Clearly, this type of analysis demands a different approach than "just" constructing a corpus out of the text content of Wikipedia articles. Let's dig deeper into the article to find out more.
Wikipedia pages are connected by links for which we can compute a variety of features that are categorized in network features (e.g., target article’s out-degree), semantic features (e.g., text similarity between source and target articles), and visual features (e.g., position of the link on the screen). This work aims at understanding what makes a link successful—i.e., which link properties best explain observed numbers of user transitions.
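To make two of those three feature families concrete, here are toy versions of a network feature (out-degree from an edge list) and a semantic feature (bag-of-words cosine similarity). These are illustrative stand-ins of my own, not the paper's feature extractors:

```python
import math
from collections import Counter

def out_degrees(edges):
    """Network feature: number of outgoing links per article,
    given (source, target) edge pairs."""
    return Counter(src for src, _ in edges)

def cosine_sim(text_a, text_b):
    """Semantic feature: bag-of-words cosine similarity of two texts."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

The visual features (link position on the rendered page) are the odd ones out: they can't be computed from the wikitext alone, which is why the authors had to retrieve rendered HTML, as we'll see below.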
Before reading the results of the experiment, it could be fun to try to guess what makes a link "successful". Based on my own (totally representative) Wikipedia browsing behavior, I'd say visual features are the least important. If I've arrived at a Wikipedia page, it's mostly because I either know very little, or even nothing, about the topic, in which case I'll mainly read the lead section, or because I'm looking for a specific piece of information, in which case I'll try to navigate to the relevant section with the help of the table of contents. If I read the lead section and find links to unfamiliar concepts, I will sometimes hover over them to read the pop-up summary or click on the link directly. In that sense, links at the top of the page are of course more visually prominent but could it really be said that this is why they get more clicks (if they do)? Or are they simply more relevant to the topic in a broader sense, which is why they are found at the top?
The results are interesting, albeit not very surprising:
We provide empirical evidence that Wikipedia users have a preference of choosing links leading to the periphery of underlying topological link network, that they prefer links leading to semantically similar articles, and that links positioned at the top and left-side of an article have a higher likelihood of being used.
There's of course much more to the study than just this summary of a conclusion, but I'm mainly interested in what data the researchers used and how they accessed and processed it. This is described in the article itself, and there's also a link to a GitHub repo with all the code. This is awesome! Essentially, the authors created a whole custom framework for parsing Wikipedia links and creating a SQL database with the extracted data.
In this work, we focus on all articles contained in the main namespace of the English Wikipedia as extracted from the public XML dump from March 2015.
To obtain authentically rendered pages, we retrieved the corresponding static HTML pages by using Wikimedia’s public API. In contrast to using readily available link dumps, this allowed also for considering links that are indirectly included in a page, e.g., by templates. A tiny part (< 0.01%) of articles could not be retrieved and had to be excluded from the analysis. With this data, we created the Wikipedia link network Dwiki using articles as nodes and unique links as directed edges; Dwiki contains ∼4.8 million articles connected by ∼340 million distinct links.
For measuring actual usage of links, we utilize openly available transition data from Wikipedia from February 2015. It contains aggregated page requests extracted from the server log for the English desktop version of Wikipedia in the form of (referrer, resource) pairs, i.e., transitions, and their respective transition counts. The data has already been pre-processed to filter bots and web crawlers and transitions occurring less than 10 times.
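This transition data is the Wikipedia clickstream dataset. The current monthly releases are tab-separated rows of the form (prev, curr, type, n); the 2015 release the authors used had a slightly different column layout, so take this sketch as indicative only. Keeping just the article-to-article transitions that followed an internal link might look like:

```python
import csv
import io
from collections import defaultdict

def load_link_transitions(tsv_text):
    """Parse clickstream rows (prev, curr, type, n) and keep only
    transitions whose type marks an internal wiki link."""
    counts = defaultdict(int)
    for prev, curr, kind, n in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if kind == "link":  # other types include e.g. external referrers
            counts[(prev, curr)] += int(n)
    return dict(counts)
```

Joining these counts against the link network then gives, per link, how often it was actually clicked, which is the paper's dependent variable.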
So the authors accessed three different data sources: dumps, clickstream data, and APIs.
There is now a MediaWiki utility library with modules for parsing XML dumps and retrieving data from the APIs, among other things. It already existed at the time of this study, and it would seem that using some of these utilities, instead of creating a whole framework from scratch, would have made the authors' lives easier. Why didn't they use the existing modules? Did they not know about them? Or are the modules missing some functionality the study needed? Hard to know without asking the authors directly, but it's something to ponder.
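For perspective on how much (or little) machinery basic dump parsing requires: even the Python standard library can stream pages out of a MediaWiki XML export. This is not the library mentioned above, just a minimal sketch; the dedicated utilities additionally handle multiple revisions, siteinfo, compression, and so on. The schema namespace version varies between dumps.

```python
import io
import xml.etree.ElementTree as ET

# Namespace of the export schema; the version number differs across dumps.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_pages(source):
    """Stream (title, wikitext) pairs from a MediaWiki XML export
    without loading the whole dump into memory."""
    for _, elem in ET.iterparse(source):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(NS + "revision/" + NS + "text")
            yield title, text
            elem.clear()  # release the parsed subtree; matters for multi-GB dumps
```

The incremental iterparse plus elem.clear() pattern is what makes this viable on dumps far larger than RAM.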
With this, our exploration comes to an end. These were just a handful of the many interesting studies out there. Judging by current trends, with machine learning becoming increasingly ubiquitous and new datasets being created all the time, it is clear that Wikipedia has a bright future as a data resource for research. Why not head over to Google Scholar and see for yourself?