
Hello Outreachy!

Behold, the first sentence of the first post of the first blog ever written by me. Incidentally, it's five o'clock on a public holiday morning. The rest of the family is sleeping like babies. Not even the birds are up singing yet. But hey, I've just guzzled down a big, big, BIG cup of black coffee, straight from the mountains of Peru. And a glass of water. Which is to say that nothing can stop me now.

The secret to early mornings and everything else

Let's set things straight: there are a lot of other things I could be doing right now. Surely, I didn't just wake up this morning and start a blog out of the blue. One week ago, I found out I was one of 71 lucky interns accepted to this summer's Outreachy open-source software internship. And writing blog posts every two weeks is part of the internship deal.

I'll be spending the summer working on a project for the Wikimedia Foundation, best known for the modern monument to knowledge that is Wikipedia. You've probably copy-pasted something from there at various points in your life, in one or more of its over 300 languages. It's ok, we all have.

Relevant xkcd

In case you hadn't noticed, this blog post isn't overly serious. But the next thing I'm going to say absolutely is: Wikipedia is an amazing feat of humanity. As of 23 May 2021, there are 6,303,403 articles in the English Wikipedia alone, containing over 3.9 billion words, all contributed by volunteers. The number of articles in Wikipedia is growing by over 17,000 a month. By the time I finish writing this post, over 25 new articles will have popped up that didn't exist when I started.

Unsurprisingly then, Wikipedia contains huge amounts of data and metadata, accessible through a variety of databases, APIs, XML and SQL dump files, and more, all publicly available. How awesome is that? You could just dive right in and start crunching the data, right?
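
If you want to see just how low the barrier to entry is, here's a minimal sketch (mine, not official Wikimedia code) that asks the public MediaWiki Action API for live site statistics, using Python's requests library:

```python
# Minimal sketch: fetch live site statistics from the public MediaWiki
# Action API. The endpoint and parameters are standard MediaWiki API usage.
import requests

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics",
        "format": "json",
    },
)
resp.raise_for_status()
stats = resp.json()["query"]["statistics"]
print(f"English Wikipedia right now: {stats['articles']:,} articles")
```

Run it twice a few minutes apart and the article count will likely have ticked up in between.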

Like most of life's interesting questions, the answer to this one is yes and no. For starters, there's so much data in so many different formats that, even with a very specific research question in mind, it's not always easy to know which data to get from where, or how to then process, say, a couple of 17 GB compressed files to extract just the bits and pieces of information that interest you.
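
To make that concrete, here's a hedged sketch of what digging into one of those big files can look like: streaming through a bzip2-compressed XML dump with Python's standard bz2 module, so the whole thing never has to be decompressed to disk. The filename is the standard English Wikipedia dump name, and a real pipeline would use a proper streaming XML parser rather than this naive line scan:

```python
# Hedged sketch: stream a bzip2-compressed Wikipedia XML dump line by line
# and grab a handful of page titles, without decompressing the whole file.
# A real pipeline would use xml.etree.ElementTree.iterparse or similar.
import bz2

titles = []
with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rt", encoding="utf-8") as dump:
    for line in dump:
        line = line.strip()
        if line.startswith("<title>") and line.endswith("</title>"):
            titles.append(line[len("<title>"):-len("</title>")])
            if len(titles) >= 10:  # a small sample is plenty for a sanity check
                break

print(titles)
```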

And this is where I come in, on a battle horse that weirdly looks like Python, wielding a mighty sword: a bunch of Jupyter notebooks capable of slaying even the ugliest of SQL dragons. Yeah!

True depiction of SQL dragon as a baby, ca. 1974

In words that make more sense, I'll be creating a set of tools and tutorials to make working with Wikipedia data more intuitive and user-friendly. And I'm really, really looking forward to getting down to business. More technical, serious posts will follow soon (promise!), so stay tuned, and stay safe.