4 min read

Halfway to Somewhere

Halfway to Somewhere

This is the fourth in a series of six blog posts connected to the Outreachy internship.

I can't believe the internship is already at its midpoint. Whoa! Where did the time go? Stop the train, I want to get off! I mean, NO, I want to keep riding it!! What train? Can't you see it's a boat?! Isn't more attention to detail required from someone who wants to be a software developer or engineer or something??

Ok, serious now. I'm supposed to write about my project. What have I learned so far, and what would I do differently if I could time-travel back to day one and start over?

This post will focus on one of my internship tasks: creating a Python library to deal with those big, scary SQL dump files the Wikimedia machinery spits out a few times a month. This is to spare people from writing custom scripts every time they want to access Wikimedia project data and process it offline (instead of directly querying a database replica or API endpoint, for instance). After all, far from everyone who wants to work with this data is a developer or pure-bred data scientist. Many are researchers, teachers, journalists, Wikipedians, students..., each with a different technical background and skill level.

So what if, instead of dealing with the messiness of parsing SQL, users could focus on the data directly?

$ pip install awesome-sql-parser

import awesome-sql-parser

sql_data = awesome-sql-parser.load('scary_file.sql')
sql_data.do_magic()
For illustration purposes only, actual product may vary! Also, please don't put hyphens in real package/module names...!! 

But to reach this point of SQL bliss, it's not enough to throw some code inside a GitHub repo. Nope, it needs to be turned into a proper package and distributed through PyPI. That's how you can install it with pip. By the way – I just learned that PyPI is pronounced Pie-Pee-Eye. This means that I've been embarrassing myself for 10+ years, making it rhyme with Pie. Not like PyPI comes up in real-world conversation all the time, but still.

If you've ever programmed in Python, you're maybe familiar with the Zen of Python or PEP 20. If not, try this in your favorite REPL or IDE:

>> import this

And you will get a list containing 19 aphorisms. Here are the first four:

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
...

One of these aphorisms is the famous There should be one-- and preferably only one --obvious way to do it. Unfortunately, when it comes to packaging, this is not the case. Should you configure your build in setup.py or dare the more modern pyproject.toml+ setup.cfg combo? Or have all of them? Hardcode your metadata in the setup files, or import it from a separate file? Structure your project with a src folder or without? You'd think there's some industry consensus around this. And there kind of is. But also kind of not, because what's considered best practices is ever-evolving and subject to opinion, and each alternative has trade-offs that are sometimes too subtle for us less experienced folks.

And this is just (part of) the packaging aspect of it all. On top of this, there's testing the build in different environments (tox), pre-commit hooks, docs, CI... Like Russian dolls, these things are commonly set up in a nested fashion. As an example, you can set up a GitHub Action (CI) to run each time you want to merge a branch into main. Your action workflow calls your tox tests. Tox calls pre-commit, then pre-commit calls your linters and formatters. If all goes well, the indicators turn green. If not, your off debugging. This neverending automation of testing and quality assurance workflows is part of what is called DevOps.

All this is to say that I've been Learning Python Packaging the Hard Way over the last month-and-a-half, and if I had to do it again, my workflow would be a bit different. Below is a list with some recommendations I'd send to my past self if I could. Just beware that none of it is presented as the one true way, nor is it written from the summit of Mt. Knowledge – I do know more now than a few weeks ago, however.

Writing your first Python package? Maybe this will help!

  • If you want your package to be editable (hint: you do!), you'll need at least a minimal setup.py file, even if your build info is in pyproject.toml.
  • Use a src/projectname folder for your modules. Have separate folders for tests and docs in the project's root folder. Put a (for now) empty __init__.py file in both  src/projectname and tests .
  • As soon as you've created the basic folder and file structure as well as configured setup.py and pyproject.toml, install your package locally with pip install -e . from your project's root folder. With this, you should, in theory, not run into import problems.
  • By the way, I'm happily assuming you're using a virtual environment and have initialized git already!
  • The time to set up all the code quality stuff is now. Think flake8, mypy, isort, black, etc. Whatever you've decided to use in your project, configure your pre-commit yaml file with it now.
  • Don't bother with TDD (test-driven development) unless you're experienced enough to know exactly how to structure your functions and classes from the very start. If you refactor your code over and over because you learn better ways to do it as you go, rewriting your tests all the time will become more tedious and frustrating than helpful very, very fast.
  • Proper documentation can wait until the end. As with testing, if you have to rewrite your doctrings every day because your code is constantly changing, that's no good. Keep it minimal, although not so minimal that you get lost in your own code. And think about the poor people who have to review your code.

There's, of course, much more to be said about developing a Python package than this. Still, even though I wish I'd known more when I started than I did, I also feel there's a certain value in struggle-driven development (!) when you have the luxury to indulge in it, in much the same way that there's a difference between learning by doing vs. book learning. Of course, it would have saved some time if someone had just told me, "oh, just configure your file with the parameters f, t, and x," instead of me painstakingly going through the whole alphabet. But now I also know something about a, and b, and all the way up to z. And that, to me, is priceless.