Tutorial Thursday: Jupyter Notebooks

Every Thursday, our advent calendar turns into a Journocode tutorial. Today, squirrel Elena explains how Jupyter Notebooks can help you with your data wrangling.

When I start a new project, I usually open a Jupyter Notebook. It is one of my favourite tools to quickly inspect and wrangle data. Jupyter Notebooks are very flexible: they mix code, data and visualization and, on top of that, they make it extremely easy to comment and write in your scripts. That is also why they are great for learning new programming skills.

In fact, this tutorial itself is a Jupyter Notebook.

What is a Jupyter Notebook?

Jupyter Notebook is a development environment. When you program, you choose a programming language to write your code in. To actually write the code, you can just type in the command line or a simple text editor. Yet it can be a whole lot more comfortable to use a more advanced editor: a development environment. They come in all shapes and sizes, from text editors like Sublime or Atom to full-blown environments like PyCharm or Eclipse. Some of them are suited to just one programming language, others work for several different languages.

What makes Jupyter stand out is its interactivity. I find it most similar to RStudio, a popular development environment for R. You can define variables and execute parts of your code and then continue your programming without re-executing the entire program. This is why I really like to use Jupyter Notebooks for prototyping. It is very quick to use and optimized for working with data. However, when I am working on a bigger project, I will usually go back to a different environment once I am done with my first data inspection.

Jupyter Notebooks can be used with a whole range of programming languages: from JavaScript to R to Haskell, there is most likely a version for Jupyter. I personally like Python, so the examples in this tutorial will be Python code. The tutorial is not about the language, though, so if you want to use a different one, everything you learn here should still apply.

Getting started

While Jupyter Notebooks are suitable for different programming languages, you will need to install Python before you can install Jupyter. You can find a neat tutorial on how to install both here: http://jupyter.org/install

In fact, there are two ways to install Jupyter. While the Jupyter documentation recommends Anaconda, I personally prefer pip. However, that might just be a matter of taste.

Once everything is up and running, use your command line to navigate to the folder you want to work in (using cd) and then type jupyter notebook to start Jupyter.

In [ ]:
cd foldername
jupyter notebook

Your browser will open Jupyter's start page, which lists the files in that folder.

Just click "New" in the top right corner, choose your programming language (I will use Python 3) and a new Jupyter Notebook will open.

In there, you will find an empty cell, just like the one below. Just type in your code and you're already set. To run what you typed, press Shift and Enter and the result will pop up underneath.

In [30]:
print('hello world')
          
hello world
          

Congratulations, you can now use Jupyter Notebooks for programming.

Programming in Jupyter

I like to reserve the first cell of a notebook for all my imports. This way, I can load all necessary libraries at the beginning. When I return to my notebook, I just run that cell with Shift and Enter and all the libraries are loaded at once. When I'm working with data, this usually means loading pandas, numpy and matplotlib, so let's try:

In [5]:
#import pandas, numpy and matplotlib
import pandas
import numpy as np
from matplotlib import pyplot as plt

If you want to use a package that you haven't worked with before, you first have to install it. If you have programmed in Python before, you will be familiar with this process. While there are several ways to do so, I like using pip. You can either install the package from outside the notebook, just as you usually would from your command line, or you can run commands from inside Jupyter by typing an exclamation mark before the command. (This works for any shell command and is not limited to installing Python packages.)

In [4]:
!pip3 install pandas
          
Requirement already satisfied: pandas in /Users/erdmann/.local/share/virtualenvs/tutorials-wJk9OwBc/lib/python3.7/site-packages (0.23.4)
Requirement already satisfied: numpy>=1.9.0 in /Users/erdmann/.local/share/virtualenvs/tutorials-wJk9OwBc/lib/python3.7/site-packages (from pandas) (1.15.4)
Requirement already satisfied: pytz>=2011k in /Users/erdmann/.local/share/virtualenvs/tutorials-wJk9OwBc/lib/python3.7/site-packages (from pandas) (2018.7)
Requirement already satisfied: python-dateutil>=2.5.0 in /Users/erdmann/.local/share/virtualenvs/tutorials-wJk9OwBc/lib/python3.7/site-packages (from pandas) (2.7.5)
Requirement already satisfied: six>=1.5 in /Users/erdmann/.local/share/virtualenvs/tutorials-wJk9OwBc/lib/python3.7/site-packages (from python-dateutil>=2.5.0->pandas) (1.11.0)

You see that I already have pandas installed on my machine. That is no surprise, since this package is extremely useful for any Python programmer working with data.
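
One caveat when installing from inside a notebook: the pip3 on your command line is not guaranteed to belong to the same Python installation that runs your notebook. A minimal sketch of one common workaround, assuming you want the package installed for exactly the interpreter behind the current notebook, is to call pip through sys.executable. The helper name pip_install_command is made up for this example.

```python
import sys

def pip_install_command(package):
    """Build a pip command that targets the Python running this code.

    pip_install_command is a hypothetical helper for illustration only.
    """
    # sys.executable is the path of the interpreter that executes this cell,
    # so "-m pip" installs into that interpreter's environment.
    return [sys.executable, "-m", "pip", "install", package]

command = pip_install_command("pandas")
print(" ".join(command))

# To actually run the installation, uncomment the next two lines:
# import subprocess
# subprocess.run(command, check=True)
```

The command is only printed here, so you can check which Python it points to before installing anything.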

Good news: Jupyter Notebooks really are optimized for working with data and have great integrations for pandas and matplotlib. Let's use pandas to build a dataframe with all the advent calendar doors so far:

In [7]:
#create a data frame with the guest posts in the Journocode advent calendar

df = pandas.DataFrame(
    [
        [1, 'How visual journalism can explain complex problems', 'Alvin Chang'],
        [2, 'Squirrel talk: "Make it clear that you are not their personal system administrator"', 'Simon Haas'],
        [3, 'Making of "20 years 20 titles" – a data journalistic analysis of Roger Federer’s career', 'Angelo Zehr'],
        [4, 'A case for more sophisticated visuals', 'Gianna-Carina Grün'],
        [5, 'Elepost – a crowdsourced analysis of electoral posters', 'Maximilian Zierer'],
    ],
    columns=['day', 'title', 'author'],
)

df.head()
Out[7]:
   day  title                                              author
0    1  How visual journalism can explain complex prob...  Alvin Chang
1    2  Squirrel talk: "Make it clear that you are not...  Simon Haas
2    3  Making of "20 years 20 titles" – a data journa...  Angelo Zehr
3    4  A case for more sophisticated visuals              Gianna-Carina Grün
4    5  Elepost – a crowdsourced analysis of electoral...  Maximilian Zierer

See how neatly Jupyter displays that dataframe?

You might have noticed that I only wrote df.head() and did not tell Jupyter to print it. Still, the dataframe is displayed in the output cell. Whenever the last line of a notebook cell is an expression that returns a value, Jupyter displays that value in the output.
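
To make that rule concrete, here is a small sketch of a single cell, using a few author names from the dataframe. Only the last expression is displayed automatically; anything earlier in the cell needs print.

```python
# Imagine all of this sits in one notebook cell.
authors = ['Alvin Chang', 'Simon Haas', 'Angelo Zehr']

len(authors)                      # evaluated, but not shown: it is not the last line
print(authors[0])                 # shown, because print writes it out explicitly
sorted_authors = sorted(authors)  # an assignment never produces output
sorted_authors                    # shown by Jupyter: last expression in the cell
```

If you run the same lines as a normal Python script instead, only the print call produces output.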

As mentioned before, one of the main advantages of Jupyter Notebooks is their interactivity. Now that we have loaded the posts into the dataframe df, we can access them in every other cell that we're using, for example to print out a list of all the authors:

In [9]:
#get all the authors' names
df['author']
Out[9]:
0           Alvin Chang
1            Simon Haas
2           Angelo Zehr
3    Gianna-Carina Grün
4     Maximilian Zierer
Name: author, dtype: object

One advantage of coding in several cells is that when you're defining variables through more complicated and longer calculations, Jupyter will keep them set when you continue with your programming. That way, you don't have to run the calculations again when you need the variables. So, you might load and clean a big dataset, and while you are still trying to fix a plot, the data will stay there at your fingertips.
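
As a rough sketch of that workflow, imagine the following code split across two cells. The function slow_word_count and the short sleep are stand-ins invented for this example; in practice the slow step would be loading and cleaning your data.

```python
import time

# --- Cell 1: the slow step, run once ---
titles = [
    'How visual journalism can explain complex problems',
    'A case for more sophisticated visuals',
]

def slow_word_count(texts):
    """Hypothetical stand-in for an expensive calculation."""
    time.sleep(0.5)  # pretend this takes a long time
    return {text: len(text.split()) for text in texts}

word_counts = slow_word_count(titles)

# --- Cell 2: rerun as often as you like ---
# word_counts is still in memory, so tweaking the lines below
# does not trigger the slow calculation again.
longest_title = max(word_counts, key=word_counts.get)
print(longest_title, word_counts[longest_title])
```

The flip side is that the variables only reflect the cells you have actually run, so after restarting the notebook you need to run the first cell again.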

Jupyter doesn't suggest completions automatically while you type, but when you type the name of a library and press Tab after the ".", it will show you all available functions and attributes. You can also look up the documentation by putting a question mark in front of a function, for example ?pandas.DataFrame.
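
If you ever work outside a notebook, plain Python has rough equivalents of both tricks: dir() lists the names an object offers, and help() or the __doc__ attribute shows its documentation. A small sketch, using the standard library's json module only so that it runs without extra installs:

```python
import json

# Roughly what Tab shows after "json.": the public names of the module
public_names = [name for name in dir(json) if not name.startswith('_')]
print(public_names)

# Roughly what ?json.dumps shows: the function's documentation
first_doc_line = json.dumps.__doc__.strip().splitlines()[0]
print(first_doc_line)
```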

Writing in your notebook

Typing in Jupyter is not limited to coding, though. See that little drop-down box at the top right? By default, it says Code. But if you change it to Markdown, you can write text (like I did in this tutorial) and even use HTML markup to style your writing. For instance,

In [4]:
<font color='#09aaae'><b>Journocode</b></font>
          

will give you Journocode in our custom color. If you are writing a heading, start the line with a hash sign and a space ("# ") and everything you write in that line will be big and bold.

If you are a real nerd like me, you might be happy to hear that Jupyter even works with LaTeX. To get Bayes' theorem

$ P( A \mid B ) = \frac{P(B \mid A)\, P(A)}{P(B)} $,

you need to type

In [ ]:
\\( P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \\)
          

Finally, you might want to save and share your notebook. When you click File -> Download as, you will get several options: you can save it as a Jupyter Notebook, export just the Python code, or export the whole document as an HTML file. I used the latter to write this tutorial.
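
It can help to know what you are sharing: a saved notebook (the .ipynb file) is a JSON document that contains a list of cells. The sketch below builds a tiny notebook-like structure by hand and pulls out the code cells, which is roughly the idea behind exporting only the Python code. It assumes the common version 4 notebook layout, and the helper extract_code is invented for this example; it is not how Jupyter's own export is implemented.

```python
import json

# A hand-written, minimal stand-in for the contents of an .ipynb file
# (assumption: notebook format version 4).
notebook_text = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# My notes"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["print('hello world')"]},
    ],
})

def extract_code(ipynb_text):
    """Return the source of all code cells (hypothetical helper)."""
    notebook = json.loads(ipynb_text)
    code_cells = []
    for cell in notebook["cells"]:
        if cell["cell_type"] == "code":
            source = cell["source"]
            # "source" may be stored as one string or as a list of lines
            if isinstance(source, list):
                source = "".join(source)
            code_cells.append(source)
    return code_cells

code = extract_code(notebook_text)
print(code)
```

To try it on a real notebook, you could read the file's text with open(...).read() and pass that to extract_code instead of notebook_text.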

One more surprise: Jupyter Notebooks can also do magic! There are a couple of useful and quick commands. Check them out for yourself by calling:

In [6]:
%magic
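
One magic you will probably meet early on is %timeit, which measures how long a statement takes to run. Magics only work inside Jupyter and IPython, so as a rough, runnable approximation, here is the same idea with Python's built-in timeit module. The statement being timed and the number of repetitions are arbitrary choices for this example.

```python
import timeit

# In a notebook you could write:  %timeit sum(range(1000))
# Outside Jupyter, the timeit module does a similar job.
runs = 1000
total_seconds = timeit.timeit('sum(range(1000))', number=runs)

print(f'{total_seconds / runs * 1e6:.2f} microseconds per run')
```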
          


About

Elena Erdmann

Elena is a data journalist at Zeit Online and a data journalism trainer at Journocode. Her background is in theoretical computer science and maths, yet today she prefers to apply her data skills to societal issues.

Runs on:

How many times per week do you have to explain what "data journalism" is?
I changed to data journalism so that I could finally work in a field that people would actually understand.

How many screens do you have on your desk?
Two and a smartphone. Yet I only ever look at the small laptop screen; the big one never worked for me.

How many items are on your desktop?
Right now? 115. Don't ask me about the number of browser tabs.

Swear words per day?
The only time I swear every day is when I'm on my bike.

Your funniest file name?
Checking my desktop right now. I think diskokugel.jpg and whistleblower.gif could compete for that.

© 2018 Journocode