In this reading, we'll learn how to create plots from Pandas data. Pandas uses a module called matplotlib to create plots. The matplotlib library is designed to resemble MATLAB (a programming language and environment for matrix computation that supports visualization).
While we could import matplotlib and call its functions directly to plot data, many Pandas methods for Series and DataFrame objects make this easier. The Pandas documentation gives a nice overview of this integration, with more examples than we provide here.
Let's begin by trying to make a pie chart from a Pandas Series.
import pandas as pd
from pandas import Series, DataFrame
# first we'll create a Series with three numbers
s = Series([5000000, 3000000, 2000000])
s
# there are a bunch of methods of the form Series.plot.METHOD for plotting.
# suppose we want a pie plot:
s.plot.pie()
Oops! That's not what we wanted. We created a plot, but it didn't get rendered in the notebook. It turns out that matplotlib is integrated with Jupyter Notebooks, and sometimes we need a special command to tell Jupyter we want to render plots inline. Special Jupyter commands begin with a percent sign ("%"). We recommend putting the following at the beginning of all your notebooks (it's a Jupyter command, not Python code, so it won't work in a regular .py file if you were to try that):
%matplotlib inline
Ok, let's try plotting again.
s.plot.pie()
Now we're getting somewhere! Of course, there are still many issues with this plot (you should adopt the mindset of a critic when we're making plots):
Let's address some of the issues we just saw. First, let's increase the font size. To do this, we'll import matplotlib directly and change the default size. All the defaults are in a dictionary named rcParams in the matplotlib module.
import matplotlib
matplotlib.rcParams["font.size"]
Let's increase to size 18.
matplotlib.rcParams["font.size"] = 18
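Assigning to rcParams changes the default for every later plot in the notebook. If you only want a larger font for one plot, one option is matplotlib's rc_context, which applies settings temporarily. The following is a minimal sketch; the variable names are ours, chosen for illustration.

```python
import matplotlib
import pandas as pd

s = pd.Series([5000000, 3000000, 2000000])

default_size = matplotlib.rcParams["font.size"]

# Inside the "with" block, plots use the larger font;
# when the block ends, the previous setting is restored automatically.
with matplotlib.rc_context({"font.size": 18}):
    ax = s.plot.pie()
    size_inside = matplotlib.rcParams["font.size"]

size_after = matplotlib.rcParams["font.size"]
```

After the block, size_after matches default_size again, so other plots in the notebook are unaffected.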
Second, we can pass a figsize tuple argument to specify the (width, height) in inches. Let's make the pie chart a 6-by-6 inch square.
s = Series([5000000, 3000000, 2000000])
s
s.plot.pie(figsize=(6, 6))
Great! What about the absolute quantities? 95% of the time, it's best to replace a pie chart with a bar plot.
s.plot.bar()
Those x-axis labels are coming from the Series index, which goes 0, 1, 2 (because we created the Series from a list). Let's create the Series from a dictionary to get better categories on the x-axis.
s = Series({"Police":5000000, "Schools":3000000, "Fire":2000000})
s
s.plot.bar()
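A dictionary is not the only way to get meaningful labels. As a small sketch (the name s2 is ours), the same Series can be built from a list of values plus an explicit index= argument, which is handy when the labels and values arrive separately:

```python
import pandas as pd

# Same values as the dictionary version, with the labels given separately
s2 = pd.Series([5000000, 3000000, 2000000],
               index=["Police", "Schools", "Fire"])

# One bar is drawn per entry, labeled with the corresponding index value
ax = s2.plot.bar()
```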
What is the y-axis measuring, and how big are those numbers? Counting zeros is annoying!
Whenever we call Series.plot.plotting_function (where plotting_function might be pie, bar, or similar), it returns an AxesSubplot object. We can call various methods on that to tweak the plot.
millions = s / 1e6
ax = millions.plot.bar()
ax.set_title("Annual Spending")
ax.set_ylabel("Dollars (Millions)")
The above is a fine plot, but remember we're being critics! A few things would help:
You should all read Edward Tufte's books (over break?) to start forming your philosophy of plotting: https://www.edwardtufte.com/tufte/books_vdqi
ax = millions.plot.bar(figsize=(1.5,5), color="0.5") # 0 is black, 1 is white, 0.5 is halfway between
ax.set_title("Annual Spending")
ax.set_ylabel("Dollars (Millions)")
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
Once you have a style you like, you should create a function so that many plots can look similar. One way to do this is to have a function that creates an AxesSubplot object and returns it. The pandas plotting functions can then re-use this customized space. If we import pyplot from matplotlib, we can write such a function. For example:
from matplotlib import pyplot as plt
def get_ax(figsize=(4,4)):
    # create a new figure with one plotting area of the requested size
    fig, ax = plt.subplots(figsize=figsize)
    # hide the right and top borders
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
    return ax
ax=get_ax((1.5, 4))
ax.set_title("Annual Spending")
ax.set_ylabel("Dollars (Millions)")
millions.plot.bar(ax=ax, color="0.5")
We can also create a horizontal bar plot by replacing bar with barh and switching the axis-related calls:
ax=get_ax((4, 1.5))
ax.set_title("Annual Spending")
ax.set_xlabel("Dollars (Millions)")
millions.plot.barh(ax=ax, color="0.5")
In this example, we want to show how many people ride the 5 most popular bus routes in Madison, relative to overall ridership. We'll pull the data from our bus.db database we've used in previous examples.
import sqlite3
c = sqlite3.connect('bus.db')
# let's preview the data
pd.read_sql("SELECT * from boarding LIMIT 10", c)
# we want to see the total ridership per bus route
df = pd.read_sql("SELECT Route, SUM(DailyBoardings) as ridership " +
                 "FROM boarding " +
                 "GROUP BY Route " +
                 "ORDER BY ridership DESC", c)
# let's peek at the first few rows in the results from our query
df.head()
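If you don't have bus.db handy, the query can be tried on a throwaway in-memory database. The sketch below uses a hypothetical miniature boarding table with invented rows (only the two columns used by the query), and also shows the same aggregation done in pandas for comparison.

```python
import sqlite3
import pandas as pd

# Hypothetical miniature "boarding" table; these rows are invented
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE boarding (Route INTEGER, DailyBoardings REAL)")
conn.executemany(
    "INSERT INTO boarding VALUES (?, ?)",
    [(80, 100.0), (80, 50.0), (2, 70.0), (6, 30.0), (6, 20.0)],
)

query = """
    SELECT Route, SUM(DailyBoardings) AS ridership
    FROM boarding
    GROUP BY Route
    ORDER BY ridership DESC
"""
from_sql = pd.read_sql(query, conn)

# The same aggregation done in pandas on the raw rows
raw = pd.read_sql("SELECT * FROM boarding", conn)
from_pandas = (
    raw.groupby("Route")["DailyBoardings"].sum()
       .sort_values(ascending=False)
)
conn.close()
```

Both approaches give one total per route, sorted from busiest to quietest. The SQL version returns a DataFrame with a 0, 1, 2 index, while the pandas version already has the route numbers in the index.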
Now's a good time to stop and think about what form the data is in, and what form we want to get it to.
What we have: a DataFrame of routes and ridership, indexed from 0.
What we want: a Series of the top 5 buses, with route numbers as the index, and ridership as the values.
Why do we want such a Series? Because when we call Series.plot.bar(...) we want a bar plot with five bars. Each bar should be labeled with a bus route (bar labels are pulled from the index of a Series), and the height of each bar should correspond to ridership (bar heights are based on the values in a Series).
The first step to getting the data in the form we want is to re-index df so that the route numbers are in the index (instead of 0, 1, 2, etc). We can do this with the DataFrame.set_index method.
# set_index doesn't change df, but it returns a new
# DataFrame with the desired column as the new index
ridership_df = df.set_index("Route")
ridership_df.head()
# we can pull the (only) ridership column from that DataFrame out
# and keep it as a Series.
ridership = ridership_df['ridership']
ridership.head()
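To make the two steps concrete without the database, here is a small sketch on an invented DataFrame (df_demo and its numbers are ours). It also shows that the two steps can be chained into a single expression.

```python
import pandas as pd

# Invented numbers, standing in for the query result
df_demo = pd.DataFrame({"Route": [80, 2, 6],
                        "ridership": [150.0, 70.0, 50.0]})

indexed = df_demo.set_index("Route")   # returns a new DataFrame
series = indexed["ridership"]          # one column of a DataFrame is a Series

# Shortcut: do both steps in one expression
same = df_demo.set_index("Route")["ridership"]
```

Note that df_demo itself still has its Route column and its original 0, 1, 2 index afterwards.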
Great! Now we have the data in a plottable form. Let's make the bar plot.
ridership.plot.bar(ax=get_ax())
This is somewhat close to the form we want. But we only wanted the top 5 routes (so that we can actually see what is going on!).
ridership.head(5).plot.bar(ax=get_ax())
Not bad, but we would ideally have an "other" category that captures all the routes besides the 80, 2, 6, 10, and 3. How many riders are in this other category?
other_ridership = ridership[5:].sum()
other_ridership
Now, we want to pull out the top 5 to a new Series, then add the other category.
top5 = ridership[:5].copy() # copy, so that adding "other" doesn't touch ridership
top5["other"] = other_ridership
top5
That's exactly what we want! The ridership of the top 5 routes, and the remaining ridership spread across other routes. Let's plot it.
ax = get_ax()
(top5 / 1000).plot.bar(color="k", ax=ax) # "k" is black (because "b" was taken for blue)
ax.set_ylabel("Riders (thousands)")
ax.set_title("Madison Daily Bus Use")
This is exactly what we want. We can see the top route (the 80) is responsible for about one fifth of the ridership. The top 5 routes together are responsible for almost half of all ridership (48%, to be exact). To wrap up, let's make sure we close our connection to bus.db.
c.close()
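The "top N plus other" pattern, and the percentages quoted above, can be reproduced on any sorted Series. Here is a minimal sketch with invented ridership numbers (ridership_demo is ours, not the Madison data):

```python
import pandas as pd

# Invented ridership numbers, already sorted from largest to smallest
ridership_demo = pd.Series(
    [500, 300, 200, 150, 100, 90, 80, 70, 10],
    index=[80, 2, 6, 10, 3, 4, 15, 28, 50],
)

top = ridership_demo.iloc[:5].copy()          # copy so the original is untouched
top["other"] = ridership_demo.iloc[5:].sum()  # everything past the top 5

share = top / top.sum()             # fraction of total ridership per category
top5_share = share.iloc[:5].sum()   # combined fraction for the top 5 routes
```

Dividing by the total turns raw counts into fractions, which is how a statement like "the top 5 routes carry X% of riders" can be checked against the data.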
A scatter plot displays a collection of points along an x-axis and y-axis. Whereas pie charts are one-dimensional (we want to see a distribution of one value, such as ridership), scatter plots are naturally two-dimensional (each point has both an x and y position). Thus, scatter plots are generated from DataFrames (in contrast, pie charts are generated from a Series).
Just as there is a collection of Series.plot.METHOD methods, there is also a collection of DataFrame.plot.METHOD methods (scatter is one of those methods).
Let's begin by plotting some young trees. Each tree has an age (in years), a height (in feet), and a diameter (in inches).
trees = [
{"age": 1, "height": 1.5, "diameter": 0.8},
{"age": 1, "height": 1.9, "diameter": 1.2},
{"age": 1, "height": 1.8, "diameter": 1.4},
{"age": 2, "height": 1.8, "diameter": 0.9},
{"age": 2, "height": 2.5, "diameter": 1.5},
{"age": 2, "height": 3, "diameter": 1.8},
{"age": 2, "height": 2.9, "diameter": 1.7},
{"age": 3, "height": 3.2, "diameter": 2.1},
{"age": 3, "height": 3, "diameter": 2},
{"age": 3, "height": 2.4, "diameter": 2.2},
{"age": 2, "height": 3.1, "diameter": 2.9},
{"age": 4, "height": 2.5, "diameter": 3.1},
{"age": 4, "height": 3.9, "diameter": 3.1},
{"age": 4, "height": 4.9, "diameter": 2.8},
{"age": 4, "height": 5.2, "diameter": 3.5},
{"age": 4, "height": 4.8, "diameter": 4},
]
df = DataFrame(trees)
df
Let's plot this data and see if there seems to be any connection between tree age and tree height. We can create plots like this: df.plot.scatter(x=FIELD1, y=FIELD2).
# you can choose which field is represented on the x-axis
# and which is represented on the y-axis.
df.plot.scatter(x='height', y='age', ax=get_ax(figsize=(8, 4)))
Although the above plot is informative (we can see that older trees are generally taller), it's not the easiest way to visualize the information. In general, people are accustomed to seeing time-related data on the x-axis (age is a type of time). Thus, a more intuitive plot would reverse the axes:
df.plot.scatter(x='age', y='height', ax=get_ax(figsize=(8, 4)))
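One way to double-check the visual impression that older trees are taller is to average the heights within each age. This is a sketch using a subset of the tree records (trees_demo is our name for it):

```python
import pandas as pd

# A subset of the tree records, keeping only age and height
trees_demo = pd.DataFrame([
    {"age": 1, "height": 1.5},
    {"age": 1, "height": 1.9},
    {"age": 2, "height": 2.5},
    {"age": 2, "height": 3.0},
    {"age": 3, "height": 3.2},
    {"age": 4, "height": 4.9},
    {"age": 4, "height": 5.2},
])

# One mean height per age, with age as the index
mean_height = trees_demo.groupby("age")["height"].mean()

ax = mean_height.plot.line(marker="o")
ax.set_xlabel("age (years)")
ax.set_ylabel("mean height (feet)")
```

Because the result is a Series indexed by age, the Series plotting methods from earlier in the reading apply directly.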
We can also control the color (with the c argument) and size (with the s argument) of the points:
df.plot.scatter(x='age', y='height', c='black', s=50, ax=get_ax(figsize=(8, 4)))
If we want, we can also use our data to determine the size and color of each point by passing a Series for these. For example, suppose we want tall trees to be represented by large, black circles, and short trees to be represented by small gray dots. We can pull out a display Series to control this.
display = df['height'] * 25
display
df.plot.scatter(x='age', y='height', c=display, s=display, ax=get_ax(figsize=(8, 4)))
matplotlib scaled the colors such that the smallest value corresponds to white and the largest corresponds to black. This means we can't see the smallest dot on the white background. We can use the vmin and vmax params to choose our own limits, making sure all data falls somewhere in the visible range:
ax = get_ax(figsize=(8, 4))
df.plot.scatter(x='age', y='height', c=display, s=display, vmin=display.min()-50, ax=ax)
Bigger numbers in the display Series (or any Series used with c=) result in darker dots, and bigger numbers also determine the size of the dots.
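Here is a self-contained sketch of setting both color limits, using a four-tree stand-in for the data. We pass cmap="Greys" explicitly so the white-to-black scale does not depend on defaults, and the offset of 2 below the minimum is an arbitrary choice of ours.

```python
import pandas as pd

trees_demo = pd.DataFrame({
    "age": [1, 2, 3, 4],
    "height": [1.5, 2.5, 3.2, 5.2],
})

# Put the bottom of the color scale below the smallest height, so even the
# shortest tree gets a visible shade; vmax is pinned to the tallest tree.
ax = trees_demo.plot.scatter(
    x="age", y="height",
    c="height", s=trees_demo["height"] * 25,
    cmap="Greys",
    vmin=trees_demo["height"].min() - 2,
    vmax=trees_demo["height"].max(),
)

# The limits actually used for the color scale
color_limits = ax.collections[0].get_clim()
```

Lowering vmin further makes the lightest dots darker, at the cost of less contrast between short and tall trees.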
The above plot is an example of a redundant visualization. It's redundant because three characteristics of the plot (y-axis, dot color, and dot size) are all being used to communicate the same characteristic of the data (height). This is a wasteful use of plot characteristics, considering that we didn't communicate anything about tree diameter.
When you're thinking about how to visualize your data, you should list the interesting attributes of your data as well as the dimensions you can communicate with the plot. Then carefully consider how to use each dimension of your plot to communicate something interesting.
In this case, we have three attributes in our data: age, height, and diameter.
We also have four dimensions we can control in our plot: x-axis position, y-axis position, dot size, and dot color.
Which dimensions should communicate which attributes? Exact opinions will vary, but some combinations are more effective than others.
One good combination is x-axis for age, y-axis for height, and dot size for diameter. As mentioned before, the x-axis is often used to communicate an elapsed time (an age). It also feels very natural for a vertical axis to communicate height. Finally, dot size seems like a better fit for diameter than color for two reasons. First, it is intuitive to pair a spatial attribute with a spatial characteristic of the plot. Second, color is often a tricky dimension to use. Gray can be too light to see, printed copies will vary in how good they look, and finally you need to think about how accessible your plots will be for color-blind readers.
Let's see how the data looks with our final choices:
ax = get_ax(figsize=(8, 4))
df.plot.scatter(x='age', y='height', s=df['diameter'], c='black', ax=ax)
Those dots are a little small, and the visualization is relative (there's no scale specifying what a given dot size means), so we're free to multiply the diameter by a number of our choosing to make it more aesthetically pleasing.
ax = get_ax(figsize=(8, 4))
df.plot.scatter(x='age', y='height', s=df['diameter'] * 40, c='black', ax=ax)
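One detail worth knowing when mapping data to dot size: matplotlib interprets s as the marker's area (in points squared), not its width. With s=df['diameter'] * 40, a tree with twice the diameter gets a dot with twice the area, which is only about 1.4 times as wide. If you would rather have the dot's width track the trunk diameter, one option is to square the values. The sketch below uses a three-tree stand-in, and scale is a hypothetical factor we picked for illustration.

```python
import pandas as pd

trees_demo = pd.DataFrame({
    "age": [1, 2, 4],
    "height": [1.5, 2.5, 4.8],
    "diameter": [1.0, 2.0, 4.0],
})

scale = 6  # hypothetical: points of dot width per inch of trunk diameter

# s is an area, so squaring makes the width of each dot
# (rather than its area) proportional to diameter.
sizes = (trees_demo["diameter"] * scale) ** 2
ax = trees_demo.plot.scatter(x="age", y="height", s=sizes, c="black")
```

Which mapping reads better is a judgment call; the point is to choose deliberately between area and width.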
An Iris is a type of flowering plant. There is a very popular dataset (often used in machine learning examples) containing a description of the dimensions of 150 Iris plants from 3 different types of Iris.
The dataset is at https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data, and a description of the data is here. We will plot the data for each of the three types of Iris to visually identify patterns in the data.
As a first step, let's try fetching and loading the CSV data.
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data')
df.head()
Oops! There's no CSV header in that file, so we have to tell Pandas what each field means. From the documentation, we see the following:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica
Let's try again using names=.
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
names=['sepal-len', 'sepal-wid', 'petal-len', 'petal-wid', 'name'])
df.head()
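The effect of names= is easy to see offline. This sketch inlines the first three rows of iris.data as a string (io.StringIO lets read_csv treat a string like a file) and loads it both ways:

```python
import io
import pandas as pd

# First three rows of iris.data, inlined so the example runs offline
raw = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)
cols = ['sepal-len', 'sepal-wid', 'petal-len', 'petal-wid', 'name']

# Without names=, the first data row is consumed as the header
no_names = pd.read_csv(io.StringIO(raw))

# With names=, every row is kept as data
with_names = pd.read_csv(io.StringIO(raw), names=cols)
```

Without names=, we silently lose one flower and end up with column labels like 5.1 and Iris-setosa.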
Great! Now let's see how many flowers there are of each type. We can use the Series.value_counts method on the name column Series. You should think of value_counts as equivalent to a GROUP BY with a COUNT in SQL.
iris_types = df["name"].value_counts()
iris_types
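To see that value_counts is counting rows per group, here is a small sketch on an invented list of names, compared against an explicit pandas groupby:

```python
import pandas as pd

names = pd.Series(["Iris-setosa", "Iris-virginica", "Iris-setosa",
                   "Iris-versicolor", "Iris-setosa", "Iris-virginica"])

counts = names.value_counts()           # sorted from most to least common
grouped = names.groupby(names).size()   # same counts, sorted by label
```

The two results hold the same numbers; they differ only in how the rows are ordered.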
# make one scatter plot per Iris type
for name in iris_types.index:
    rows = df[df['name'] == name]
    rows.plot.scatter(x='petal-len', y='petal-wid', title=name)
What if we want just one plot showing all three flower types? To distinguish them, we would want the dots for each flower type to be a different color. A plot call where we don't pass ax returns a new AxesSubplot object, and we can then pass that same object to each later plot call so that they all draw in the same region.
ax = df[df['name'] == 'Iris-setosa'].plot.scatter(x='petal-len', y='petal-wid', c='blue')
df[df['name'] == 'Iris-versicolor'].plot.scatter(x='petal-len', y='petal-wid', c='green', ax=ax)
df[df['name'] == 'Iris-virginica'].plot.scatter(x='petal-len', y='petal-wid', c='black', ax=ax)
ax = get_ax(figsize=(8, 4))
df[df['name'] == 'Iris-setosa'].plot.scatter(x='petal-len', y='petal-wid', c='blue', ax=ax)
df[df['name'] == 'Iris-versicolor'].plot.scatter(x='petal-len', y='petal-wid', c='green', ax=ax)
df[df['name'] == 'Iris-virginica'].plot.scatter(x='petal-len', y='petal-wid', c='black', ax=ax)
From this plot, we can make several observations:
What about sepal size? We leave that as an exercise for you!
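When several groups share one plot, the near-identical plotting calls can be folded into a loop. The sketch below uses a few invented measurements in place of the full Iris DataFrame, and a colors dictionary of our own choosing; it also passes label= so that a legend can say which color is which.

```python
import pandas as pd
from matplotlib import pyplot as plt

# A few invented measurements, standing in for the full Iris DataFrame
iris_demo = pd.DataFrame({
    "petal-len": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "petal-wid": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "name": ["Iris-setosa", "Iris-setosa",
             "Iris-versicolor", "Iris-versicolor",
             "Iris-virginica", "Iris-virginica"],
})
colors = {"Iris-setosa": "blue",
          "Iris-versicolor": "green",
          "Iris-virginica": "black"}

fig, ax = plt.subplots(figsize=(8, 4))
for name, color in colors.items():
    rows = iris_demo[iris_demo["name"] == name]
    # label= gives each group an entry in the legend
    rows.plot.scatter(x="petal-len", y="petal-wid",
                      c=color, label=name, ax=ax)
ax.legend()
```

Adding a fourth group then only requires a new entry in the dictionary.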
In this reading, we have learned how to create a bar plot from a Pandas Series and a scatter plot from a Pandas DataFrame. We have also discussed the decision-making process for choosing which plot characteristics should represent which data attributes. Finally, we learned about AxesSubplot objects and how to use them to plot multiple datasets in the same region. We did all of this in the context of two example datasets: the Madison Metro dataset and the popular Iris dataset.