Line and Bar Plots

Tyler Caraza-Harter

Previously, we learned how to create matplotlib pie charts and scatter plots by calling Pandas plotting methods for Series and DataFrames.

In this document, we'll learn how to create line plots and bar plots as well.

Let's start by doing our matplotlib setup and usual imports:

In [1]:
%matplotlib inline
In [2]:
import pandas as pd
from pandas import Series, DataFrame

For readability, you may also want to increase the default font size at the start of your notebooks. You can do so by copy/pasting the following:

In [3]:
import matplotlib
matplotlib.rcParams.update({'font.size': 15})

Line Plot from a Series

We can create a line plot from either a Series (with s.plot.line()) or a DataFrame (with df.plot.line()).

In [4]:
s = Series([0,100,300,200,400])
s
Out[4]:
0      0
1    100
2    300
3    200
4    400
dtype: int64
In [5]:
s.plot.line()
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x116ea8b00>

The y-values are clearly the values in the Series, but where are the x-values coming from? You guessed it: the Series' index. Let's try the same values with a different index.

In [6]:
s = Series([0,100,300,200,400], index=[1,2,30,31,32])
s
Out[6]:
1       0
2     100
30    300
31    200
32    400
dtype: int64
In [7]:
s.plot.line()
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x11721c668>

Now we see that the plot starts from 1 (instead of 0) and a bigger gap in the index (between 2 and 30) corresponds to a bigger line segment over the x-axis.
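To double-check where the coordinates come from, we can inspect the line that was drawn. This is a minimal sketch, assuming a numeric index is passed straight through as the x-coordinates; the Agg backend line is only there so the code also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this in a notebook using %matplotlib inline
import pandas as pd

s = pd.Series([0, 100, 300, 200, 400], index=[1, 2, 30, 31, 32])
ax = s.plot.line()

# the returned axes object keeps a list of the lines drawn on it
line = ax.lines[0]
x_values = list(line.get_xdata())  # expected to match the Series index
y_values = list(line.get_ydata())  # expected to match the Series values
print(x_values)
print(y_values)
```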

What happens if our index is not in order?

In [8]:
s = Series([0,100,300,200,400], index=[1,11,2,22,3])
s
Out[8]:
1       0
11    100
2     300
22    200
3     400
dtype: int64
In [9]:
s.plot.line()
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x1172df550>

Oops! That's probably not what we want. 99% of the time, people making a line plot want readers to be able to look up a single y-value (per line) given a point along the x-axis. So even though this line passes through all of our data points, the segments between the points are very misleading.

If your data isn't already sorted, you'll probably want to sort it by the index first:

In [10]:
s.sort_index()
Out[10]:
1       0
2     300
3     400
11    100
22    200
dtype: int64

Don't get confused about this function! If we have a Python list L and we call L.sort(), the items in L are rearranged in place and the sort function doesn't return anything.

In contrast, if we have a Pandas Series s and we call s.sort_index(), the items in s are not moved, but the sort_index function returns a new Series that is sorted. So if we print s again, we see the original (unsorted) data:

In [11]:
s
Out[11]:
1       0
11    100
2     300
22    200
3     400
dtype: int64
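The contrast between the two kinds of sorting can be seen side by side in a small sketch. The list L and the variable names here are only for illustration:

```python
import pandas as pd

# list.sort() rearranges the list in place and returns None
L = [3, 1, 2]
result = L.sort()

# Series.sort_index() leaves the original alone and returns a new Series
s = pd.Series([0, 100, 300], index=[1, 11, 2])
sorted_s = s.sort_index()

print(result, L)
print(list(s.index), list(sorted_s.index))
```

If you want to keep the sorted version, assign it to a variable (for example, s = s.sort_index()).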

Because sort_index() returns a new Series and we can call .plot.line() on a Series, we can do the following on an unsorted Series s in one step:

In [12]:
s.sort_index().plot.line()
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x1173b2198>

Line Plot from a DataFrame

In addition to the Series.plot.line() method, there is also a DataFrame.plot.line() method. Whereas the line function for a Series creates a plot with a single line, the line plot for a DataFrame draws a line for each column in the DataFrame (remember that each column in a DataFrame is essentially just a Series).

Let's try with a DataFrame containing temperature patterns for Madison, WI. The data was copied from https://www.usclimatedata.com/climate/madison/wisconsin/united-states/uswi0411, and contains the typical daily highs and lows for each month of the year.

In [13]:
df = DataFrame({
    "high": [26, 31, 43, 57, 68, 78, 82, 79, 72, 59, 44, 30],
    "low": [11, 15, 25, 36, 46, 56, 61, 59, 50, 39, 28, 16]
})

df
Out[13]:
    high  low
0     26   11
1     31   15
2     43   25
3     57   36
4     68   46
5     78   56
6     82   61
7     79   59
8     72   50
9     59   39
10    44   28
11    30   16
In [14]:
df.plot.line()
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x1174908d0>

Not bad! We can see the temperatures vary throughout the year, with highs correlated with lows. But what is the x-axis? What is the y-axis?

Remember that calling a plotting method returns an AxesSubplot object. There are AxesSubplot.set_xlabel and AxesSubplot.set_ylabel functions that will help us out here. Just make sure to call them in the same cell where .plot.line is called, or the plot will be displayed before they can have an effect.

In [15]:
ax = df.plot.line()
ax.set_xlabel('Month')
ax.set_ylabel('Temp (Fahrenheit)')
Out[15]:
Text(0,0.5,'Temp (Fahrenheit)')

What if we want the plot in Celsius? That's easy enough with some element-wise operations.

In [16]:
c_df = DataFrame()
c_df["high"] = (df["high"] - 32) * (5/9)
c_df["low"] = (df["low"] - 32) * (5/9)
c_df
Out[16]:
         high        low
0   -3.333333 -11.666667
1   -0.555556  -9.444444
2    6.111111  -3.888889
3   13.888889   2.222222
4   20.000000   7.777778
5   25.555556  13.333333
6   27.777778  16.111111
7   26.111111  15.000000
8   22.222222  10.000000
9   15.000000   3.888889
10   6.666667  -2.222222
11  -1.111111  -8.888889
In [17]:
ax = c_df.plot.line()
ax.set_xlabel('Month')
ax.set_ylabel('Temp (Celsius)')
Out[17]:
Text(0,0.5,'Temp (Celsius)')

That's looking good!

One small thing: did you notice the extra print above the plot that says Text(0,0.5,'Temp (Celsius)')? That happened because the call to set_ylabel returned that value. We could always put None at the end of our cell to suppress that:

In [18]:
ax = c_df.plot.line()
ax.set_xlabel('Month')
ax.set_ylabel('Temp (Celsius)')
None
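As an aside on the Celsius conversion above: when every column needs the same formula, the element-wise arithmetic can be applied to the whole DataFrame at once instead of one column at a time. A minimal sketch, using only the first three months of the data:

```python
import pandas as pd

df = pd.DataFrame({
    "high": [26, 31, 43],
    "low": [11, 15, 25]
})

# arithmetic on a DataFrame is applied to every cell in every column
c_df = (df - 32) * 5 / 9
print(c_df)
```

This only makes sense when all the columns hold Fahrenheit temperatures; with mixed columns, converting column by column (as above) is safer.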

Tick Labels

The above plot would be nicer if we saw actual month names along the x-axis. Let's create a DataFrame with the same data, but month names for the index.

In [19]:
df = DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
    "high": [26, 31, 43, 57, 68, 78, 82, 79, 72, 59, 44, 30],
    "low": [11, 15, 25, 36, 46, 56, 61, 59, 50, 39, 28, 16]
})

df = df.set_index("month")

df.head()
Out[19]:
       high  low
month
Jan      26   11
Feb      31   15
Mar      43   25
Apr      57   36
May      68   46

Let's try plotting it.

In [20]:
ax = df.plot.line()
ax.set_xlabel('Month')
ax.set_ylabel('Temp (Fahrenheit)')
None

Unfortunately, even though we now have months for the index, matplotlib won't use them for the x-axis unless we specifically tell it to. We can explicitly give matplotlib tick labels with the set_xticklabels method.

In [21]:
# careful, this is an example of a bad plot!
ax = df.plot.line()
ax.set_xticklabels(df.index)
None

Yikes! That's not what we wanted at all. The above plot starts at Feb (instead of Jan), and it only covers half a year. We've set the tick labels, but not the tick positions. Let's take a look at the positions:

In [22]:
ax.get_xticks()
Out[22]:
array([-2.,  0.,  2.,  4.,  6.,  8., 10., 12.])

You should read the above as follows:

  • the first tick label (Jan) is drawn at position -2, which is outside the plot's range (so we don't see Jan)
  • the second tick label (Feb) is drawn at position 0 (the leftmost)
  • the third tick label (Mar) is drawn at position 2
  • and so on
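The pairing described in the list above can be reproduced without plotting anything. This sketch simply zips the tick positions from Out[22] with the month labels; the 0 to 11 visible range is an assumption based on the data having 12 rows:

```python
positions = [-2., 0., 2., 4., 6., 8., 10., 12.]
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# labels are handed out to tick positions in order; zip stops at the shorter list
pairs = list(zip(positions, months))

# only ticks that fall within the plotted data (x from 0 to 11) can be seen
visible = [month for pos, month in pairs if 0 <= pos <= 11]
print(pairs)
print(visible)
```

The visible labels run from Feb to Jul, which matches the "half a year" we saw in the bad plot.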

Fortunately, we can set the tick positions explicitly. The only correct configuration in this case is 0, 1, 2, 3, ...

In [23]:
ax = df.plot.line()
ax.set_xticks([0, 1, 2, 3])
ax.set_xticklabels(df.index[:4]) # one label per tick position
None

That only labels the first four months. If we want to count from 0 to 11, we can use range(len(df.index)).

In [24]:
ax = df.plot.line()
ax.set_xticks(range(len(df.index)))
ax.set_xticklabels(df.index)
None

This plot is correct, but crowded! There are two solutions: (1) make the plot wider or (2) rotate the labels. We'll demo both. We'll also add back the axis labels.

In [25]:
# approach 1: wider plot
ax = df.plot.line(figsize=(8,4)) # this is the (width,height)
ax.set_xticks(range(len(df.index)))
ax.set_xticklabels(df.index)
ax.set_xlabel('Month')
ax.set_ylabel('Temp (Fahrenheit)')
None
In [26]:
# approach 2: rotate ticks
ax = df.plot.line()
ax.set_xticks(range(len(df.index)))
ax.set_xticklabels(df.index, rotation=90) # 90 is in degrees
ax.set_xlabel('Month')
ax.set_ylabel('Temp (Fahrenheit)')
None
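One way to confirm that tick positions and labels line up is to read them back from the axes object. This is a sketch rather than a required step; the Agg backend line is only there so the code also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this in a notebook using %matplotlib inline
import pandas as pd

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
df = pd.DataFrame({
    "high": [26, 31, 43, 57, 68, 78, 82, 79, 72, 59, 44, 30],
    "low": [11, 15, 25, 36, 46, 56, 61, 59, 50, 39, 28, 16]
}, index=months)

ax = df.plot.line()
ax.set_xticks(range(len(df.index)))         # positions 0 through 11
ax.set_xticklabels(df.index, rotation=90)   # one label per position

# read the configuration back to check the pairing
positions = [int(p) for p in ax.get_xticks()]
labels = [t.get_text() for t in ax.get_xticklabels()]
print(list(zip(positions, labels)))
```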

Bar Plots

Just like a line plot, bar plots can be created from either a Pandas Series or DataFrame. For our example data, let's learn a bit about the fire hydrants around the city of Madison. Data describing each fire hydrant can be found at http://data-cityofmadison.opendata.arcgis.com/datasets/54c4877f16084409849ebd5385e2ee27_6. We have already downloaded the data to a file named "Fire_Hydrants.csv". Let's read it and preview a few rows.

In [27]:
df = pd.read_csv('Fire_Hydrants.csv')
df.head()
Out[27]:
X Y OBJECTID CreatedBy CreatedDate LastEditor LastUpdate FacilityID DataSource ProjectNumber ... Elevation Manufacturer Style year_manufactured BarrelDiameter SeatDiameter Comments nozzle_color MaintainedBy InstallType
0 -89.519573 43.049308 2536 NaN NaN WUJAG 2018-06-07T19:45:53.000Z HYDR-2360-2 FASB NaN ... 1138.0 NaN Pacer 1996.0 5.0 NaN NaN blue MADISON WATER UTILITY NaN
1 -89.521988 43.049193 2537 NaN NaN WUJAG 2018-06-07T19:45:53.000Z HYDR-2360-4 FASB NaN ... 1170.0 NaN Pacer 1995.0 5.0 NaN NaN blue MADISON WATER UTILITY NaN
2 -89.522093 43.048233 2538 NaN NaN WUJAG 2018-06-07T19:45:53.000Z HYDR-2361-19 FASB NaN ... 1179.0 NaN Pacer 1996.0 5.0 NaN NaN blue MADISON WATER UTILITY NaN
3 -89.521013 43.049033 2539 NaN NaN WUJAG 2018-06-07T19:45:53.000Z HYDR-2360-3 FASB NaN ... 1163.0 NaN Pacer 1995.0 5.0 NaN NaN blue MADISON WATER UTILITY NaN
4 -89.524782 43.056263 2540 NaN NaN WUPTB 2017-08-31T16:19:46.000Z HYDR-2257-5 NaN NaN ... 1065.0 NaN Pacer 1996.0 5.0 NaN NaN blue MADISON WATER UTILITY NaN

5 rows × 25 columns

For our first example, let's see what nozzle colors are most common. We can get a Series summarizing the data by first extracting the nozzle_color column, then using the Series.value_counts() function to produce a summary Series.
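Before trying it on the real data, here is what value_counts does on a tiny made-up Series (the colors below are hypothetical, not from the hydrant file). Each distinct value becomes an index label, its count becomes the value, and missing entries are left out by default:

```python
import pandas as pd

colors = pd.Series(["blue", "green", "blue", None, "orange", "blue"])

# the result is a new Series, sorted from most common to least common
counts = colors.value_counts()
print(counts)
```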

In [28]:
df['nozzle_color'].head()
Out[28]:
0    blue
1    blue
2    blue
3    blue
4    blue
Name: nozzle_color, dtype: object
In [29]:
df['nozzle_color'].value_counts()
Out[29]:
blue      5810
Blue      1148
Green      320
Orange      74
BLUE        45
green        9
Red          9
orange       4
GREEN        1
white        1
C            1
ORANGE       1
Name: nozzle_color, dtype: int64

The above data means, for example, that there are 5810 "blue" nozzles and 1148 "Blue" nozzles. We can already see there is a lot of blue, but we would really like a total count, not confused by whether the letters are upper or lower case.

In [30]:
df['nozzle_color'].str.upper().value_counts()
Out[30]:
BLUE      7003
GREEN      330
ORANGE      79
RED          9
WHITE        1
C            1
Name: nozzle_color, dtype: int64

Great! It's not clear what "C" means, but the data is clean enough. Let's plot it with Series.plot.bar.

In [31]:
counts = df['nozzle_color'].str.upper().value_counts()
counts.plot.bar()
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x11862eb00>

Is the data reasonable? Try to notice next time you're walking by a hydrant. Consider it a challenge to spot a green nozzle (bonus points for orange!).

For our second question, let's create a similar plot that tells us which hydrant models are most common. The model is represented by the Style column in the table. The following code is a copy/paste of above, just replacing "nozzle_color" with "Style":

In [32]:
counts = df['Style'].str.upper().value_counts()
counts.plot.bar()
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x1187a0588>

Whoa! That's way too much data. Let's just consider the top 10 models.

In [33]:
top10 = counts[:10]
top10
Out[33]:
PACER             3620
M-3               1251
MUELLER           1243
WB-59              664
K-11               351
K-81               162
W-59               151
CLOW 2500          123
CLOW MEDALLION      70
CLOW                50
Name: Style, dtype: int64

How many others are not in the top 10? We should show that in our results too.

In [34]:
others = sum(counts[10:])
top10["others"] = others
top10
Out[34]:
PACER             3620
M-3               1251
MUELLER           1243
WB-59              664
K-11               351
K-81               162
W-59               151
CLOW 2500          123
CLOW MEDALLION      70
CLOW                50
others             229
Name: Style, dtype: int64

Now that looks like what we want to plot.

In [35]:
top10.plot.bar()
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x118cc55c0>

Nice! This shows us what we want. We see Pacer is easily the most common. Some of the longer labels are hard to read vertically, so we also have the option to use .barh instead of .bar to get horizontal bars.

In [36]:
top10.plot.barh()
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x117b7a9e8>

I wonder what is up with all those Pacer hydrants? Have they always been so popular with the city? Turns out we can find out, because we also have a column called year_manufactured.

Let's find all the rows for Pacer hydrants and extract the year.

In [37]:
pacer_years = df[df['Style'] == 'Pacer']['year_manufactured']
pacer_years.head()
Out[37]:
0    1996.0
1    1995.0
2    1996.0
3    1995.0
4    1996.0
Name: year_manufactured, dtype: float64

Let's round down to the decade. We can do that by dividing by 10 (integer division), then multiplying by 10 again.
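The arithmetic is easiest to see on a single number first. Integer division drops the last digit, and multiplying by 10 puts a zero back in its place:

```python
year = 1996

tens = year // 10    # integer division drops the last digit
decade = tens * 10   # multiplying back gives the start of the decade

# the same expression works on floats, which is what the year column contains
decade_float = 1996.0 // 10 * 10

print(tens, decade, decade_float)
```

Because // and * have the same precedence and are evaluated left to right, year // 10 * 10 needs no parentheses.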

In [38]:
pacer_decades = pacer_years // 10 * 10
pacer_decades.head()
Out[38]:
0    1990.0
1    1990.0
2    1990.0
3    1990.0
4    1990.0
Name: year_manufactured, dtype: float64

How many Pacers were there each decade?

In [39]:
pacer_decades.value_counts()
Out[39]:
2000.0    1730
1990.0     846
2010.0     503
1980.0      21
1960.0       1
Name: year_manufactured, dtype: int64

Let's do the same thing in one step for non-Pacers. That is, we'll identify non-Pacers, extract the year, round down to the decade, and then count how many entries there are per decade.

In [40]:
other_decades = df[df['Style'] != 'Pacer']['year_manufactured'] // 10 * 10
other_decades.value_counts()
Out[40]:
2010.0    1196
1980.0     937
1970.0     578
1990.0     431
1950.0     371
1960.0     349
2000.0     215
1940.0      68
1930.0       9
1900.0       1
Name: year_manufactured, dtype: int64

Let's line up these two Series side-by-side in a DataFrame:
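When a DataFrame is built from a dict of Series, the rows are matched up by index label rather than by position, and a label missing from one Series shows up as NaN in that column. A small sketch with made-up counts (the names a and b are only for illustration):

```python
import pandas as pd

a = pd.Series({1990: 5, 2000: 7})
b = pd.Series({1980: 2, 1990: 3})

# rows are aligned on the index labels of both Series
both = pd.DataFrame({"a": a, "b": b})
print(both)
```

A column that picks up a NaN is stored as floats, since NaN is a float value.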

In [41]:
pacer_df = DataFrame({
    "pacer":pacer_decades.value_counts(), 
    "other":other_decades.value_counts()
})
pacer_df
Out[41]:
         pacer  other
1900.0     NaN      1
1930.0     NaN      9
1940.0     NaN     68
1950.0     NaN    371
1960.0     1.0    349
1970.0     NaN    578
1980.0    21.0    937
1990.0   846.0    431
2000.0  1730.0    215
2010.0   503.0   1196

That looks plottable!

In [42]:
pacer_df.plot.bar()
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x1179112e8>

That plot shows that the city started getting Pacers in earnest in the 1990s. Most were from the 2000s, and it seems there is finally a shift to other styles.

While this plot is fine, when multiple bars represent a breakdown of a total amount, it's more intuitive to stack the bars on top of each other. This is easy with the stacked argument.

In [43]:
pacer_df.plot.bar(stacked=True)
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x117838390>

This data supports all the same conclusions as before, and now one more thing is obvious: although there was steady growth in the number of hydrants over several decades, things seem to have leveled off more recently. Why? Further probing of the data might provide an answer. One explanation is that the 2000s decade contains 10 years, but we still have a couple of years left in the 2010s. Perhaps this decade will still catch up.
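The height of each stacked bar is the row total, which we can also compute directly. This sketch retypes the counts from Out[41] by hand rather than reading the CSV:

```python
import pandas as pd

nan = float("nan")
pacer_df = pd.DataFrame({
    "pacer": [nan, nan, nan, nan, 1, nan, 21, 846, 1730, 503],
    "other": [1, 9, 68, 371, 349, 578, 937, 431, 215, 1196]
}, index=[1900.0, 1930.0, 1940.0, 1950.0, 1960.0,
          1970.0, 1980.0, 1990.0, 2000.0, 2010.0])

# axis=1 sums across the columns of each row; NaN values are skipped by default
totals = pacer_df.sum(axis=1)
print(totals)
```

The totals rise through the 2000s and then dip for the 2010s, which is the leveling off visible in the stacked plot.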

Conclusion

After this reading, you should now be ready to create four types of plots: pie charts, scatter plots, line plots, and bar plots.

We saw that both line and bar plots can be created from either a single Series or a DataFrame. When created from a single Series, we end up with either a single line (for a line plot) or one set of bars (for a bar plot).

When we create from a DataFrame, we get multiple lines (one per column) for a line plot. And for a bar plot, we get multiple sets of bars. We can control whether those bars are vertical (with .bar) or horizontal (with .barh), as well as whether the bars are stacked or side-by-side.