You will sometimes find yourself in a scenario where the standard plots are not a good fit for what you're doing. Fortunately, data visualizations are usually combinations of rectangles, circles, lines, and text; learning how to draw those at specific coordinates is very powerful. In this reading, we'll just focus on text. The challenge will be the coordinates, as there are multiple simultaneous coordinate systems in any figure.
As a concrete example, we'll try to reconstruct a very elegant custom plot posted to the /r/dataisbeautiful subreddit:
The original plot looks like this:
from IPython.display import Image
Image("reddit.png")
The order of the subplots tells you which letters are most common in English words ("E" is most common, representing 10.98% of occurrences). The bars in each subplot tell you where those letters occur within the word. Y is almost always at the end of a word; J is almost always at the beginning.
The version we'll create by the end of the reading will look like this:
Image("reading.png")
Before we try plotting, we want to build a table with one row per letter. Each of eleven columns will represent a location within a word, from 0 (the first letter) to 1 (the last letter). For example, the 0.5 column of the A row will tell us how many times the letter A appears in the middle of a word.
We can pull the capital letters from the alphabet using Python's built-in string module:
from string import ascii_letters
print(ascii_letters)
caps = list(ascii_letters[26:])
caps[:10]
Let's start with an empty table with the desired rows and columns:
import pandas as pd
df = pd.DataFrame(0, index=caps, columns=[0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
df.head()
On Linux, the "/usr/share/dict/words" file usually contains a list of English words. We can use the head command to see the first few lines. Putting ! in front of a command in Jupyter runs it as a shell command (not as Python code).
!head /usr/share/dict/words
If a letter is at index i within a string word, then this will give us the position of that letter in the range 0 (beginning) to 1 (end), rounded to the nearest tenth: round(i / (len(word) - 1), ndigits=1). Let's test that with a simple example.
word = "ABC"
i = 0 # at what percent through the word does A occur?
print(round(i / (len(word) - 1), ndigits=1))
i = 1 # at what percent through the word does B occur?
print(round(i / (len(word) - 1), ndigits=1))
i = 2 # at what percent through the word does C occur?
print(round(i / (len(word) - 1), ndigits=1))
OK, let's now loop over every letter of every word in the English dictionary, counting it in our big table as we go (df.at[letter, position] += 1).
with open("/usr/share/dict/words") as f:
    for word in f:
        word = word.strip().upper()
        if len(word) == 1:
            continue  # a one-letter word would make us divide by zero below
        for i in range(len(word)):
            letter = word[i]
            position = round(i / (len(word) - 1) * 10) / 10
            if letter not in caps:
                continue  # skip apostrophes and other non-A-Z characters
            df.at[letter, position] += 1
df.head()
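If "/usr/share/dict/words" is not available on your system, the same counting logic can be tried on a small in-memory list. The following is only a sketch: count_positions and the toy word list are hypothetical names and data introduced for illustration, not part of the original notebook.

```python
from string import ascii_uppercase

import pandas as pd


def count_positions(words):
    """Count how often each letter occurs at each relative position (0 to 1, in tenths)."""
    columns = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    table = pd.DataFrame(0, index=list(ascii_uppercase), columns=columns)
    for word in words:
        word = word.strip().upper()
        if len(word) < 2:
            continue  # one-letter (or empty) words have no meaningful relative position
        for i, letter in enumerate(word):
            if letter not in table.index:
                continue  # ignore apostrophes and other non-A-Z characters
            position = round(i / (len(word) - 1) * 10) / 10
            table.at[letter, position] += 1
    return table


# Hypothetical toy input, standing in for the dictionary file.
toy = count_positions(["abc", "cab", "a", "it's"])
print(toy.loc[["A", "B", "C", "T"]])
```

Here "a" is skipped because of its length, and the apostrophe in "it's" is ignored while still counting toward the word's length.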
Let's sum horizontally along each row to determine the total number of occurrences of each letter. We can divide this by the total count and sort to rank the letters by frequency.
counts = df.sum(axis=1)
percents = counts / counts.sum() * 100
percents = percents.sort_values(ascending=False)
percents.head()
Eventually, we'll create a grid of subplots like the following, using the first 26 subplots for the 26 letters:
%matplotlib inline
import matplotlib.pyplot as plt
fig, axes = plt.subplots(nrows=5, ncols=6, figsize=(12, 6))
For now, let's think about how to plot one letter in one AxesSubplot.
df.head()
df.loc[SOME_LETTER] pulls out a row of the big table (corresponding to the given letter) as a Series, which is easy enough to plot in a given ax.
A = df.loc["A"]
A
fig, ax = plt.subplots(figsize=(3,2))
A.plot.bar(ax=ax, color="0.85") # 0.85 is light gray (0 is black, 1 is white)
If we like, we can add some text using ax.text(x, y, text).
fig, ax = plt.subplots(figsize=(3,2))
A.plot.bar(ax=ax, color="0.85")
ax.text(3, 16000, "hi", transform=ax.transData, size=20, color="red")
The transform=ax.transData argument (which would have been the default had we omitted it) tells matplotlib to interpret 3 and 16000 in terms of the scale of the data. 16000 makes sense by looking at the y-axis, but why is it 3 instead of 0.3? Matplotlib is treating the x-axis like a category, even though it looks numeric, as we can see by looking at xlim:
ax.get_xlim()
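The integer slots can also be checked by asking the axes where its ticks are. This is a minimal sketch using a small made-up Series (demo is a hypothetical name with illustrative values, not data from the notebook):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts with numeric-looking labels, similar to one row of df.
demo = pd.Series([5, 8, 3], index=[0.0, 0.1, 0.2])

fig, ax = plt.subplots(figsize=(3, 2))
demo.plot.bar(ax=ax, color="0.85")

# The bars are drawn at integer slots; the index values only appear as tick labels.
tick_positions = [float(t) for t in ax.get_xticks()]
tick_labels = [label.get_text() for label in ax.get_xticklabels()]
print(tick_positions)
print(tick_labels)
```

So in data coordinates, the bar for the fourth category sits at x=3, no matter what its label says.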
ax.transAxes is a different coordinate system. The bottom left of the subplot is x=0, y=0; the top right is x=1, y=1. This is convenient, say, if we want to center the text (note that by default it is the bottom-left corner of the text that lands at the center).
fig, ax = plt.subplots(figsize=(3,2))
A.plot.bar(ax=ax, color="0.85")
ax.text(0.5, 0.5, "hi", transform=ax.transAxes, size=20, color="red")
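To see how the two coordinate systems relate, we can push a point through the transforms by hand. This sketch assumes simple linear axes with limits chosen for illustration; transform objects convert into display (pixel) coordinates, and inverted() goes the other way.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(3, 2))
ax.set_xlim(0, 10)    # illustrative limits, not taken from the letter data
ax.set_ylim(0, 100)

# Data coordinates -> display (pixel) coordinates.
pixels = ax.transData.transform((5, 50))

# Display coordinates -> axes-fraction coordinates.
fraction = ax.transAxes.inverted().transform(pixels)
print(pixels)
print(fraction)
```

With these limits, the midpoint of the data range should come out at roughly (0.5, 0.5) in the ax.transAxes system, which is the same spot we used to center the text.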
We can specify a negative x coord in the ax.transAxes coordinate system to make the text float to the left of the subplot. Notice now that we're specifying the coordinates for the center (rather than the bottom left) by specifying alignment.
fig, ax = plt.subplots(figsize=(3,2))
A.plot.bar(ax=ax, color="0.85")
ax.text(-0.4, 0.5, "A", size=32,
        verticalalignment="center", horizontalalignment="center",
        transform=ax.transAxes)
Matplotlib has many colormaps that allow us to convert a quantity to a point on a color bar: https://matplotlib.org/stable/gallery/color/colormap_reference.html. Colors are represented as RGBA tuples (red, green, blue, alpha), each value between 0 and 1.
print(plt.cm.cool(0.0))
print(plt.cm.cool(0.5))
print(plt.cm.cool(1.0)) # pass floats: an int such as 1 is read as a lookup-table index, not the end of the range
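One detail worth knowing when calling a colormap: floats are interpreted as a fraction between 0.0 and 1.0, while ints are interpreted as an index into the colormap's internal lookup table. The short sketch below compares the two for the cool colormap.

```python
import matplotlib.pyplot as plt

cmap = plt.cm.cool

as_float = cmap(1.0)  # float: the far end of the colormap
as_int = cmap(1)      # int: entry number 1 of the lookup table, close to the start

print(as_float)
print(as_int)
print(cmap.N)         # number of entries in the lookup table
```

When converting data values to colors, it is safest to make sure the value passed in is a float.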
We can use these colors for our text.
fig, ax = plt.subplots(figsize=(3,2))
A.plot.bar(ax=ax, color="0.85")
ax.text(-0.4, 0.3, "A", size=32, transform=ax.transAxes, color=plt.cm.cool(0))
ax.text(-0.4, 0.5, "B", size=32, transform=ax.transAxes, color=plt.cm.cool(0.5))
ax.text(-0.4, 0.7, "C", size=32, transform=ax.transAxes, color=plt.cm.cool(1.0))
We can use these colors to indicate something about the data (like letter frequency).
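As a rough sketch of that idea, we can scale each value by the largest one so that everything falls between 0 and 1, then look up a color. The frequencies below are made-up illustrative numbers, and freqs and colors are hypothetical names, not part of the notebook.

```python
import matplotlib.pyplot as plt

# Hypothetical letter frequencies in percent (illustrative values only).
freqs = {"E": 11.0, "T": 7.0, "Z": 0.5}

largest = max(freqs.values())

# Dividing by the largest value gives a float in (0, 1] for each letter.
colors = {letter: plt.cm.cool(value / largest) for letter, value in freqs.items()}

for letter, rgba in colors.items():
    print(letter, rgba)
```

The most frequent letter lands at the far end of the colormap, and rarer letters fall closer to the start.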
We've seen how we can pull the info for a given letter necessary to create a bar plot showing the frequency of different locations. We've also seen how we can use ax.text and the ax.transAxes coordinate system to add some text to the left of a subplot. We're going to need to create 26 such subplots, so let's put together all that we've done so far into a plot_letter function. We pass in letter, and the function directly grabs the relevant data from the df and percents we built earlier in the notebook. We also pass in ax so that the function knows where to plot.
def plot_letter(letter, ax):
    # overall frequency of this letter, mapped to a color
    percent = percents.at[letter]
    color = plt.cm.cool(percent / percents.max())
    # where in the word this letter occurs, as percentages
    positions = df.loc[letter]
    positions = positions / positions.sum() * 100
    positions = positions[:10]
    # big letter and its frequency, floating to the left of the subplot
    ax.text(-0.4, 0.5, letter, size=32,
            verticalalignment="center", horizontalalignment="center",
            transform=ax.transAxes, color=color)
    ax.text(-0.4, 0.1, str(round(percent, 1)) + "%", size=12,
            horizontalalignment="center", verticalalignment="center",
            transform=ax.transAxes)
    positions.plot.bar(ax=ax, width=1, color="0.7")
fig, axes = plt.subplots(ncols=2, figsize=(6, 2))
plt.subplots_adjust(wspace=1)
plot_letter("H", axes[0])
plot_letter("I", axes[1])
We see we can use our plot_letter function to plot "H" and "I" in two subplots. Let's try the whole alphabet. Also, we can now get rid of the axes/borders using ax.axis("off") for a cleaner plot.
fig, axes = plt.subplots(nrows=5, ncols=6, figsize=(12, 6))
plt.subplots_adjust(wspace=1)
axes = list(axes.reshape(-1))
for i in range(len(percents)):
    letter = percents.index[i]
    plot_letter(letter, axes[i])
for ax in axes:
    ax.axis("off")
Finally, let's save our work.
fig.savefig("reading.png", bbox_inches="tight")