Ok, enough with graph searching! In this reading (notebook here), we'll write some code that makes a reasonable guess about what language (English, Spanish, etc.) a given string is in -- something like this module: https://pypi.org/project/langdetect/.
There are many ways we might approach this problem. We'll look at the order in which letters appear. For example, the letters "a" and "o" are more likely to be at the end of a word in Spanish than they are in English.
We can model the transitions between letters in a language with a Markov chain. A Markov chain is a graph where nodes represent states and weighted edges represent the probability of transitioning between states. Our Markov chain will look like this:
We will construct our language models (Markov chains) based on sample data, pulled from Wikipedia articles in multiple languages.
Given an example string, we can compute the likelihood that a given language model would generate that string, if we were to use the model to randomly generate a string. Of course, the probability of generating a particular string of any significant size is very tiny. However, we can compare the tiny likelihoods of two models to determine which language is more likely. This will require us to deal with extremely small floats. When they get too small (meaning the computer rounds them to zero), we'll learn a trick, log likelihood, to deal with it.
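Here's a minimal sketch of those two ideas, using a made-up chain over a toy three-letter alphabet (the probabilities below are invented purely for illustration; our real models will learn them from data):

# a toy Markov chain: chain[src][dst] is the probability that dst follows src
chain = {
    "a": {"a": 0.1, "b": 0.6, "c": 0.3},
    "b": {"a": 0.5, "b": 0.2, "c": 0.3},
    "c": {"a": 0.7, "b": 0.2, "c": 0.1},
}

def toy_likelihood(text):
    # multiply the probability of each transition in the string
    p = 1
    for i in range(len(text) - 1):
        p *= chain[text[i]][text[i+1]]
    return p

print(toy_likelihood("abc"))      # 0.6 * 0.3 = 0.18
print(toy_likelihood("abcabca"))  # longer string, much smaller likelihood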
The kinds of graphs we've been generating, where each node has a list of children (these are called sparse graphs), are not efficient when there is an edge between every pair of nodes. In this example, we'll learn to create a dense graph. We'll represent edges with a big table (DataFrame) -- there will be a column for each node and a row for each node. The number in the cell at row A and column B represents the weight on the edge from node A to node B.
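For contrast, here's a small sketch (not part of the code we're about to write) showing the same tiny weighted graph stored sparsely, as per-node edge collections, and densely, as one table with a row and a column per node:

import pandas as pd

# sparse: each node keeps its own collection of out-edges
sparse = {
    "A": {"B": 3, "C": 1},
    "B": {},
    "C": {"A": 2},
}

# dense: one table; the cell at row A, col B is the weight of edge A -> B
nodes = ["A", "B", "C"]
dense = pd.DataFrame(0, index=nodes, columns=nodes)
dense.at["A", "B"] = 3
dense.at["A", "C"] = 1
dense.at["C", "A"] = 2
print(dense)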
from bs4 import BeautifulSoup
import requests
import pandas as pd
import copy, string, os
We first need to implement a dense graph, which we'll then use to build our language models. Look at the following code, then read the descriptions below, corresponding to the commented lines.
class dense_graph:
    def __init__(self, nodes):
        self.nodes = sorted(nodes)
        self.nodes.append("?") # A
        self.node_set = set(nodes)
        self.edges = pd.DataFrame(index=self.nodes, columns=self.nodes)
        self.edges.fillna(1, inplace=True) # B

    def _repr_html_(self):
        with pd.option_context('display.max_columns', None): # C
            return self.edges._repr_html_()

    def inc(self, src, dst):
        if not src in self.node_set:
            src = "?"
        if not dst in self.node_set:
            dst = "?"
        self.edges.at[src, dst] += 1

    def get_edge(self, src, dst):
        if not src in self.node_set:
            src = "?"
        if not dst in self.node_set:
            dst = "?"
        return self.edges.at[src, dst] # D

    def to_prob(self):
        # E
        g = copy.deepcopy(self)
        row_sums = g.edges.sum(axis=1)
        g.edges = g.edges.div(row_sums, axis=0)
        return g

g = dense_graph("ABC") # F
A) to keep our model simple, we'll only have a small set of nodes (the 26 English letters, which also appear frequently in the other languages we'll model, plus the space, period, and comma). We'll use "?" as a catch-all node for punctuation and other characters.
B) the cell at row A, col B represents how many times letter B comes after letter A. You might expect we would count starting from zero, but starting at one will actually help later, as we don't have enough data to say something never happens (details here for the curious: https://en.wikipedia.org/wiki/Additive_smoothing#Pseudocount)
C) when displaying the graph, we'll just show the underlying table of edge weights. This context manager lets us make sure the columns are not hidden by Jupyter to save space.
D) .at works like .loc when you only want to access one cell (but it's faster!)
E) we'll create a normalized table where the values in each row correspond to probabilities that add to 1. Remember that axis=0 goes down and axis=1 goes across. So we're computing sums across (axis=1), to get a sum per row. Then, we want to orient that series of sums vertically (axis=0) so we can line up a row per sum in the division (see the small demo after these notes).
F) normally we would pass in a list of nodes, but the constructor immediately calls sorted on whatever we pass in, so we can pass in any sequence -- so here, we'll have 3 nodes: A, B, C
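Here's a small demo of notes D and E on a made-up 2x2 table (separate from our class): .at reads a single cell just like .loc, and dividing by the row sums (with axis=0, so the sums line up against the rows) makes every row add to 1.

import pandas as pd

demo = pd.DataFrame([[1, 3], [2, 2]], index=["A", "B"], columns=["A", "B"])
print(demo.at["A", "B"], demo.loc["A", "B"])  # same cell, .at is just faster

row_sums = demo.sum(axis=1)            # sum across each row: A -> 4, B -> 4
print(demo.div(row_sums, axis=0))      # each row now sums to 1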
Let's use the increment method (inc) to increase the weights on some edges:
g.inc("A", "B")
g.inc("A", "B")
g.inc("A", "B")
g.inc("A", "C")
g.inc("B", "D")
g.inc("E", "A")
g
If we look in the "A" row, we see the counts add up to 8 (remember that every cell started at 1); 4 of those counts are in the "B" column, because "A" was followed by a "B" three times. Let's normalize:
probs = g.to_prob()
probs
The cell at row A, col B tells us that 50% of the time, the letter after "A" is a "B". The get_edge method is a convenient way to look this up.
probs.get_edge("A", "B")
We'll download and extract the text of 7 Wikipedia articles:
urls = {
    "english": "https://en.wikipedia.org/wiki/Python_(programming_language)",
    "spanish": "https://es.wikipedia.org/wiki/Python",
    "german": "https://de.wikipedia.org/wiki/Python_(Programmiersprache)",
    "french": "https://fr.wikipedia.org/wiki/Python_(langage)",
    "italian": "https://it.wikipedia.org/wiki/Python",
    "english-test": "https://en.wikipedia.org/wiki/Giant_panda",
    "spanish-test": "https://es.wikipedia.org/wiki/Ailuropoda_melanoleuca",
}
texts = {}
for lang, url in urls.items():
    path = lang + ".txt"
    # have we downloaded it before?
    if not os.path.exists(path):
        r = requests.get(url)
        r.raise_for_status()
        page = BeautifulSoup(r.text, "html.parser")
        with open(path, "w") as f:
            f.write(page.get_text())

    # for simplicity, strip out everything except lower case
    # English letters, spaces, periods, and commas
    with open(path) as f:
        valid = string.ascii_lowercase + " .,"
        text = []
        for c in f.read().lower():
            if c in valid:
                text.append(c)
            else:
                text.append("?")
        texts[lang] = "".join(text)
Let's take a look at the first two files. Notice that we're also accidentally grabbing some English-like text from the web page code in both cases. That will confuse our models a bit, but hopefully not too much!
print(texts["english"][:5000])
print(texts["spanish"][:5000])
Let's write a LangProfile class to model the letter transitions in a language. We'll train the model with some example input text in the constructor, to determine the transition probabilities.
class LangProfile:
    def __init__(self, name, text):
        self.name = name
        g = dense_graph(valid)
        for i in range(len(text)-1):
            g.inc(text[i], text[i+1])
        self.graph = g.to_prob()

    def prob(self, text):
        p = 1
        for i in range(len(text)-1):
            p *= self.graph.get_edge(text[i], text[i+1])
        return p
english = LangProfile("english", texts["english"])
spanish = LangProfile("spanish", texts["spanish"])
spanish.graph.edges.iloc[:8,:8]
The above table tells us that in Spanish, 10.4% of words start with a "d" (the frequency of that letter after a space), whereas only 1.6% start with a "b".
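We can pull those numbers straight out of the model with get_edge (the exact values depend on the article text we happened to download):

# probability of "d" (or "b") appearing right after a space, i.e. starting a word
spanish.graph.get_edge(" ", "d"), spanish.graph.get_edge(" ", "b")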
We can notice significant differences between the languages. For example, only about 6% of "o" or "a" appearances in an English sentence end a word (i.e., they are followed by a space). For Spanish, it is about 20%.
print("English O/A ending:", english.prob("o "), english.prob("a "))
print("Spanish O/A ending:", spanish.prob("o "), spanish.prob("a "))
We can also use the prob method to compute the odds that our (extremely simplified) models of each language would generate a given word, if we took the starting letter then kept randomly appending based on the edge weights.
How likely is the English model to generate the word "house"? The Spanish model?
english.prob("house"), spanish.prob("house")
We can see that the word "house" fits the English model better -- good! Let's try the Spanish word for the same:
english.prob("casa"), spanish.prob("casa")
Great, that one fits Spanish better! (Note the scientific notation: the English model's number is very small.)
Don't be concerned that the numbers are very small. If we're generating a random word from one of these models, there are millions of strings we could come up with, so any particular one gets a tiny probability. So it's OK that "casa" is small in the Spanish model. The interesting thing is that the Spanish model gives the bigger number, even though both are very small.
The longer the strings are, the smaller the likelihoods we'll get:
print(english.prob("this is an example of a sentence in english, can we detect that?"))
print(spanish.prob("this is an example of a sentence in english, can we detect that?"))
Take a close look, those are extremely small numbers in scientific notation! Let's generate a slightly longer string and try that:
long_str = "this is a sentence. " * 20
print(long_str)
print(english.prob(long_str))
print(spanish.prob(long_str))
The prob function only works for short strings. For longer strings, the likelihoods get so small that they get rounded to zero.
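This is floating-point underflow: keep multiplying values less than 1 together and eventually the product drops below the smallest positive float, at which point it rounds to 0.0. A quick illustration:

tiny = 1e-200
print(tiny * tiny)  # mathematically 1e-400, but that's too small for a float, so we get 0.0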
One good way to deal with very small or very large numbers is to take the log of them. Let's do that!
You may have heard that the base-10 log of an integer greater than 1 is approximately the number of digits in the number (with error at most 1).
Here's an easy-to-remember approximation for small numbers too:
from math import log10
def loggy(x):
    if x > 1:
        return len(str(int(x)))
    else:
        return -len(str(int(1/x)))

val = 0.0001
for i in range(20):
    print(val, log10(val), loggy(val))
    val *= 2
As you can see, the error is never more than 1 (compare the loggy approximation with the actual log10). More importantly, thinking of loggy will help us intuit various log rules.
For example, if you multiply a 10 digit integer by a 20 digit integer (both positive), you can probably guess that the result will be about 30 digits. This is the intuition behind
$\log_{10}(X \cdot Y) = \log_{10}(X) + \log_{10}(Y)$
Let's see this by multiplying a 16 digit number by an 11 digit number:
A = 1259061235607506
B = 12498123469
# approximate
print(loggy(A), loggy(B))
print(loggy(A * B))
print(loggy(A) + loggy(B))
# actual calculation
print(log10(A), log10(B))
print(log10(A * B))
print(log10(A) + log10(B))
Although the intuition may be less obvious for very small positive numbers, the multiplication rule still holds:
$\log_{10}(X \cdot Y) = \log_{10}(X) + \log_{10}(Y)$
Before, we ran into trouble because we multiplied so many probabilities less than 1 together that the likelihood ultimately rounded to 0, but now we can use this log rule to compute the log likelihood instead. This lets us add the logs of all the individual probabilities together to get the final log likelihood.
How is the log of the likelihood useful? Well, if we compute it for two models, we can figure out which one is more likely to generate the string in question. We never cared about a precise likelihood calculation anyway; we only wanted to make this comparison to choose the language model that best explains the text.
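The reason this comparison is safe: log10 is an increasing function, so whichever model has the bigger likelihood also has the bigger log likelihood. A quick check with made-up numbers:

from math import log10

p1, p2 = 1e-40, 1e-35          # made-up likelihoods for two models
print(p2 > p1)                 # True
print(log10(p2) > log10(p1))   # also True: -35 > -40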
Let's monkey patch in the method to compute the log likelihood:
def log_prob(self, text):
    # p = 1
    logp = 0 # log10(1)
    for i in range(len(text)-1):
        # p *= self.graph.get_edge(text[i], text[i+1])
        logp += log10(self.graph.get_edge(text[i], text[i+1]))
    return logp

LangProfile.log_prob = log_prob
As a test, let's make sure that taking the log of our old result (with prob) gives us the same answer as the new approach, where we sum the logs of each individual probability:
english = LangProfile("english", texts["english"])
spanish = LangProfile("spanish", texts["spanish"])
print(log10(english.prob("house")))
print(english.log_prob("house"))
Yay, it works! In general, we'll always see negative numbers for the log likelihood, because all the likelihoods will be less than 1. A log likelihood of -2 means the model is a better fit than if the log likelihood were -3.
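To keep the direction straight, remember that a log likelihood of -2 corresponds to a probability of 10**-2, which is ten times bigger than 10**-3:

print(10 ** -2, 10 ** -3)  # 0.01 vs 0.001 -- the larger (less negative) log likelihood wins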
Does this fix our earlier problem, when we were rounding to zero?
print(long_str)
print(english.prob(long_str))
print(spanish.prob(long_str))
print(english.log_prob(long_str))
print(spanish.log_prob(long_str))
Yes it does! Now we can tell the English model is more likely to produce a string like that than the Spanish model.
So far, we've built a dense graph, then used that to build a Markov chain-based model, which we can train per language. These models can tell us the likelihood of producing a given string.
The last part is to get the likelihood for each language model for a given string, then predict the language of the string based on whichever model gives the biggest likelihood. Let's do that with a LangPredictor class:
class LangPredictor:
    def __init__(self, profiles):
        self.profiles = profiles

    def predict(self, line):
        profile = max(self.profiles, key=lambda profile: profile.log_prob(line))
        return profile.name

    def percents(self, lines):
        counts = {p.name: 0 for p in self.profiles}
        for line in lines:
            counts[self.predict(line)] += 1
        for k in counts:
            counts[k] /= len(lines)
        return counts
p = LangPredictor([
    LangProfile("english", texts["english"]),
    LangProfile("spanish", texts["spanish"]),
    LangProfile("french", texts["french"]),
    LangProfile("italian", texts["italian"]),
    LangProfile("german", texts["german"]),
])
Let's try it for some simple strings:
p.predict("hello friends!")
p.predict("hola amigos!")
Let's do a more comprehensive test. We'll take the test inputs (the English and Spanish Wikipedia pages describing the giant panda), break them into sentences, then see what percentage of the sentences get classified as each language.
p.percents(texts["english-test"].split("."))
p.percents(texts["spanish-test"].split("."))
Not too bad! The predictor thinks a majority (56%) of the English sentences are actually English, and a majority (56%) of the Spanish sentences are actually Spanish. The mistakes are spread across languages. For example, 13% of the sentences in the Spanish article are classified as French and 1% are classified as German.
There are certainly things we could do to improve our accuracy.
In this reading, we learned about layered design. At the foundation, we built a dense graph that represents edge data in a big table. On top of that, we built a class that models languages as Markov chains; the likelihoods of long strings were so small that they got rounded to zero, so we used log likelihoods instead. Finally, we built a predictor that uses multiple models to find the best fit for a string in an unknown language.