Dense Graphs: Language Detection Example

Ok, enough with graph searching! In this reading (notebook here), we'll write some code that makes a reasonable guess about what language (English, Spanish, etc.) a given string is in -- something like this module: https://pypi.org/project/langdetect/.

There are many ways we might approach this problem. We'll look at the order of letters. For example, the letters "a" and "o" are more likely to be at the end of a word in Spanish than they are in English.

We can model the transitions between letters in a language with a Markov chain. A Markov chain is a graph where nodes represent states and weighted edges represent the probability of transitioning between states. Our Markov chain will look like this:

  • each node will be a letter
  • there will be a directed edge between every pair of nodes
  • each edge will have a number associated with it, indicating the probability that the node/letter pointed to will follow the node/letter pointed from

We will construct our language models (Markov chains) based on sample data, pulled from Wikipedia articles in multiple languages.

Given an example string, we can compute the likelihood that a given language model would generate that string, if we were to use the model to randomly generate a string. Of course, the probability of generating any particular string of significant size is very tiny. However, we can compare the tiny likelihoods from two models to determine which language is more likely. This will require us to deal with extremely small floats. When they get too small (meaning the computer rounds them to zero), we'll learn a trick, log likelihood, to deal with it.

The kinds of graphs we've been generating, where each node has a list of children (these are called sparse graphs), are not efficient when there is an edge between every pair of nodes. In this example, we'll learn to create a dense graph. We'll represent edges with a big table (DataFrame) -- there will be a column for each node and a row for each node. The number in the cell at row A and column B is the weight on the edge from node A to node B.
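To make the distinction concrete, here is a minimal sketch (not from the notebook) of the same small graph stored both ways -- as an adjacency structure (sparse) and as a table of edge weights (dense). The nodes and weights are made up for illustration:

import pandas as pd

# sparse: each node only stores the children it actually has an edge to
sparse = {
    "A": {"B": 3, "C": 1},   # A -> B (weight 3), A -> C (weight 1)
    "B": {},
    "C": {"A": 2},
}

# dense: one big table with a row and a column for every node;
# the cell at row A, column B holds the weight of the edge A -> B
nodes = ["A", "B", "C"]
dense = pd.DataFrame(0, index=nodes, columns=nodes)
for src, children in sparse.items():
    for dst, weight in children.items():
        dense.at[src, dst] = weight
print(dense)

The dense form wastes no space when nearly every cell is nonzero, which is exactly the situation we'll be in.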

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import copy, string, os

Dense Graph

We first need to implement a dense graph, which we'll then use to build our language models. Look at the following code, then read the descriptions below, corresponding to the commented lines.

In [2]:
class dense_graph:
    def __init__(self, nodes):
        self.nodes = sorted(nodes)
        self.nodes.append("?") # A
        self.node_set = set(nodes)
        self.edges = pd.DataFrame(index=self.nodes, columns=self.nodes)
        self.edges.fillna(1, inplace=True) # B

    def _repr_html_(self):
        with pd.option_context('display.max_columns', None): # C
            return self.edges._repr_html_()

    def inc(self, src, dst):
        if not src in self.node_set:
            src = "?"
        if not dst in self.node_set:
            dst = "?"
        self.edges.at[src, dst] += 1
        
    def get_edge(self, src, dst):
        if not src in self.node_set:
            src = "?"
        if not dst in self.node_set:
            dst = "?"
        return self.edges.at[src, dst] # D
        
    def to_prob(self):
        # E
        g = copy.deepcopy(self)
        row_sums = g.edges.sum(axis=1)
        g.edges = g.edges.div(row_sums, axis=0)
        return g

g = dense_graph("ABC") # F

A) to keep our model simple, our language models will only have nodes for the 26 English letters (which also appear frequently in the other languages we'll model), plus space, period, and comma. We'll use "?" as a catch-all node for any other character.

B) the cell at row A, col B represents how many times letter B comes after letter A. You might expect we would count starting from zero, but starting at one will actually help later, since we won't have enough data to say for sure that some transition never happens (details here for the curious: https://en.wikipedia.org/wiki/Additive_smoothing#Pseudocount)

C) when displaying the graph, we'll just show the underlying table of edge weights. This context manager lets us make sure the columns are not hidden by Jupyter to save space.

D) .at works like .loc when you only want to access one cell (but it's faster!)

E) we'll create a normalized table where the values in each row correspond to probabilities that add to 1. Remember that axis=0 goes down and axis=1 goes across. So we're computing sums across (axis=1) to get one sum per row. Then, in the division, we align that series of sums vertically with the rows (axis=0), so each row gets divided by its own sum (see the short pandas sketch after this list).

F) normally we would pass in a list of nodes, but the constructor immediately calls sorted on whatever we pass in, so we can pass in any sequence. Here, passing in the string "ABC" gives us 3 nodes: A, B, C (plus the catch-all "?").
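Here is a minimal pandas sketch (not part of the original notebook; the small table is made up) of the sum/divide pattern that to_prob uses:

import pandas as pd

counts = pd.DataFrame([[1, 4, 2, 1],
                       [1, 1, 1, 2]],
                      index=["A", "B"], columns=["A", "B", "C", "?"])

row_sums = counts.sum(axis=1)         # one sum per row: A -> 8, B -> 5
probs = counts.div(row_sums, axis=0)  # divide each row by its own sum
print(probs)                          # each row now adds up to 1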

Let's use the increment method (inc) to increase the weights on some edges:

In [3]:
g.inc("A", "B")
g.inc("A", "B")
g.inc("A", "B")
g.inc("A", "C")
g.inc("B", "D")
g.inc("E", "A")
g
Out[3]:
A B C ?
A 1 4 2 1
B 1 1 1 2
C 1 1 1 1
? 2 1 1 1

If we look in the "A" row, we see "A" appeared 8 times total; 4 of those times it was followed by a "B". Let's normalize:

In [4]:
probs = g.to_prob()
probs
Out[4]:
A B C ?
A 0.125 0.50 0.25 0.125
B 0.200 0.20 0.20 0.400
C 0.250 0.25 0.25 0.250
? 0.400 0.20 0.20 0.200

The cell at row A, col B tells us that 50% of the time, the letter after "A" is a "B". The get_edge method is a convenient way to look this up.

In [5]:
probs.get_edge("A", "B")
Out[5]:
0.5

Language Examples

We'll download and extract the text of 7 Wikipedia articles:

  • 5 pages describing the Python programming language, in English, Spanish, German, French, and Italian. We'll later create a model based on each of these.
  • 2 pages (in English and Spanish) about giant pandas, the animal. We'll test our models to see if we can automatically detect what language these are in.
In [6]:
urls = {
    "english": "https://en.wikipedia.org/wiki/Python_(programming_language)",
    "spanish": "https://es.wikipedia.org/wiki/Python",
    "german": "https://de.wikipedia.org/wiki/Python_(Programmiersprache)",
    "french": "https://fr.wikipedia.org/wiki/Python_(langage)",
    "italian": "https://it.wikipedia.org/wiki/Python",
    "english-test": "https://en.wikipedia.org/wiki/Giant_panda",
    "spanish-test": "https://es.wikipedia.org/wiki/Ailuropoda_melanoleuca",
}

texts = {}

for lang, url in urls.items():
    path = lang + ".txt"
    
    # have we downloaded it before?
    if not os.path.exists(path):
        r = requests.get(url)
        r.raise_for_status()
        page = BeautifulSoup(r.text)
        with open(path, "w") as f:
            f.write(page.get_text())
    
    # for simplicity, strip out everything except lowercase
    # English letters, spaces, periods, and commas
    with open(path) as f:
        valid = string.ascii_lowercase + " .,"
        text = []
        for c in f.read().lower():
            if c in valid:
                text.append(c)
            else:
                text.append("?")
        texts[lang] = "".join(text)

Let's take a look at the first two files. Notice that we're also accidentally grabbing some English-like text from the web page's code in both cases. That will confuse our models a bit, but hopefully not too much!

In [7]:
print(texts["english"][:5000])
????python ?programming language? ? wikipedia?document.documentelement.classname??client?js??rlconf???wgbreakframes????,?wgseparatortransformtable?????,???,?wgdigittransformtable?????,???,?wgdefaultdateformat???dmy?,?wgmonthnames?????,?january?,?february?,?march?,?april?,?may?,?june?,?july?,?august?,?september?,?october?,?november?,?december??,?wgmonthnamesshort?????,?jan?,?feb?,?mar?,?apr?,?may?,?jun?,?jul?,?aug?,?sep?,?oct?,?nov?,?dec??,?wgrequestid???xk?jlgpaaeaaae?x??oaaacr?,?wgcspnonce????,?wgcanonicalnamespace????,?wgcanonicalspecialpagename????,?wgnamespacenumber???,?wgpagename???python??programming?language??,?wgtitle???python ?programming language??,?wgcurrevisionid???????????,?wgrevisionid???????????,?wgarticleid???????,?wgisarticle????,?wgisredirect????,?wgaction???view?,?wgusername??null,?wgusergroups???????,?wgcategories????articles with short description?,?use dmy dates from august ?????,?all articles with unsourced statements?,?articles with unsourced statements from december ?????,??articles containing potentially dated statements from november ?????,?all articles containing potentially dated statements?,?articles containing potentially dated statements from february ?????,?articles with unsourced statements from february ?????,?articles with curlie links?,?wikipedia articles with bnf identifiers?,?wikipedia articles with gnd identifiers?,?wikipedia articles with lccn identifiers?,?wikipedia articles with sudoc identifiers?,?good articles?,?articles with example python code?,?programming languages?,?class?based programming languages?,?computational notebook?,?computer science in the netherlands?,?cross?platform free software?,?dutch inventions?,?dynamically typed programming languages?,?educational programming languages?,?high?level programming languages?,?information technology in the netherlands?,?multi?paradigm programming languages?,?object?oriented programming languages?,?programming languages created in ?????,?python ?programming language??,??scripting languages?,?text?oriented programming languages?,?cross?platform 
software??,?wgpagecontentlanguage???en?,?wgpagecontentmodel???wikitext?,?wgrelevantpagename???python??programming?language??,?wgrelevantarticleid???????,?wgisprobablyeditable????,?wgrelevantpageisprobablyeditable????,?wgrestrictionedit????,?wgrestrictionmove????,?wgmediavieweronclick????,?wgmediaviewerenabledbydefault????,?wgpopupsreferencepreviews????,?wgpopupsconflictswithnavpopupgadget????,?wgvisualeditor????pagelanguagecode???en?,?pagelanguagedir???ltr?,?pagevariantfallbacks???en??,?wgmfdisplaywikibasedescriptions????search????,?nearby????,?watchlist????,?tagline?????,?wgwmeschemaeditattemptstepoversample????,?wgulscurrentautonym???english?,?wgnoticeproject???wikipedia?,?wgwikibaseitemid???q??????,?wgcentralauthmobiledomain????,?wgeditsubmitbuttonlabelpublish??????rlstate???ext.globalcssjs.user.styles???ready?,?site.styles???ready?,?noscript???ready?,?user.styles???ready?,??ext.globalcssjs.user???ready?,?user???ready?,?user.options???ready?,?user.tokens???loading?,?ext.cite.styles???ready?,?ext.pygments???ready?,?mediawiki.legacy.shared???ready?,?mediawiki.legacy.commonprint???ready?,?jquery.makecollapsible.styles???ready?,?mediawiki.toc.styles???ready?,?skins.vector.styles???ready?,?wikibase.client.init???ready?,?ext.visualeditor.desktoparticletarget.noscript???ready?,?ext.uls.interlanguage???ready?,?ext.wikimediabadges???ready???rlpagemodules???ext.cite.ux?enhancements?,?ext.scribunto.logs?,?site?,?mediawiki.page.startup?,?skins.vector.js?,?mediawiki.page.ready?,?jquery.makecollapsible?,?mediawiki.toc?,?ext.gadget.referencetooltips?,?ext.gadget.watchlist?notice?,?ext.gadget.drn?wizard?,?ext.gadget.charinsert?,?ext.gadget.reftoolbar?,?ext.gadget.extra?toolbar?buttons?,?ext.gadget.switcher?,?ext.centralauth.centralautologin?,?mmv.head?,?mmv.bootstrap.autostart?,?ext.popups?,?ext.visualeditor.desktoparticletarget.init?,?ext.visualeditor.targetloader?,??ext.eventlogging?,?ext.wikimediaevents?,?ext.navigationtiming?,?ext.uls.compactlinks?,?ext.uls.interface?,?ext.cx.eventlogging.campaigns?,?ext.quicksurveys.init?,?ext.centralnotice.geoip?,?ext.centralnotice.startup?????rlq?window.rlq?????.push?function???mw.loader.implement??user.tokens?tffin?,function??,jquery,require,module?????nomin??mw.user.tokens.set???patroltoken???????,?watchtoken???????,?csrftoken????????????????????????????????????????????????python ?programming language???from wikipedia, the free encyclopedia???jump to navigation?jump to search?for other uses, see python.?general?purpose, high?level programming language???pythonparadigmmulti?paradigm? functional, imperative, object?oriented, reflectivedesigned?byguido van rossumdeveloperpython software foundationfirst?appeared????? ???years ago??????????stable release?.?.??   ? ???december ????? ? months ago????????????????preview release?.?.?a??   ? ???january ????? ?? days ago?????????????????typing disciplineduck, dynamic, gradual ?since ?.?????licensepython s
In [8]:
print(texts["spanish"][:5000])
????python ? wikipedia, la enciclopedia libre?document.documentelement.classname??client?js??rlconf???wgbreakframes????,?wgseparatortransformtable????,?t.?,???t,??,?wgdigittransformtable?????,???,?wgdefaultdateformat???dmy?,?wgmonthnames?????,?enero?,?febrero?,?marzo?,?abril?,?mayo?,?junio?,?julio?,?agosto?,?septiembre?,?octubre?,?noviembre?,?diciembre??,?wgmonthnamesshort?????,?ene?,?feb?,?mar?,?abr?,?may?,?jun?,?jul?,?ago?,?sep?,?oct?,?nov?,?dic??,?wgrequestid???xkvjzapamewaai?plcmaaabf?,?wgcspnonce????,?wgcanonicalnamespace????,?wgcanonicalspecialpagename????,?wgnamespacenumber???,?wgpagename???python?,?wgtitle???python?,?wgcurrevisionid???????????,?wgrevisionid???????????,?wgarticleid??????,?wgisarticle????,?wgisredirect????,?wgaction???view?,?wgusername??null,?wgusergroups???????,?wgcategories????wikipedia?art?culos con datos por trasladar a wikidata?,?wikipedia?art?culos destacados en la wikipedia en ruso?,?wikipedia?art?culos buenos en la wikipedia en alem?n?,??wikipedia?art?culos buenos en la wikipedia en ingl?s?,?wikipedia?art?culos con identificadores bnf?,?wikipedia?art?culos con identificadores gnd?,?wikipedia?art?culos con identificadores lccn?,?python?,?lenguajes de programaci?n orientada a objetos?,?lenguajes de programaci?n de alto nivel?,?lenguajes de programaci?n din?micamente tipados?,?lenguajes de programaci?n educativos?,?software de ?????,?pa?ses bajos en ?????,?ciencia y tecnolog?a de los pa?ses bajos??,?wgpagecontentlanguage???es?,?wgpagecontentmodel???wikitext?,?wgrelevantpagename???python?,?wgrelevantarticleid??????,?wgisprobablyeditable????,?wgrelevantpageisprobablyeditable????,?wgrestrictionedit????,?wgrestrictionmove????,?wgmediavieweronclick????,?wgmediaviewerenabledbydefault????,?wgpopupsreferencepreviews????,?wgpopupsconflictswithnavpopupgadget????,?wgvisualeditor????pagelanguagecode???es?,?pagelanguagedir???ltr?,?pagevariantfallbacks???es??,?wgmfdisplaywikibasedescriptions????search????,??nearby????,?watchlist????,?tagline?????,?wgwmeschemaeditattemptstepoversample????,?wgulscurrentautonym???espa?ol?,?wgnoticeproject???wikipedia?,?wgwikibaseitemid???q??????,?wgcentralauthmobiledomain????,?wgeditsubmitbuttonlabelpublish??????rlstate???ext.gadget.imagenesinfobox???ready?,?ext.globalcssjs.user.styles???ready?,?site.styles???ready?,?noscript???ready?,?user.styles???ready?,?ext.globalcssjs.user???ready?,?user???ready?,?user.options???loading?,?user.tokens???loading?,?ext.cite.styles???ready?,?ext.pygments???ready?,?mediawiki.legacy.shared???ready?,?mediawiki.legacy.commonprint???ready?,?mediawiki.toc.styles???ready?,?mediawiki.skinning.interface???ready?,?skins.vector.styles???ready?,?wikibase.client.init???ready?,?ext.visualeditor.desktoparticletarget.noscript???ready?,?ext.uls.interlanguage???ready?,?ext.wikimediabadges???ready???rlpagemodules???ext.cite.ux?enhancements?,?site?,?mediawiki.page.startup?,?skins.vector.js?,?mediawiki.page.ready?,??mediawiki.toc?,?ext.gadget.a?commons?directo?,?ext.gadget.referencetooltips?,?ext.gadget.reftoolbar?,?ext.centralauth.centralautologin?,?mmv.head?,?mmv.bootstrap.autostart?,?ext.popups?,?ext.visualeditor.desktoparticletarget.init?,?ext.visualeditor.targetloader?,?ext.eventlogging?,?ext.wikimediaevents?,?ext.navigationtiming?,?ext.uls.compactlinks?,?ext.uls.interface?,?ext.cx.eventlogging.campaigns?,?ext.quicksurveys.init?,?ext.centralnotice.geoip?,?ext.centralnotice.startup?????rlq?window.rlq?????.push?function???mw.loader.implement??user.options?wq????,function??,jquery,require,module?????nomin??mw.user.options.set???va
riant???es????????mw.loader.implement??user.tokens?tffin?,function??,jquery,require,module?????nomin??mw.user.tokens.set???patroltoken???????,?watchtoken???????,?csrftoken?????????????????????????????????????????????????python??de wikipedia, la enciclopedia libre???ir a la navegaci?n?ir a la b?squeda?este art?culo trata sobre el lenguaje de programaci?n. para el grupo de humoristas, v?ase monty python. para el rev?lver, v?ase colt python.? para otros usos de este t?rmino, v?ase pit?n.?python?desarrollador?es??python software foundationsitio web oficialinformaci?n generalextensiones comunes?.py, .pyc, .pyd, .pyo, .pyw, .pyzparadigma?multiparadigma? orientado a objetos, imperativo, funcional, reflexivoapareci? en?????dise?ado por?guido van rossum?ltima versi?n estable??.?.????????? de diciembre de ???? ?? meses?sistema de tipos?fuertemente tipado, din?micoimplementaciones?cpython, ironpython, jython, python for s??, pypy, activepython, unladen swallowdialectos?stackless python, rpythoninfluido por?abc, algol ??, c, haskell, icon, lisp, modula??, perl, smalltalk, javaha influido a?boo, cobra, d, falcon, genie, groovy, ruby, javascript, cython, go latinosistema operativo?multiplataformalicencia?python software foundation license?editar datos en wikidata??python es un lenguaje de programaci?n interpretado cuya filosof?a hace hincapi? en la legibilidad de su c?digo.?se trata de un lenguaje de programaci?n multiparadigma, ya que

Language Profiles

Let's write a LangProfile class to model the letter transitions in a language. We'll train the model on some example input text in the constructor, to determine the transition probabilities.

In [9]:
class LangProfile:
    def __init__(self, name, text):
        self.name = name

        g = dense_graph(valid)
        for i in range(len(text)-1):
            g.inc(text[i], text[i+1])
        self.graph = g.to_prob()

    def prob(self, text):
        p = 1
        for i in range(len(text)-1):
            p *= self.graph.get_edge(text[i], text[i+1])
        return p
In [10]:
english = LangProfile("english", texts["english"])
spanish = LangProfile("spanish", texts["spanish"])
spanish.graph.edges.iloc[:8,:8]
Out[10]:
          (space)        ,        .        a        b        c        d        e
(space)  0.059830 0.000303 0.003938 0.048470 0.016056 0.067707 0.103756 0.095577
,        0.634473 0.001486 0.001486 0.004458 0.004458 0.001486 0.001486 0.001486
.        0.230100 0.002488 0.130597 0.007463 0.002488 0.029851 0.009950 0.007463
a        0.213605 0.014293 0.010852 0.002118 0.030439 0.062467 0.081525 0.001588
b        0.017575 0.007030 0.010545 0.096661 0.001757 0.035149 0.024605 0.050967
c        0.007563 0.002909 0.004072 0.134380 0.000582 0.041303 0.003490 0.078534
d        0.023166 0.002758 0.001655 0.120794 0.002758 0.001103 0.001655 0.423056
e        0.226607 0.008568 0.009696 0.022548 0.006539 0.035400 0.046449 0.004510

The above table tells us that in Spanish, 10.4% of words start with a "d" (that's the value in the space row, "d" column: the frequency of "d" right after a space), whereas only 1.6% start with a "b".
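Assuming the spanish profile built in the previous cell, we can read these word-start probabilities directly off the graph with get_edge (a small sketch, not in the original notebook):

# probability of each letter appearing right after a space (i.e., starting a word)
print(spanish.graph.get_edge(" ", "d"))   # about 0.104
print(spanish.graph.get_edge(" ", "b"))   # about 0.016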

We can notice significant differences between the languages. For example, only about 6% of "o" or "a" appearances in an English sentence end a word (i.e., they are followed by a space). For Spanish, it is about 20%.

In [11]:
print("English O/A ending:", english.prob("o "), english.prob("a "))
print("Spanish O/A ending:", spanish.prob("o "), spanish.prob("a "))
English O/A ending: 0.058113544926240504 0.05953878406708595
Spanish O/A ending: 0.19825783972125435 0.21360508205399684

We can also use the prob method to compute the odds that our (extremely simplified) models of each language would generate a given word, if we took the starting letter then kept randomly appending based on the edge weights.
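To make "randomly appending based on the edge weights" concrete, here's a minimal sketch (not part of the original notebook; the generate helper is hypothetical) that walks a model's transition table to produce a random string in the style of that language:

import random

def generate(profile, length=30):
    letters = list(profile.graph.edges.columns)
    c = " "                                             # start as if at the beginning of a word
    out = []
    for _ in range(length):
        weights = profile.graph.edges.loc[c].tolist()   # row of transition probabilities from c
        c = random.choices(letters, weights=weights)[0] # pick the next letter according to those weights
        out.append(c)
    return "".join(out)

print(generate(english))   # gibberish, but English-flavored gibberish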

How likely is the English model to generate the word "house"? The Spanish model?

In [12]:
english.prob("house"), spanish.prob("house")
Out[12]:
(0.0001949407907586606, 3.979239280202041e-05)

We can see that the word "house" fits the English model better -- good! Let's try the Spanish word for the same:

In [13]:
english.prob("casa"), spanish.prob("casa")
Out[13]:
(9.689361588272094e-05, 0.0003401731927203949)

Great, that one fits Spanish better! (note the scientific notation: the English model's number is very small).

Don't be concerned that the numbers are very small. If we're generating a random English-like word, there are millions of strings that we could come up with. So it's OK that "casa" is small in the Spanish model. The interesting thing is that the Spanish model gives the bigger number, even though both are very small.

Log Likelihood: Motivation

The longer the strings are, the smaller the likelihoods we'll get:

In [14]:
print(english.prob("this is an example of a sentence in english, can we detect that?"))
print(spanish.prob("this is an example of a sentence in english, can we detect that?"))
7.965927008415434e-67
7.471379930386457e-72

Take a close look -- those are extremely small numbers in scientific notation! Let's generate a slightly longer string and try that:

In [15]:
long_str = "this is a sentence. " * 20
print(long_str)
print(english.prob(long_str))
print(spanish.prob(long_str))
this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. 
0.0
0.0

The prob function only works for short strings. For longer strings, the likelihoods get so small that they get rounded to zero.
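The underlying issue is floating-point underflow: multiply enough probabilities together and the result becomes too small for a float to represent, so it rounds to exactly 0.0. A tiny sketch (not in the original notebook):

p = 1.0
for _ in range(200):
    p *= 0.01          # multiply 200 probabilities of 0.01 together
print(p)               # 0.0 -- underflowed, even though the true value is positive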

One good way to deal with very small or very large numbers is to take the log of them. Let's do that!

Review Logs, Product Rule

You may have heard that the base-10 log of an integer greater than 1 is approximately the number of digits in the number (with error at most 1).

Here's an easy-to-remember approximation for small numbers too:

In [16]:
from math import log10

def loggy(x):
    if x > 1:
        return len(str(int(x)))
    else:
        return -len(str(int(1/x)))

val = 0.0001
for i in range(20):
    print(val, log10(val), loggy(val))
    val *= 2
0.0001 -4.0 -5
0.0002 -3.6989700043360187 -4
0.0004 -3.3979400086720375 -4
0.0008 -3.0969100130080562 -4
0.0016 -2.795880017344075 -3
0.0032 -2.494850021680094 -3
0.0064 -2.193820026016113 -3
0.0128 -1.8927900303521317 -2
0.0256 -1.5917600346881504 -2
0.0512 -1.2907300390241692 -2
0.1024 -0.989700043360188 -1
0.2048 -0.6886700476962069 -1
0.4096 -0.3876400520322256 -1
0.8192 -0.08661005636824444 -1
1.6384 0.21441993929573674 1
3.2768 0.5154499349597179 1
6.5536 0.8164799306236992 1
13.1072 1.1175099262876804 2
26.2144 1.4185399219516615 2
52.4288 1.7195699176156427 2

As you can see, the error is never more than 1 (compare the loggy approximation with the actual log10). More importantly, thinking of loggy will help us intuit various log rules.

For example, if you multiply a 10 digit integer by a 20 digit integer (both positive), you can probably guess that the result will be about 30 digits. This is the intuition behind the product rule for logs: a 10 digit number is roughly $10^{10}$ and a 20 digit number is roughly $10^{20}$, and

$\log_{10}(10^{10} \cdot 10^{20}) = \log_{10}(10^{10}) + \log_{10}(10^{20}) = 10 + 20 = 30$

Let's see this by multiplying a 16 digit number by an 11 digit number:

In [17]:
A = 1259061235607506
B = 12498123469
In [18]:
# approximate
print(loggy(A), loggy(B))
print(loggy(A * B))
print(loggy(A) + loggy(B))
16 11
26
27
In [19]:
# actual calculation
print(log10(A), log10(B))
print(log10(A * B))
print(log10(A) + log10(B))
15.10004685293527 10.096844810749097
25.19689166368437
25.19689166368437

Although the intuition may be less obvious for very small positive numbers, the multiplication rule still holds:

$\log_{10}(X \cdot Y) = \log_{10}(X) + \log_{10}(Y)$

Before, we ran into trouble because we multiplied so many probabilities < 1 together that the likelihood ultimately rounded to zero, but now we can use this log rule to compute the log likelihood instead: we add up the logs of all the individual probabilities to get the log of the final likelihood.

How is the log of the likelihood useful? Well, if we compute it for two models, we can figure out which one is more likely to generate the string in question. We never cared about a precise likelihood calculation anyway; we only wanted to make this comparison to choose the language model that best explains the text.

Let's monkey patch in the method to compute the log likelihood:

In [20]:
def log_prob(self, text):
    #p = 1
    logp = 0 # log10(1)
    for i in range(len(text)-1):
        # p *= self.graph.get_edge(text[i], text[i+1])
        logp += log10(self.graph.get_edge(text[i], text[i+1]))
    return logp

LangProfile.log_prob = log_prob

As a test, let's make sure that taking the log of our old result (with prob) gives us the same answer as the new approach, where we sum the logs of each individual probability:

In [21]:
english = LangProfile("english", texts["english"])
spanish = LangProfile("spanish", texts["spanish"])

print(log10(english.prob("house")))
print(english.log_prob("house"))
-3.7100972765937996
-3.7100972765937996

Yay, it works! In general, we'll always see negative numbers for log likelihood, because all the likelihoods will be < 1. Bigger is still better: a log likelihood of -2 corresponds to a likelihood of 0.01, which is larger than the 0.001 that a log likelihood of -3 corresponds to.

Does this fix our earlier problem, when we were rounding to zero?

In [22]:
print(long_str)
print(english.prob(long_str))
print(spanish.prob(long_str))
print(english.log_prob(long_str))
print(spanish.log_prob(long_str))
this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. this is a sentence. 
0.0
0.0
-395.2220089854635
-417.16647550702294

Yes it does! Now we can tell the English model is more likely to produce a string like that than the Spanish model.

Prediction

So far, we've built a dense graph, then used that to build a Markov chain-based model, which we can train per language. These models can tell us the likelihood of producing a given string.

The last part is to compute the likelihood of a given string under each language model, then predict the language of the string based on whichever model gives the biggest likelihood. Let's do that with a LangPredictor class:

In [23]:
class LangPredictor:
    def __init__(self, profiles):
        self.profiles = profiles
        
    def predict(self, line):
        profile = max(self.profiles, key=lambda profile: profile.log_prob(line))
        return profile.name
    
    def percents(self, lines):
        counts = {p.name: 0 for p in self.profiles}
        for line in lines:
            counts[self.predict(line)] += 1
        for k in counts:
            counts[k] /= len(lines)
        return counts

p = LangPredictor([
    LangProfile("english", texts["english"]),
    LangProfile("spanish", texts["spanish"]),
    LangProfile("french", texts["french"]),
    LangProfile("italian", texts["italian"]),
    LangProfile("german", texts["german"]),
])

Let's try it for some simple strings:

In [24]:
p.predict("hello friends!")
Out[24]:
'english'
In [25]:
p.predict("hola amigos!")
Out[25]:
'spanish'

Let's do a more comprehensive test. We'll take the test inputs (the English and Spanish Wikipedia pages describing the giant panda), break them into sentences, then see what percentage of the sentences get classified as each language.

In [26]:
p.percents(texts["english-test"].split("."))
Out[26]:
{'english': 0.5642900670322973,
 'spanish': 0.07312614259597806,
 'french': 0.15112736136502133,
 'italian': 0.11151736745886655,
 'german': 0.09993906154783669}
In [27]:
p.percents(texts["spanish-test"].split("."))
Out[27]:
{'english': 0.09903381642512077,
 'spanish': 0.5628019323671497,
 'french': 0.13285024154589373,
 'italian': 0.10628019323671498,
 'german': 0.09903381642512077}

Not too bad! The predictor thinks a majority (56%) of the English sentences are actually English, and a majority (56%) of the Spanish sentences are actually Spanish. The mistakes are spread across languages. For example, 13% of the sentences in the Spanish article are classified as French and 10% are classified as German.

There are certain things we could do to improve our accuracy:

  • replacing all non-English characters with "?" is a big disadvantage -- some characters will by themselves give a very strong hint about what language is used!
  • we learned our probabilities based on a single Wikipedia article. Why not train on thousands of articles?
  • we are only considering what letter is likely to come next after the previous letter. Why not compute the probability based on the past 2 or 3 letters? (see the sketch after this list)
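To illustrate that last idea, here's a minimal sketch (not part of the original notebook; the pair_counts helper is made up) of counting transitions where the state is the previous two letters instead of one. Turning these counts into smoothed probabilities and a log_prob method would follow the same pattern as LangProfile:

from collections import defaultdict

def pair_counts(text):
    # counts[("th", "e")] is how many times "e" follows the pair "th"
    counts = defaultdict(int)
    for i in range(len(text) - 2):
        counts[(text[i:i+2], text[i+2])] += 1
    return counts

counts = pair_counts(texts["english"])
print(counts[("th", "e")], counts[("th", "o")])  # "the" should be far more common than "tho"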

Conclusions

In this reading, we learned about layered design. At the foundation, we built a dense graph that represents edge data in a big table. On that, we built a class that models languages as Markov chains; likelihoods of long strings were so small that they got rounded to zero, so we used log likelihood instead. Finally, we built a predictor that uses multiple models to find the best fit for a string in an unknown language.
