Logs describe a series of events happening in time. When you run a Flask application, capturing the output gives you a log of requests. At the beginning of the semester, we analyzed a log of git commits. When logs contain text, regular expressions can help us extract useful information.
In this reading, we'll use regular expressions to analyze the git history of this repo: https://github.com/tylerharter/cs320. After checking out the repo, we ran git log > git-log.txt to produce the "git-log.txt" file we'll be using here.
We'll use regular expressions to extract the following: commit numbers, email addresses, pull request details, and dates.
b046b85da7f5d65ef2131eefe72a8c1a39c5d139
is an example of a valid commit number: it is 40 characters long and contains only the 16 hexadecimal digits (0-9 and a-f). For test cases, we'll put that valid example on the first line; the following lines are not valid commits (because they are too long, contain invalid characters, or are too short, in that order).
tests = """
b046b85da7f5d65ef2131eefe72a8c1a39c5d139
0000000000000000000000000000000000000001234
b046b85da7f5d65ef2131eefe72a8c1a39c5dzzz
b046b85da7f5d65ef2131eefe72a8c1a39c5d
"""
The character class that matches a hexadecimal digit is [0-9a-fA-F]. Let's try matching 40 of those.
import re

for match in re.findall(r"[0-9a-fA-F]{40}", tests):
    print(match)
Do you see what happened? Although 0000000000000000000000000000000000000001234 is not a valid commit number (it is 43 characters instead of 40), the regex still matches its first 40 characters. This probably isn't what we want.
To fix it, we'll use a new metacharacter, \b. According to the Python docs, this "matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters." (Word characters are [a-zA-Z0-9_], so commit numbers conveniently consist of word characters.)
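Before applying \b to commit numbers, a quick illustration on a made-up string may help (the words here are invented, just to show the boundary behavior):

```python
import re

# \b matches the empty string at a word boundary; "cat" inside "catalog"
# has no boundary after the "t", and "concat" has none before the "c",
# so only the standalone "cat" matches
print(re.findall(r"\bcat\b", "cat catalog concat"))  # ['cat']
```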
import re

for match in re.findall(r"\b[0-9a-fA-F]{40}\b", tests):
    print(match)
Nice! Now that we have developed and tested the regex we want, we can use it on the big log file from the git repo:
with open("git-log.txt") as f:
    log = f.read()
print(log[:500] + "...")
commits = re.findall(r"\b[0-9a-fA-F]{40}\b", log)
print(f"Found {len(commits)} commits. Here are the first 10:")
commits[:10]
We see email addresses in the log, like this:
commit 24c509d20afed69592442f9d1de97131dab6bd3a
Author: tylerharter <tylerharter@gmail.com>
Date: Mon Mar 1 15:59:16 2021 -0600
debug 5
It looks like we want everything between < and >.
emails = re.findall(r"<.*>", log)
emails[:10]
Oops, we don't want to include those angle brackets (< and >). Let's put parentheses, ( and ), around just the email part, with r"<(.*)>".
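To see what the parentheses change, we can try the pattern on a single Author line from the log before running it on the whole file:

```python
import re

# with a group, findall returns just the captured part (the email),
# not the whole match including the angle brackets
line = "Author: tylerharter <tylerharter@gmail.com>"
print(re.findall(r"<(.*)>", line))  # ['tylerharter@gmail.com']
```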
emails = re.findall(r"<(.*)>", log)
print(f"Found {len(emails)} emails. Here are the first 10:")
emails[:10]
If we want, we can throw these in a pandas Series, allowing us to count occurrences and plot the result as a bar plot.
%matplotlib inline
import pandas as pd
pd.Series(emails).value_counts().plot.barh()
Note that some commits look like this:
commit 6f331b977075991eaafc892e2b551e1346cfb396
Merge: bdc058d dee374a
Author: Tyler Caraza-Harter <tylerharter@users.noreply.github.com>
Date: Wed Feb 10 08:29:15 2021 -0600
Merge pull request #25 from ch-shin/master
mini.zip update
The "Author" email address sometimes corresponds to the person who merged a pull request, not the original author. Can we find the actual git users behind the pull requests?
Can we extract ch-shin/master from text like Merge pull request #25 from ch-shin/master, as in the last example of the above section?
re.findall(r"Merge pull request #\d+ from \w+/\w+", log)
That looks promising! Let's put some parentheses around key parts, to extract what we want:
matches = re.findall(r"Merge pull request #(\d+) from (\w+)/(\w+)", log)
for match in matches:
    print(match)
    pr = match[0]
    user = match[1]
    branch = match[2]
We're looping over tuples with the three pieces of info we want: pull request number, GitHub user, and branch. We could pull those out with pr = match[0] and similar, as in the above example. Alternatively, Python has a feature called unpacking that is useful in such cases. Instead of match on the line where we define our for loop, we can list the three variables we want to fill from the tuple we are currently looping over.
for pr, user, branch in matches:
    print(user)
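With the tuples unpacked, we could go further and count how many merged pull requests came from each user. Here is a sketch using collections.Counter; the matches list below is hypothetical (only the #25 ch-shin entry appears in the log excerpt above), standing in for the real findall result:

```python
from collections import Counter

# hypothetical (pr, user, branch) tuples, same shape as the real matches
matches = [("25", "ch-shin", "master"),
           ("26", "ch-shin", "master"),
           ("30", "tylerharter", "dev")]

# count pull requests per user, most frequent first
pr_counts = Counter(user for pr, user, branch in matches)
print(pr_counts.most_common())  # [('ch-shin', 2), ('tylerharter', 1)]
```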
Look at the date line of a commit entry in the log:
commit 24c509d20afed69592442f9d1de97131dab6bd3a
Author: tylerharter <tylerharter@gmail.com>
Date: Mon Mar 1 15:59:16 2021 -0600
debug 5
We see that a date line starts with "Date:" and some spaces (which we'll exclude from the match) and ends with a four-digit number, such as "2021" (which we'll include). Let's try the match.
text = "Date: Mon Mar 1 15:59:16 2021 -0600"
re.findall(r"Date:\s+(.*\d{4})", text)
Do you see the problem? Both Mon Mar 1 15:59:16 2021 and Mon Mar 1 15:59:16 2021 -0600 could theoretically match the group (depending on how many characters .* matches), as both end in four digits. The 0600 is timezone info, and let's assume we don't want that.
By default, the * is greedy, meaning it prefers to match more characters. That's why we got the longer option. If we use *? instead, it will no longer be greedy, and we'll get the shorter version, ending in the four digits of the year.
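Greediness is easier to see on a tiny made-up string first:

```python
import re

# greedy: .* grabs as much as possible, spanning both bracket pairs
print(re.findall(r"<.*>", "<a> <b>"))   # ['<a> <b>']

# non-greedy: .*? stops at the first closing bracket it can
print(re.findall(r"<.*?>", "<a> <b>"))  # ['<a>', '<b>']
```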
text = "Date: Mon Mar 1 15:59:16 2021 -0600"
re.findall(r"Date:\s+(.*?\d{4})", text)
Great, now we can test on the original data.
dates = re.findall(r"Date:\s+(.*?\d{4})", log)
print(f"Found {len(dates)} dates. Here are the first 10:")
dates[:10]
Regular expressions are very helpful for analyzing logs and other text-based data. Unfortunately, it's difficult to get a regular expression correct on the first attempt; we made several intuitive mistakes along the way in this reading before arriving at the correct expressions. Such mistakes are difficult to troubleshoot when running on the big dataset, so do what we did here: create some simple examples and revise your regular expression until it works on those. Then go back and use it on your full dataset.
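That test-first workflow could be packaged as a small helper. This is just a sketch (the function name and argument layout are our own invention), checking a pattern against hand-written positive and negative examples before running it on the full log:

```python
import re

def check_regex(pattern, should_match, should_not_match):
    """Sanity-check a regex against small hand-written examples
    before running it on a big dataset."""
    # fullmatch requires the whole string to match, so \b anchors
    # aren't needed for these standalone test strings
    ok = all(re.fullmatch(pattern, s) for s in should_match)
    bad = any(re.fullmatch(pattern, s) for s in should_not_match)
    return ok and not bad

commit = r"[0-9a-fA-F]{40}"
print(check_regex(commit,
                  ["b046b85da7f5d65ef2131eefe72a8c1a39c5d139"],  # valid
                  ["b046b85da7f5d65ef2131eefe72a8c1a39c5d"]))    # too short
```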