Results

My analyses looked to answer the following five questions:

  1. How have the number and breakdown of crimes in St. Paul, Minnesota changed over the past five years?

  2. How are crimes in St. Paul spatially distributed and which areas see the largest number of incidences?

  3. What percentage of annual traffic stops involve certain ethnicities, and where are each ethnicity's traffic stops concentrated in St. Paul?

  4. Is there a strong correlation between the number of liquor licenses in a precinct and the amount of graffiti seen in the precinct?

  5. Can certain characteristics of a precinct be utilized within a Least Squares Regression model to predict the number of annual crimes in the same area?



Question #1: How have the number and breakdown of crimes in St. Paul, Minnesota changed over the past five years?

Data on reported crime incidences in St. Paul from 2015 to 2019 were used. Crimes were broken down into five categories: property damage-related crimes (Vandalism, Graffiti, and Arson), crimes relating to physical attacks (Assault, Domestic Violence, and Discharge), crimes relating to stealing (Theft, Burglary, and Robbery), drug-related crimes, and Rape/Homicide. These categories were formed by combining multiple smaller categories. The categories "Proactive Police Visit" and "Community Engagement Event" were removed from consideration, as these incidences do not relate to any committed crime.


Figure 1.


The proportion of each year's total number of crime incidences that belongs to each of the five categories is approximately the same from 2015 to 2019, with each category's sample standard deviation ranging from as low as 0.086% for Rape/Homicide up to 0.79% for physical crimes. These findings indicate that none of the five categories has substantially increased or decreased in relative prominence over the past five years. The most common crime category is by far Theft/Burglary/Robbery, which accounted for between 13,635 and 15,377 crimes annually, or between 61.3% and 63.4% of annual crimes. Property-related crimes, physical crimes, drug-related crimes, and Rape/Homicide were the second to fifth most common categories, accounting for between 12.9% and 14.3%, 12.0% and 14.0%, 9.9% and 11.1%, and 0.59% and 0.8% of annual crimes, respectively.
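The share and spread figures above can be reproduced with a short calculation. The following is a minimal sketch using only the Python standard library; the yearly counts and category keys are hypothetical placeholders, not the values from the St. Paul data.

```python
from statistics import stdev

# Hypothetical yearly counts per category (illustrative only, not the report's data).
counts = {
    2015: {"theft": 620, "property": 140, "physical": 130, "drug": 100, "rape_homicide": 10},
    2016: {"theft": 630, "property": 135, "physical": 125, "drug": 103, "rape_homicide": 7},
    2017: {"theft": 615, "property": 138, "physical": 135, "drug": 105, "rape_homicide": 7},
}

def category_shares(year_counts):
    """Return each category's share of the year's total, in percent."""
    total = sum(year_counts.values())
    return {cat: 100.0 * n / total for cat, n in year_counts.items()}

def share_spread(counts_by_year):
    """Sample standard deviation of each category's yearly share, in percentage points."""
    shares = {year: category_shares(c) for year, c in counts_by_year.items()}
    categories = list(next(iter(counts_by_year.values())))
    return {cat: stdev([shares[year][cat] for year in shares]) for cat in categories}

shares_2015 = category_shares(counts[2015])
spread = share_spread(counts)
```

A small spread for every category, as reported above, is what indicates that the mix of crimes stayed stable from year to year.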

The total number of annual crimes also saw little change between 2015 and 2019, with no clear positive or negative trend. 2017 stands out from the other years, however, as it saw an 11.73% drop in the number of reported crimes from 2016 and 8.4% fewer crimes than the next lowest year, 2018. All of the categories besides Rape/Homicide contributed to this drop, with property-related crimes, physical crimes, Theft/Burglary/Robbery, and drug-related crimes decreasing by 663 (-18.87%), 141 (-4.12%), 1,742 (-11.33%), and 341 (-13.69%), respectively. 2017 was also the first year that the St. Paul crime incident database began widely tracking "Proactive Police Visits" and "Community Engagement Events", with the number of "Proactive Police Visits" increasing from 1,870 in 2016 to 29,835 in 2017. It is therefore likely that this drop-off stems from inconsistencies in how different police officers reported incidences, resulting in far more incidences being reported as "Proactive Police Visits" than there should have been. This substantial decrease in crime incidences is explored further in the next question to see whether it was uniform across the police grid or concentrated in certain areas.
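The year-over-year figures in this paragraph are simple percent changes. As a minimal sketch (the function name is my own), the calculation applied to the two Proactive Police Visit counts quoted above looks like this:

```python
def pct_change(previous, current):
    """Percent change from previous to current; a negative value means a drop."""
    if previous == 0:
        raise ValueError("previous count must be non-zero")
    return 100.0 * (current - previous) / previous

# Proactive Police Visit counts quoted in the text: 1870 in 2016, 29835 in 2017.
visits_change = pct_change(1870, 29835)
```

Under these two counts the increase is roughly 1,495%, which shows how sharply the reporting of this incident type changed between the two years.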

Based on the information presented in this graph, it is recommended that the St. Paul Police Department allocate a majority of its community outreach efforts towards educating homeowners and businesses on how to best decrease the chances of their property being stolen or broken into, as these crimes make up the vast majority of St. Paul's crime incidences. Additional research should go into how to best educate St. Paul citizens about domestic violence and drug addiction, as these types of crimes also make up a substantial portion of St. Paul's crimes and have a longer lasting impact on their victims.


Question #2: How are crimes in St. Paul spatially distributed and which areas see the largest number of incidences?

For the purposes of mapping out the locations of different crimes throughout the city, the St. Paul Police Department has developed a police grid that breaks up St. Paul into two hundred precincts. Each incident report from 2016 to 2018 was grouped by the month and precinct number in which it occurred. The relative number of incidences in each precinct in a given month was used to visualize the number of crimes across St. Paul by shading each precinct, with darker shades of grey indicating that more crimes took place in the given precinct during that month.
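The grouping and shading step can be sketched as follows. This is an illustration only: the record layout (an ISO date string plus a precinct number) and the choice to scale each month relative to its busiest precinct are assumptions, not details taken from the original analysis.

```python
from collections import Counter

# Hypothetical incident records: (ISO date string, precinct number).
incidents = [
    ("2016-04-03", 87), ("2016-04-11", 87), ("2016-04-19", 87),
    ("2016-04-07", 12), ("2016-05-02", 12), ("2016-05-20", 87),
]

def monthly_counts(records):
    """Count incidents per (year-month, precinct) pair."""
    return Counter((date[:7], precinct) for date, precinct in records)

def grey_shades(counts, month):
    """Map each precinct's count in `month` to a darkness value between 0 and 1.

    1 is the darkest shade and goes to the busiest precinct of that month.
    Scaling relative to the monthly maximum is an assumption of this sketch.
    """
    in_month = {p: n for (m, p), n in counts.items() if m == month}
    busiest = max(in_month.values(), default=0)
    if busiest == 0:
        return {}
    return {p: n / busiest for p, n in in_month.items()}

counts = monthly_counts(incidents)
april = grey_shades(counts, "2016-04")
```

Scaling within each month makes the busiest precincts easy to spot in every frame; scaling against a single maximum over all months would instead make frames comparable with each other.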


Animation 1.


When viewing the animation from 01/2017 to 12/2017, it is clear that the total number of crimes in 2017 was substantially lower than in 2016 and 2018 because of a sharp downturn in crimes during January and February. These decreases were fairly uniform across St. Paul, with 88% of precincts experiencing a decrease of up to 93 in their number of January crimes from 2016 to 2017 and an average decrease of 6.98 January crimes per precinct. Similar results were seen when comparing February 2016 and February 2017. Looking back at the 2017 Crime Incident Report data, the number of Proactive Police Visits is also substantially smaller in January and February than in the other months: the average number of Proactive Police Visits in January and February of 2017 was 466.50, which is 2,423.70 fewer than the monthly average for the rest of the year. Because this decrease also extended to Proactive Police Visits, my earlier hypothesis that the drop in crimes came from a large number of crimes being misreported as Proactive Police Visits is not supported. It is more likely that the drop in incidences at the beginning of 2017 is simply due to missing data.

Moving away from just 2017, this animation shows which precincts consistently produce the most crime and how the total number of crimes per month tends to change throughout the year. Using 04/2016 as an example, the ten most active precincts accounted for 20.41% of the total monthly crimes and were all concentrated in one of three areas. Precincts along University Avenue, in the heart of downtown, and near the Payne-Phalen neighborhood accounted for 11.14%, 3.91%, and 5.36% of the total monthly crimes, respectively. University Avenue, in particular, can be seen on the left side of the map as the nearly black precinct. Downtown is located in the middle of the map, with Payne-Phalen located just northeast of downtown. 2016 is also a good example of how the total number of monthly crimes changes throughout the year, with crimes increasing from a low in January up through August, then decreasing through November, and then spiking in December during the holiday season. In 2016, crimes increased on average by 285 per month from February to August, with December having 20.29% more crimes than the next busiest month.
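The concentration figure quoted for the ten most active precincts is a top-k share. A minimal sketch of that calculation is shown below; the function name and the example counts are hypothetical.

```python
def top_k_share(counts_by_precinct, k=10):
    """Percent of all incidents that fall in the k busiest precincts."""
    total = sum(counts_by_precinct.values())
    if total == 0:
        return 0.0
    top = sorted(counts_by_precinct.values(), reverse=True)[:k]
    return 100.0 * sum(top) / total

# Hypothetical monthly counts for five precincts (illustrative only).
example = {1: 50, 2: 30, 3: 10, 4: 5, 5: 5}
share_top2 = top_k_share(example, k=2)
```

With two hundred precincts, ten precincts make up 5% of the grid, so a share of 20.41% means those precincts see roughly four times their proportional number of incidents.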

While crimes can occur in any precinct at any time of the year, this analysis leads me to recommend that the St. Paul Police Department focus its resources and officers on the three areas previously mentioned, especially during the summer months and December. Additional resources should also be allocated to improving the living conditions of the neighborhoods along University Avenue and near the Payne-Phalen neighborhood as, unlike the downtown area, these areas house many low-income families and have very high numbers of crimes per person.


Question #3: What percentage of annual traffic stops involve certain ethnicities, and how do these percentages compare to each ethnicity's population percentage?

Data on each traffic stop conducted from 2009 to 2018 was collected and then grouped based on the year the stop was conducted in and the ethnicity of the driver. The "Other" category was not implemented until 2016.


Figure 2.


Over the past ten years, the demographic groups that account for the largest and smallest shares of traffic stops have stayed relatively stable. Each year either White or Black drivers have accounted for the largest percentage of traffic stops, with White drivers accounting for between 37.53% and 47.16% of stops and Black drivers accounting for between 32.95% and 42.36%. Asian drivers account for the third largest percentage of traffic stops, with between 9.90% and 13.51% being attributable to them. Latino and Native American drivers account for 5.64% to 8.48% and 0.99% to 2.97%, respectively. White and Black drivers have the most volatile traffic-stop percentages, with standard deviations of 2.89% and 3.28%, respectively. White drivers have seen little to no long-term change in their traffic stop percentage, while Black drivers have seen a steady decrease of approximately 0.68% per year from 2009 to 2018. The percentage of traffic stops involving Asian drivers has increased by about 0.25% per year, while Latino and Native American representation in traffic stops has decreased annually by 0.23% and 0.06%, respectively.

These statistics are made more meaningful when paired with census data. By comparing each ethnicity's traffic stop representation to its St. Paul population representation, we can explain some of the trends seen in Figure 2 and determine whether or not each ethnic group accounts for a proportionally fair percentage of the traffic stops in St. Paul. For instance, according to the US Census Bureau, the Asian American population in St. Paul increased by 22.73% from 2009 to 2014, which would help to explain why the percentage of traffic stops involving Asian drivers increased over the same time period. In addition, the population that identifies itself as being two or more races or some other race increased by 21.33% from 2009 to 2014. This could partially explain why the percentage of traffic stops involving Latino drivers steadily fell during this time period and continues to fall to this day, as individuals who identify as more than one race very commonly identify as Latino, making it likely that many drivers who used to identify as strictly Latino are now placed in the "Other" category. Asian, Latino, and Native American drivers are all represented in traffic stops roughly in proportion to their share of the population. The two ethnic groups whose traffic stop percentages differ most from their population percentages are White and Black drivers. In 2014, despite accounting for 58.61% of the population, White drivers only accounted for 38.18% of traffic stops. Black drivers, on the other hand, accounted for 39.51% of traffic stops in 2014 despite only representing 15.68% of the population.
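One way to put the 2014 comparison on a single scale is to divide each group's share of traffic stops by its share of the population. This ratio is my own illustrative summary, not a statistic from the original analysis; the input percentages are the ones quoted above.

```python
# 2014 figures quoted in the text: share of traffic stops and share of population (%).
stops_2014 = {"White": 38.18, "Black": 39.51}
population_2014 = {"White": 58.61, "Black": 15.68}

def representation_ratio(stop_share, population_share):
    """Ratio of stop share to population share; 1.0 means proportional representation."""
    return stop_share / population_share

ratios = {
    group: representation_ratio(stops_2014[group], population_2014[group])
    for group in stops_2014
}
```

With these inputs the ratio is about 0.65 for White drivers and about 2.52 for Black drivers, so Black drivers appear in traffic stops at roughly two and a half times their population share.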

These values indicate that the St. Paul Police Department might be implementing practices that are systematically biased against Black drivers, such as policing drivers in predominantly Black neighborhoods more frequently. I recommend that the St. Paul Police Department review the types of actions for which its police officers can pull over vehicles and ensure that they are applied fairly to all ethnic groups. In addition, I recommend that St. Paul's police officers ensure that they are patrolling each area of St. Paul a fair amount and do not stay in one neighborhood for too long. These changes can most easily be implemented if it is understood where each ethnic group is most likely to be pulled over in St. Paul. To analyze where these traffic stops occur, each traffic stop from 2014 to 2018 was grouped by year and ethnicity. These totals were then mapped on the police grid over time in a similar fashion to Animation 1, with darker colors indicating more traffic stops for the given ethnicity in that year.



Animation 2.



Based on this animation, the following can be said for each of the five ethnic groups:

Based on these findings, I recommend that St. Paul's police department reevaluate the amount of time and resources allocated solely to traffic stops in the areas mentioned, especially University Avenue and the Payne-Phalen neighborhood.


Question #4: Is there a strong correlation between the number of liquor licenses in a precinct and the amount of graffiti seen in the precinct?

Data on the number of liquor licenses issued and the total graffiti-related costs from 2013 to 2015, including both material and labor costs, were collected for each precinct. Precincts were further grouped into four categories:

  1. On-Sale: The precinct has at least one holder of a liquor license that is allowed to sell liquor on-sale, meaning purchasers may consume the liquor at the site of purchase. This includes establishments like bars, clubs, and some theatres.

  2. Malt-Off-Sale: The precinct has no on-sale liquor licenses, but at least one license where the holder is allowed to sell wine on-sale and beer/malt off-sale, meaning purchasers are not allowed to consume the product at the site of purchase.

  3. Liquor-Off-Sale: The precinct has no on-sale liquor licenses, but at least one license where the holder is allowed to sell all types of liquor products off-sale. These licenses are primarily held by grocery stores.

  4. No Licenses: The precinct has no liquor licenses issued within its borders.


Figure 3.



I initially predicted that there would be a moderately strong positive relationship between the number of liquor licenses in a precinct and the amount of graffiti in the same precinct, as I believed that more alcohol sales in a precinct, especially on-sale, would lead its alcohol consumers to make more poor choices regarding vandalism. Looking at Figure 3, however, it is clear that little to no linear relationship exists between these two variables. When a Least-Squares linear model is fit to predict graffiti costs from the number of liquor licenses using the data from each of the two hundred precincts, the correlation and coefficient of determination are 0.05128 and 0.2693%, meaning that there is a nearly nonexistent positive correlation between the two variables and that the number of liquor licenses per precinct explains less than one percent of the variation in the total graffiti costs per precinct.
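For a least squares line with a single explanatory variable, the coefficient of determination is the square of the Pearson correlation. The sketch below shows both calculations in plain Python; the data are hypothetical and chosen only to produce a weak correlation, not taken from the precinct data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical per-precinct values (illustrative only).
licenses = [0, 1, 2, 3, 4, 5]
graffiti_cost = [120.0, 80.0, 150.0, 90.0, 130.0, 110.0]

r = pearson_r(licenses, graffiti_cost)
r_squared = r ** 2  # with one predictor, R^2 of the least squares line equals r^2
```

Because squaring a small correlation makes it much smaller, a correlation near 0.05 corresponds to a model that explains only a fraction of a percent of the variation.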

These findings also hold when looking at just the On-Sale precincts, which have a correlation of only 0.08912. Since Malt-Off-Sale and Liquor-Off-Sale precincts only take on two and one x-axis values, respectively, it is difficult to conclude whether or not these precincts exhibit a relationship between the two variables. Given the previously mentioned results, however, it is reasonable to expect that these categories also show little correlation between number of liquor licenses and total graffiti costs.


Question #5: Can certain characteristics of a precinct be utilized within a Least Squares Regression model to predict the number of annual crimes in the same area?

St. Paul's Open Information data sets provide a number of characteristics for each precinct, including some already mentioned, such as the number of traffic stops and liquor licenses, as well as others, such as the number of bike paths, vacant buildings, and crashes. I will explore whether or not these variables, known as the explanatory variables, can be used within the context of a Least-Squares regression model to predict the annual number of crime incidences for a given precinct, known as the response variable. Based on the model's in-sample performance, I will either recommend or not recommend the model in its current state to be used by the St. Paul police department and will recommend further steps that can be taken to improve the model's performance.

Variable Exploration

Before developing a linear model, it is helpful to investigate the correlations between each of the previously mentioned variables and the number of annual crimes per precinct. For instance, to determine the correlation between the number of vacant buildings and number of annual crimes, each precinct's crime and vacant building count for 2019 was collected and plotted. A least squares regression line was also fitted to the data.



Figure 4.



Figure 4 shows that there is a moderately strong positive correlation of 0.408 between the number of vacant buildings in a precinct and the annual number of crimes committed in the precinct. This is plausible, as precincts with more vacant buildings may be more likely to have a lower median income, which may push more people to commit crimes. It is also possible that precincts with a high crime rate are more likely to have more vacant buildings, as families and businesses move to avoid the more dangerous conditions. The coefficient of determination of 0.1665 implies that 16.65% of the variation seen in the annual number of crimes per precinct can be explained by the linear model relating annual crime count to vacant building count. These values indicate that the number of vacant buildings in a precinct is useful for understanding the differences in crime counts among precincts, but also that more variables will be needed to further explain these differences.

Figure 5 below presents a correlation matrix illustrating the correlations between each combination of the following variables: Annual Number of Crime Incidences, Annual Number of Crashes, Annual Number of Traffic Stops, Number of Vacant Buildings, Number of Liquor Licenses, and Number of Bikeways. Squares that are darker red indicate that the two variables in question have a stronger positive correlation, while darker blue squares indicate a stronger negative correlation. Data from 2018 was used in these comparisons.


Figure 5.



It is clear from the correlation matrix that four of the five explanatory variables have at least a moderately strong positive correlation with annual crime count, with the only exception being the number of bikeways in the precinct. The variable most highly correlated with annual crime count is the annual number of crashes, with a correlation of 0.7356. The annual number of traffic stops, the number of vacant buildings, and the number of liquor licenses are the second through fourth most strongly correlated variables, with correlations of 0.5789, 0.3802, and 0.2937, respectively. This correlation matrix indicates that many of the variables being analyzed have the potential to be useful when predicting the annual crime count for a precinct.

The fact that four of the five explanatory variables in the correlation matrix exhibit moderately strong positive correlations with annual crime counts, however, does not mean that all four will be useful in predicting annual crime counts. Looking at the other squares in the correlation matrix, many pairs of the four explanatory variables also exhibit moderately strong positive correlations with each other. Notable examples include the annual number of traffic stops and number of vacant buildings, which have a correlation of 0.4289, and the annual number of traffic stops and annual number of crashes, which have a correlation of 0.5138. Since each of the explanatory variables besides the number of bike paths is moderately correlated with at least one other explanatory variable, a variable's correlation with the annual number of crimes may not reflect its own ability to explain variation in the response variable, but rather its correlation with another variable that is actually explaining that variation. As a result, these explanatory variables, while all correlated with the response variable, might be providing the same or similar information about it, meaning only some of them may be useful for prediction.

Model Selection

The data set for this model consisted of two hundred observations, one for each of the precincts during the year 2018. Each precinct's number of crime incidences, traffic stops, crashes, vacant buildings, bike paths, and liquor licenses, all for the year 2018, made up the one response variable and five possible explanatory variables. 2018 was the only year used because it was the only year for which all five of the possible explanatory variables had available data. A process known as Forward-Stepwise Selection was used to determine which of the five variables would be included in the final model. Forward-Stepwise Selection seeks to choose the best linear model for each possible number of variables, with each chosen model containing the variables featured in all of the smaller chosen models. When applied to this data set, Forward-Stepwise Selection chose the following models:


Model Size    Added Variable
1             Annual Number of Crashes
2             Annual Number of Traffic Stops
3             Annual Number of Vacant Buildings
4             Annual Number of Bikeways
5             Annual Number of Liquor Licenses
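The selection procedure can be sketched in plain Python as follows. This is a minimal illustration, not the code used for the analysis: it assumes that each step adds the variable that most reduces the residual sum of squares, it assumes the predictors are not perfectly collinear, and the tiny data set is synthetic.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def rss(columns, y):
    """Residual sum of squares of a least squares fit of y on the columns plus an intercept."""
    design = [[1.0] + [col[r] for col in columns] for r in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, row))) ** 2
               for row, yi in zip(design, y))

def forward_stepwise(predictors, y):
    """Return predictor names in the order forward selection adds them."""
    chosen, remaining = [], list(predictors)
    while remaining:
        best = min(remaining,
                   key=lambda name: rss([predictors[n] for n in chosen + [name]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Synthetic data: y is roughly 10 * crashes + stops plus small noise.
predictors = {
    "crashes": [1, 2, 3, 4, 5, 6, 7, 8],
    "stops": [5, 3, 8, 1, 9, 2, 7, 4],
    "bikeways": [2, 7, 1, 8, 2, 8, 1, 8],
}
y = [15.3, 22.8, 38.1, 40.6, 59.2, 62.1, 76.7, 84.2]

order = forward_stepwise(predictors, y)
```

Each pass refits one candidate model per remaining variable, so five variables require at most 5 + 4 + 3 + 2 + 1 = 15 fits, far fewer than the 31 non-empty subsets an exhaustive search would examine.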



The best of the five models was selected by comparing each model's adjusted coefficient of determination (R2a). R2a differs from the conventional coefficient of determination (R2) by penalizing the model for having too many unnecessary variables. When an additional variable is added to a model, its R2 never decreases and in practice almost always increases, even if the variable has no relationship with the response variable. This means that, when comparing prospective models based solely on R2, the model with the most variables will have the highest coefficient and will be chosen even if some of its variables are unnecessary. R2a, by penalizing models for having a large number of variables, only increases when an added variable improves the fit by more than would be expected by chance. The following graph plots each of the five models' R2 and R2a values.
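The penalty can be made concrete with the standard formula R2a = 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of explanatory variables. The sketch below assumes this standard definition is the one used here, and applies it to the R2 of 0.6161 reported for the five-variable model with two hundred precincts.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 for a least squares model that includes an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values taken from the text: 200 precincts, five explanatory variables, R2 = 0.6161.
five_var = adjusted_r2(0.6161, 200, 5)

# One-variable model: R2 is the squared correlation of 0.7356 between crashes and crimes.
one_var = adjusted_r2(0.7356 ** 2, 200, 1)
```

Under that assumption the formula gives about 0.6062 for the five-variable model and about 0.5388 for the one-variable model, matching the R2a values reported in this section.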



Figure 6.



As expected, as the number of variables in the model increases, its R2 also increases. This is also the case with R2a, as the model with only one variable had an R2a of 0.5388, which climbed to 0.6062 when all five variables were included. This implies that the model with all five variables is the preferred model for predicting the number of annual crimes in a given precinct.

Before moving forward with the chosen five-variable model, I will explore the possibility of using an alternative type of model known as a Principal Component Regression model. Principal Component Regression attempts to find one or more linear combinations of the explanatory variables that can be used in place of the original variables. These linear combinations, also known as principal components, can improve the model's out-of-sample performance by decreasing the number of coefficients that need to be fit, which decreases the likelihood of overfitting to the data. Models with one through five principal components were fitted and their R2 and R2a values calculated. These values are plotted below.



Figure 7.



Based on Figure 7, the best Principal Component regression model is one with five principal components. When a Principal Component regression model has the same number of principal components as the number of original variables, which in this case is five, the Principal Component regression model produces the same predictions as the Least Squares regression model with all of the variables included. This can be seen by the fact that both the Least Squares regression model with five variables and the Principal Component regression model with five principal components have the same R2 and R2a of 0.6161 and 0.6062, respectively.

Since the best models from the two fitting techniques produce the same estimates, the final model chosen will be the Least Squares model with all five explanatory variables included, as it is much simpler to interpret. This model takes on the following form for a given precinct's data:


Annual Crime Count = 41.2699 + 38.0596*(Annual Crash Count) + 0.1248*(Annual Traffic-Stop Count) + 16.1190*(Vacant Building Count) - 1.8681*(Bike Path Count) + 4.6294*(Liquor License Count)
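
To show how the equation is applied, the sketch below wraps it in a small function. The coefficients are copied from the equation above; the dictionary keys and the example precinct are hypothetical.

```python
# Coefficients copied from the fitted equation in the text.
INTERCEPT = 41.2699
COEFFICIENTS = {
    "crashes": 38.0596,
    "traffic_stops": 0.1248,
    "vacant_buildings": 16.1190,
    "bike_paths": -1.8681,
    "liquor_licenses": 4.6294,
}

def predict_crime_count(precinct):
    """Predicted annual crime count for a dict of precinct characteristics."""
    return INTERCEPT + sum(COEFFICIENTS[name] * precinct[name] for name in COEFFICIENTS)

# Hypothetical precinct (illustrative values only).
example = {"crashes": 1, "traffic_stops": 100, "vacant_buildings": 2,
           "bike_paths": 3, "liquor_licenses": 1}
estimate = predict_crime_count(example)
```

Each coefficient is the change in the predicted annual crime count for a one-unit increase in that variable with the others held fixed, so one additional crash raises the prediction by about 38 crimes while one additional traffic stop raises it by about 0.12.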

Model Performance

While a model's performance would normally be quantified by how well it is able to predict out-of-sample observations, I am unable to do this as I only have two hundred data points from a single year of collection, which is already very few to fit a model on, let alone leave some out for out-of-sample validation. As a result, the best we can do to analyze this model's ability to estimate a precinct's annual crime count is to look at its R2 and its residual plot. The final model's R2 is 0.6161, implying that the model is able to explain 61.61% of the variation in annual crime counts for St. Paul's precincts. A residual is the difference between an observation's actual response value and its predicted value under the given model. Each of the two hundred precincts' 2018 crime counts was plotted on the x-axis, with each precinct's residual plotted on the y-axis.



Figure 8.



Two characteristics of the final model's residual plot indicate that the final model is not a good fit for this data set, despite the model having a high R2 value. The first characteristic is that the residuals show a positive linear trend as the x-axis value increases. If the final model was a good fit for this data set, we would expect the residuals to display no distinct patterns and to have equal spread above and below the x-axis at each x-axis value. The fact that the residuals tend to increase as the precinct's 2018 crime count increases implies that the final model is biased in its predictions and that one or more variables that would help to explain the variation in the response variable are being left out, hurting the model's predictive capabilities. The second characteristic is that the spread of the residuals at a given x-axis value increases as the x-axis value increases. One of the assumptions made when fitting a Least Squares regression model is that the distribution of response values at each combination of explanatory variable values has the same variance. This assumption is known as homoscedasticity. The fact that the spread of residuals seen at different crime count values is not constant across the x-axis indicates that the assumption of homoscedasticity does not hold, meaning that the model exhibits heteroscedasticity and implying that this Least Squares linear regression model is not a good fit for the data.
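
When reading a plot like Figure 8, it is worth checking which quantity is on the x-axis. For any least squares fit with an intercept, residuals are uncorrelated with the fitted values but positively correlated with the observed values, with a correlation equal to the square root of 1 - R2. The sketch below illustrates this on synthetic data with a single predictor; it is not the St. Paul data, and all names are my own.

```python
from math import sqrt

def fit_line(xs, ys):
    """Least squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

def corr(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    num = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    den = sqrt(sum((p - mean_a) ** 2 for p in a) * sum((q - mean_b) ** 2 for q in b))
    return num / den

# Synthetic data (not the St. Paul data): a noisy linear relationship.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3.0, 9.0, 6.0, 14.0, 10.0, 19.0, 13.0, 22.0]

intercept, slope = fit_line(xs, ys)
fitted = [intercept + slope * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

r2 = corr(xs, ys) ** 2
resid_vs_fitted = corr(residuals, fitted)  # essentially zero for a least squares fit
resid_vs_observed = corr(residuals, ys)    # equals sqrt(1 - R^2)
```

For an R2 of about 0.62, that built-in correlation is roughly 0.62, so some upward trend is expected whenever residuals are plotted against observed crime counts. Plotting residuals against fitted values instead would help separate this effect from genuine bias, and the widening spread of the residuals can be assessed on that plot as well.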

Model Recommendations

While methods exist to alleviate the issues brought on by heteroscedasticity, such as variable transformations and weighting observations by different amounts when fitting the model, the bias present in the model indicates that, in its current form, it is not a viable method by which to estimate a precinct's annual number of crime incidences. I do not recommend that the St. Paul police department use this model to better understand the crime landscape of its city.

The following are three steps that could be taken to improve this model: