Welcome to Data Programming II! In this course, we will learn object-oriented programming to create tree and graph data structures to represent hierarchical data and implement algorithms for efficiently searching these structures.

We'll often create our own datasets, using techniques like logging, benchmarking, web scraping, and A/B testing.

In the last third of the semester we'll explore some basic machine learning techniques, including regression, classification, clustering, and decomposition.

Revisions to Syllabus


We message the class regularly via Canvas announcements. We recommend updating your Canvas settings so that the "Announcement" option is "Notify immediately" so that you don't miss something important.

See the help page for details about how to contact us.

We have various forms for you to leave (optionally anonymous) feedback, report lab attendance, and thank TAs/mentors.


We meet 3 times a week -- see the lecture schedule here. In the past, I've always posted lecture recordings online, but I'm not making any guarantees. The problem I've seen recently is that people plan to watch online but then procrastinate and fall behind. If attendance is healthy and it feels like people are keeping up, I'll usually post recordings.


We'll post a weekly lab document. You can work through it individually, or with your assigned study group. TAs and mentors will walk around to answer questions that might arise.

If you have extra time at lab after completing the lab doc, you may leave early, or (probably better) work with your assigned study group on the project or quizzes.

We'll be taking lab attendance for participation with a combination of TopHat and passing around a sheet.


Grading breakdown

At the end, you'll have a score out of 100, and I'll set a curve based on where there are obvious breaks in the distribution and on the difficulty this semester (there is no cap on the number of A's, so you're not really competing with your peers).


Submission: Everybody will individually upload either a .py file or a .ipynb (as specified) file for each project with the submission tool.

Collaboration: Even though everybody will make their own individual submission, every project will have (1) a group part to be optionally done with your assigned study group and (2) an individual part. For the group part, any form of help from anybody in your group is allowed (even looking at each other's code); I recommend you find times for everybody in the group to work at the same time so you can help each other through coding difficulties in this part. You're also welcome to do the "group" part individually, or with a subset of your assigned study group. For the individual part, you may only receive help from course staff (instructor/TAs/mentors); you may not discuss this part with anybody else (in the class or otherwise) or get help from them.

Late Policy: If you submit a version of your project on time that is scoring at least 50 percent with the tester, you may have up to 3 extra days to complete the project (no penalty and no explanation required). If you have other special circumstances (e.g., illness, family emergency), email me asking for accommodations. Before making such a request, submit your current progress and link to your code (I'm more generous if I see somebody started early and made a good effort).

Code Review: A TA will give you detailed comments on specific parts of your assignment. This feedback process is called a "code review", and is a common requirement in industry before a programmer is allowed to add her code changes to the main codebase. Read your code reviews carefully; even if you receive 100% on your work, we'll often give you tips to save effort in the future.

Project Grading: Grades will be largely based on automatic tests that we run. We'll share the tests with you before the due date, so you should rarely be too surprised by your grade. Though it shouldn't be common, we may deduct points for serious hardcoding, not following directions, or other issues. Some bugs (called non-deterministic bugs) don't show up every time code is run -- if you have such an issue, the grade the tester gives when we run it may differ from what you expected based on your own runs. Finally, our tests aren't very good at evaluating whether plots and other visualizations look how they should (a human usually needs to evaluate that).
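
To make the idea of a non-deterministic bug concrete, here's a minimal sketch. The function names and scores are made up for illustration, and random.shuffle is just a stand-in for anything that hands you items in an arbitrary order (for example, iterating over a set of strings can give a different order on different runs). The first function breaks ties differently from run to run; the second always gives the same answer.

```python
import random


def pick_top_flaky(scores):
    """Return a name with the highest score; ties are broken arbitrarily."""
    names = list(scores)
    random.shuffle(names)  # stand-in for any source of arbitrary ordering
    # max() returns the first maximal item it sees, so the winner of a tie
    # depends on the (random) order of names.
    return max(names, key=lambda name: scores[name])


def pick_top_stable(scores):
    """Return the highest-scoring name; ties are broken alphabetically."""
    # Sorting key: highest score first, then name in alphabetical order.
    return min(scores, key=lambda name: (-scores[name], name))


scores = {"ann": 90, "bob": 90, "cy": 70}  # hypothetical data with a tie
print(pick_top_flaky(scores))   # may differ from run to run
print(pick_top_stable(scores))  # same answer every run
```

If a test expected one particular name from the flaky version, it could pass on your machine and fail when we run it, even though the code never changed.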

Project Grading: The autograder will be run periodically during the 2 days prior to a project deadline (from Monday night if the deadline is on Wednesday, and so on). Because of this, we expect you to try submitting your project early and make sure nothing crashes. However, this should not be a substitute for running the tests locally. You should only try submitting once you pass the tests locally.

Allowed Packages: Anything that comes pre-installed with Python may be used. Additionally, you may install and use the following if they're useful: jupyter, pandas, numpy, matplotlib, requests, beautifulsoup4, statistics, recordclass, sklearn, haversine, gitpython, graphviz, pylint, lxml, flask, bs4, html5lib, geopandas, shapely, descartes, click, netaddr, torch==1.4.0+cpu, torchvision==0.5.0+cpu. Using unapproved packages will result in a score of zero when submitted for grading because the autograder won't be able to run your code without those packages.


There will be a short Canvas quiz due at the end of most Wednesdays. Make sure you know the rules regarding what is allowed and what is not.

NOT allowed

Midterms and Final

These will be multiple choice exams taken in person. The midterms will be in class, the final will be at a different location (to be announced).


Some of the things that count towards participation:


We'll sometimes assign readings from the following sources (all free):


Yeah, of course you shouldn't cheat, but what is cheating? The most common form of academic misconduct in these classes involves copying/sharing code for programming projects. Here's an overview of what you can and cannot do:


NOT Acceptable

Citing Code: you can copy small snippets of code from stackoverflow (and other online references) if you cite them. For example, suppose I need to write some code that gets the median number from a list of numbers. I might search for "how to get the median of a list in python" and find a solution at

I could (legitimately) post code from that page in my code, as long as it has a comment as follows:

    # copied/adapted from
    def median(lst):
      sortedLst = sorted(lst)
      lstLen = len(lst)
      index = (lstLen - 1) // 2

      if (lstLen % 2):
        return sortedLst[index]
      else:
        return (sortedLst[index] + sortedLst[index + 1])/2.0

In contrast, copying from a nearly complete project (that accomplishes what you're trying to do for your project) is not OK. When in doubt, ask us! The best way to stay out of trouble is to be completely transparent about what you're doing.

Similarity Detection: Of course, with 400+ students, it's hard for a human TA to notice similar code across two submissions. Thus, we use automated tools to look for similarities across submissions. Such similarity detection is an active area of computer science research, and the result is tools that detect code copying even when students methodically rename all variables and shuffle the order of their code. We take cheating detection seriously to make the course fair to students who put in the honest effort.
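
As a rough illustration of why renaming variables doesn't hide copied code, here's a toy sketch. It is not a description of the course's actual tools, and the class and function names are hypothetical. It parses each program with Python's built-in ast module, replaces every identifier with a placeholder, and compares what's left. Real detectors are far more sophisticated and also cope with reordered code, which this sketch does not.

```python
import ast


class _Normalize(ast.NodeTransformer):
    """Replace every identifier with the placeholder "_"."""

    def visit_Name(self, node):
        # Variable and function references all become "_".
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def visit_arg(self, node):
        # Function parameters become "_" too.
        node.arg = "_"
        node.annotation = None
        return node

    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)  # also normalize the arguments and body
        return node


def fingerprint(source):
    """Return a string describing the code's structure, ignoring names."""
    tree = _Normalize().visit(ast.parse(source))
    return ast.dump(tree)


original = """
def median(lst):
    ordered = sorted(lst)
    return ordered[(len(lst) - 1) // 2]
"""

renamed = """
def middle(values):
    s = sorted(values)
    return s[(len(values) - 1) // 2]
"""

different = """
def total(lst):
    return sum(lst) / 2
"""

print(fingerprint(original) == fingerprint(renamed))    # same structure
print(fingerprint(original) == fingerprint(different))  # different structure
```

The first two snippets differ in every name, but once the names are ignored they have exactly the same structure, so their fingerprints match.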

Recommendation Letters

Earning a recommendation letter is much harder than earning an A in this course. At a minimum, I'll want to see you doing something complex and interesting beyond the assignments. For a typical letter, I'll have collaborated with a student on some project for multiple months, with many iterations of feedback.

Most grad schools require recommenders to fill out long forms rating students on various abilities (see an example below). If you're asking me, make sure I would be able to fill out such a form without needing to put "I don't know" as my answer to many of the questions.