Syllabus
Welcome to Intro to Big Data Systems! We'll deploy and use distributed systems to store and analyze large datasets. Unstructured and structured approaches to storage will be covered. Analysis will involve learning new query languages, processing streaming data, and training machine learning models. Systems covered include Docker, PyTorch, HDFS, Spark, Cassandra, Kafka, and more.
Revisions to Syllabus
- none yet
Learning Objectives
- Deploy distributed systems for data storage and analytics
- Demonstrate competencies with tools and processes necessary for loading data into distributed storage systems
- Write programs that use distributed platforms to efficiently analyze large datasets
- Produce meaning from large datasets by training machine learning models in parallel or on distributed systems
- Measure resource usage and overall cost of running distributed programs
- Optimize distributed analytics programs to reduce resource consumption and program runtime
- Demonstrate competencies with cloud services designed to store or analyze large datasets
Lecture
We meet 3 times a week -- see the lecture schedule here.
I'll ask questions during lecture via TopHat. Though in-person attendance is not required, you can earn extra credit by answering these correctly. Answering TopHat questions remotely is not permitted.
Readings
We'll be learning about many different big data systems, so no textbook closely corresponds to the lecture content. Thus, the lectures and the notes you take during them will be your primary resource.
We will, however, have recommended (though optional) readings for many systems. We'll select from O'Reilly textbooks because you can read them for free online via the Madison Public Library. You just need to do the following:
- get a library card (free)
- sign into the O'Reilly collection with your card number
- search for the assigned book
Here are some of the main texts we'll reference this semester:
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems (1st edition), by Martin Kleppmann
- Learning Spark: Lightning-Fast Data Analytics (2nd edition), by Jules Damji et al.
- Cassandra: The Definitive Guide: Distributed Data at Web Scale (revised 3rd edition), by Jeff Carpenter et al.
- Machine Learning with PyTorch and Scikit-Learn: Develop machine learning and deep learning models with Python, by Sebastian Raschka et al.
Sometimes we may post lecture notes too.
Communication
We message the class regularly via Canvas announcements. We recommend setting the "Announcement" option in your Canvas settings to "Notify immediately" so that you don't miss something important.
See the help page for details about how to contact us.
We have various forms for you to leave (optionally anonymous) feedback, report lab attendance, and thank TAs.
Course Components
Grading breakdown
- 3 Exams (16% each, 48% total)
- 12 quizzes (12% total)
- 8 programming projects (5% each, 40% total)
Grade thresholds will be as follows:
- A >= 94
- AB >= 90
- B >= 82
- BC >= 72
- C >= 65
- D >= 60
- F < 60
There will be opportunities to earn a maximum of 4% extra credit (for TopHat, Instructor Endorsements on Piazza, etc.).
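As a rough illustration of how the thresholds above map a final percentage to a letter grade, here is a minimal Python sketch. It is not the official grade calculation. The function name `letter_grade` is hypothetical, and the sketch assumes that extra credit is added directly to the final percentage (capped at 4 points) and that no rounding is applied; the syllabus does not specify either detail.

```python
# Thresholds copied from the list above, highest first.
THRESHOLDS = [(94, "A"), (90, "AB"), (82, "B"), (72, "BC"), (65, "C"), (60, "D")]

# Maximum extra credit stated in the syllabus (in percentage points).
MAX_EXTRA_CREDIT = 4.0


def letter_grade(score, extra_credit=0.0):
    """Return the letter grade for a final percentage.

    ASSUMPTIONS (not stated in the syllabus): extra credit is simply added
    to the score after capping it at 4 points, and no rounding is applied.
    """
    total = score + min(extra_credit, MAX_EXTRA_CREDIT)
    for cutoff, letter in THRESHOLDS:
        if total >= cutoff:
            return letter
    return "F"


if __name__ == "__main__":
    print(letter_grade(91.5))      # falls in the AB band (>= 90, < 94)
    print(letter_grade(88, 10))    # extra credit is capped at 4, so 92
```

Under these assumptions, a score just below a cutoff stays in the lower band unless extra credit lifts it over the line.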
Exams
These will be multiple choice and taken in person. Exams 1+2 will be during class, and exam 3 will be during finals week. All exams are cumulative.
If you must miss an exam (e.g., due to illness), the others will receive greater weight, without scaling. For example, if you only take exams 1 and 3, then those will each be worth 24% (assuming you were explicitly excused from exam 2 by the instructor). If you are excused from both exams 1 and 2, then exam 3 will be worth 48%. Exam 3 cannot be skipped (only rescheduled to a later date, if necessary).
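The re-weighting examples above are consistent with splitting the 48% exam share evenly across the exams actually taken. A minimal Python sketch of that arithmetic follows. The function name `exam_weights` is hypothetical, and the even split is an assumption inferred from the two examples rather than an official formula.

```python
# Total weight of all exams, from the grading breakdown (3 exams x 16%).
EXAM_TOTAL = 48.0


def exam_weights(taken):
    """Return {exam number: weight in percent} for the exams taken.

    ASSUMPTION: the 48% exam share is divided evenly among the exams a
    student actually takes. Exam 3 must be among them, since the syllabus
    says it cannot be skipped.
    """
    taken = set(taken)
    if not taken <= {1, 2, 3}:
        raise ValueError("exam numbers must be 1, 2, or 3")
    if 3 not in taken:
        raise ValueError("exam 3 cannot be skipped")
    return {exam: EXAM_TOTAL / len(taken) for exam in sorted(taken)}


if __name__ == "__main__":
    print(exam_weights({1, 2, 3}))  # all exams taken
    print(exam_weights({1, 3}))     # excused from exam 2
    print(exam_weights({3}))        # excused from exams 1 and 2
```

With all three exams taken, each is worth 16%; with exam 2 excused, exams 1 and 3 are worth 24% each; with only exam 3 taken, it is worth the full 48%.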
Quizzes
There will be a short Canvas quiz due at the end of most Wednesdays. Make sure you know the rules regarding what is allowed and what is not.
Projects
See project policies here.
Academic Misconduct
Project Policies
Be sure to read and understand the full project collaboration policies here.
TopHat Policies
TopHat questions are intended for in-class participants. Students who answer any TopHat question remotely are not eligible for any extra credit for the course. We might check for this by passing around a sign-up sheet following a TopHat question.
Piazza Policies
Do not post project code snippets that are >5 lines long.
Exam Policies
- students who have not taken an exam yet may study/prep with other students who have not taken it yet; it is fine to collaborate on creating note sheets (when allowed)
- students who have taken an exam may not discuss/share with a student who has not taken the exam yet; unless you have first-hand knowledge that another student has taken an exam, assume they have not taken it
- you may not sit adjacent to anybody you have met or know (even slightly)
Quiz Policies
Allowed
- however much time you need
- discussing answers with classmates who are taking the quiz at the same time
- referencing texts, notes, or provided course materials
- searching online for general information
- running code
NOT allowed
- taking it more than once
- discussing answers with anybody outside of the course
- discussing with classmates who have already completed the quiz when you have not completed it yourself yet
- posting anything online about the quizzes
- using such material potentially posted by other students who broke the preceding rule
- getting TA/instructor help on quiz questions prior to the quiz deadline
Recommendation Letters
Earning a recommendation letter is much harder than earning an A in this course. At a minimum, I'll want to see you doing something complex and interesting beyond the assignments. For a typical letter, I'll have collaborated with a student on some project for multiple months, with many iterations of feedback.
Most grad schools require recommenders to fill out long forms rating students on various abilities (see an example below). Make sure that if you're asking me, I would be able to fill out such a form without needing to put "I don't know" as my answer to many of the questions.