Silvio Amir | Natural Language Processing


Mon & Thu 11.45am @ Snell Engineering Center 108

Instructor	Contact		Office Hours
Silvio Amir	email	zoom	Mon 4pm - 6pm

Teaching Assistants
Pavan Guduru	email	zoom	Tue 10a-12p
Aadesh Mallya	email	zoom	Wed 4-6p
Pratyusha Parashar	email	zoom	Thu 3-5p
Harshkumar Modi	email	zoom	Thu 5-7p
Mili Parikh	email	zoom	Fri 10a-12p
Smit Shah	email	zoom	Fri 4-6p

Announcements/Discussions @ Piazza

Course Description

The widespread adoption of digital information systems and the Web led to a deluge of text data from a variety of genres, languages, and domains (e.g., news articles, tweets, clinical notes in Electronic Health Records). How can we use computers to help us sift through and make sense of all this data? What can we learn from analyzing natural language data at scale?

Natural Language Processing is a subfield of Artificial Intelligence that uses methods from Computer Science, Computational Linguistics, Cognitive Science, Statistics, and Machine Learning to give computers the ability to automatically analyze, categorize, understand, and generate natural language. This is challenging because, unlike other kinds of language (say, programming languages), natural language is unstructured and often ambiguous, nuanced, and subjective. In this course we will learn about:

the linguistic phenomena that make NLP hard for computers to approach
the main NLP problems and tasks, and strategies to address them
the role of data and machine learning in NLP systems
the ethical considerations and potentials for bias in NLP systems
how to formulate and evaluate NLP solutions to address real-world problems

The course will be very much hands-on with a great emphasis on methods, meaning that we will spend most of our time discussing and implementing (often from scratch) typical approaches to solve key NLP tasks. Recent advances in neural networks and deep learning models have led to remarkable breakthroughs in NLP. However, this is intended to be an introductory course and thus we will not be jumping straight into these models. Instead, we will first become familiar with classical statistical methods, which are both important baselines for NLP tasks and the foundation for more sophisticated approaches. Then, in the second half of the semester, we will dive deep into neural networks, starting from simple Multilayer Perceptrons and building our way up to state-of-the-art models based on pre-trained Transformers.

Prerequisites

This class has no official prerequisites. However, modern NLP relies heavily on statistical methods, machine learning, and deep learning. Therefore, students must be comfortable with basic mathematical concepts from Linear Algebra, Probability, and Calculus. We will briefly review some of the main concepts needed for the models and algorithms that we will cover in the class. However, this will be a review and NOT a thorough and rigorous exposition of these subjects. Students are thus encouraged to proactively fill any gaps in their knowledge.

We will make extensive use of python3, scientific computing libraries (e.g., numpy, scipy, matplotlib), and jupyter notebooks. If you have less experience working with python and notebooks, we highly encourage you to make time to come to office hours in the first few weeks of the course.

Readings


Main Text: Speech and Language Processing, 3rd Edition, by Dan Jurafsky and James H. Martin. Occasionally, we will supplement this text with readings from research papers and other freely available sources.

Dive into Deep Learning, by Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola

Mathematics for Machine Learning, by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong

Grading and Assignments

Homework (50%)
Research paper presentation (CS6120 students only; this will count as an additional homework)
1. Reading and presenting research papers in small groups
2. Reviewing presentations from your peers
Quizzes (20%)
Final project (20%)
1. proposal
2. code & write-up
3. video presentation
4. presentation review
5. reflection
Class participation (10%)
1. Asking and answering questions in class
2. Completing lecture notebooks

Homework is due Fridays at 11pm and quizzes are due Sundays at 11pm. Notebooks for a given week can be submitted until Sunday 11pm.

Notes: Homework assignments have a small number of additional questions for CS6120 students (which count as extra credit for CS4120). You can think of quizzes as mini take-home exams. The questions will typically focus on material covered in class that week, and may include a small number of questions based on readings for the upcoming week. You can make a request to retake one of the quizzes.

Late Policy

All homework should be turned in on time whenever possible. However, you can turn in your work up to 2 days late (48 hours from the deadline) with a penalty: 25% for the first day and 50% for the second day. Once a semester, you may turn in your homework up to 24 hours late without penalty or explanation; no other extensions will be granted. This policy only applies to homework; quizzes and the final project may not be completed after the deadline. If you miss a quiz, make sure to attempt the extra credit on the homework.

Collaboration Policy

We encourage you to collaborate with your classmates, but remember that collaboration is different from working in pairs or as a group. A few key points to remember:

Strategies: You may talk with your classmates about general strategies but you may not talk about specific solutions.
Explaining concepts: You may talk with your classmates about how certain techniques work in general but not how to write any part (or sub-part) of the solution needed for the homework.
Online solutions: You are expected to use the internet as a place for online resources, such as documentation, not as a place to get solutions to your assignments. This includes posting to sources like StackExchange, Reddit, Chegg, etc.
Plagiarism: Assignments and code that you turn in should be written entirely on your own. You should always consult the course instructional staff if you need extra help. A good rule of thumb: do not share your assignments and do not look at your classmates' assignments (this makes it very hard to come up with your own solution afterwards); do not write code together unless the assignment explicitly states that you may work in pairs (this includes explaining your solutions).

Collaboration Policy violations will result in a 0 on the assignment in question. The university’s academic integrity policy discusses actions regarded as violations and consequences for students.

Letter Grades

A		95% - 100%
A-		90% - 94%
B+		87% - 89%
B		83% - 86%
B-		80% - 82%
C+		77% - 79%
C		73% - 76%
C-		70% - 72%
D+		67% - 69%
D		63% - 66%
D-		60% - 62%
F		< 60%

Schedule

This is a tentative schedule and is subject to change.

Date	Lecture	Readings	Due
9/8	NO CLASS
9/12	1. Syllabus, Intro to NLP		Register on Piazza
9/15	2. Text Normalization	SLP: 2.2, 2.4, 2.5
9/18			Quiz 1: syllabus, tokenization
9/19	3. Intro to Statistical Learning: Math Refresher	D2L: 2.3, 2.4, 2.6, 19.1-19.8 MML: 2, 3, 5, 6
9/22	4. Intro to Statistical Learning II: Models, Data, Evaluation	SLP: 4.7, 4.8, 4.10 D2L: 1.1-1.4 MML: 8
9/23			Homework 1.1: How many words do you know? Homework 1.2: Test a chatbot
9/26	5. N-gram Language Models	SLP: 3.1-3.4
9/29	6. Text Classification and Naive Bayes	SLP: 4.1-4.6 D2L: 19.9
10/2			Quiz 2 - Vocabularies, Normalization, Language Models, Text Classification
10/3	7. Linear Models I	SLP: 5.1, 5.2, 5.3 D2L: 3.1, 4.1
10/6	8. Linear Models II	SLP: 5.4, 5.5, 5.6 D2L: 3.6, 3.7
10/7			Homework 2: Language Models
10/10	NO CLASS: Indigenous Peoples' Day
10/13	9. Vector Space Semantics	SLP: 6.1-6.6
10/14			Reading Research Papers
10/16			Quiz 3 - Naive Bayes
10/17	10. Word Embeddings	SLP: 6.8-6.11 D2L: 15.1-15.7
10/20	11. Sequence Labeling and Hidden Markov Models	SLP: 8.1, 8.2, 8.3, 8.6
10/23			~~Quiz 4 - Logistic Regression and Word Embeddings~~
10/24	12. Guest Lecture: Ethics I
10/27	13. Guest Lecture: Ethics II
10/28			Homework 3: Text Classification
10/31	14. Conditional Random Fields and Viterbi Algorithm	SLP: 8.4, 8.5
11/3	15. Multilayer Perceptron Neural Language Models	SLP: 7.1, 7.2, 7.3, 7.5 D2L: 5.1, 5.2
11/6			Quiz 4 - Logistic Regression and Word Embeddings
11/7	16. Training Neural Networks	SLP: 7.4 D2L: 5.3
11/10	17. Recurrent Neural Networks I	SLP: 9.1, 9.2 D2L: 9
11/13			Research Paper Presentations (CS6120 only) Quiz 5: POS, HMMs and Neural Networks
11/14	18. Recurrent Neural Networks II	SLP: 9.3, 11.1, 11.2, 11.4, 11.8 D2L: 10, 11.1-11.5
11/17	19. Transformers	SLP: 9.4, 11.5, 11.6 D2L: 11.6-11.9
11/18			Final Project proposal
11/20			Homework 4: Neural Networks and Word Embeddings Value Sensitive Design (Optional for extra-credit)
11/21	20. Pretrained Language Models	D2L: 15.8-15.10
11/24	NO CLASS: Thanksgiving
11/28	21. Pretrained Language Models II	D2L: 16
12/1	22. Applications I
12/5	23. Applications II
12/9			Project Due