Natural Language Processing

CS4120/6120 Fall 2022



Mon & Thu 11:45am @ Snell Engineering Center 108

Instructor   Contact   Office Hours
Silvio Amir   email, zoom   Mon 4pm - 6pm

Teaching Assistants   Contact   Office Hours
Pavan Guduru   email, zoom   Tue 10am - 12pm
Aadesh Mallya   email, zoom   Wed 4pm - 6pm
Pratyusha Parashar   email, zoom   Thu 3pm - 5pm
Harshkumar Modi   email, zoom   Thu 5pm - 7pm
Mili Parikh   email, zoom   Fri 10am - 12pm
Smit Shah   email, zoom   Fri 4pm - 6pm


Announcements/Discussions @ Piazza


Course Description

The widespread adoption of digital information systems and the Web led to a deluge of text data from a variety of genres, languages, and domains (e.g., news articles, tweets, clinical notes in Electronic Health Records). How can we use computers to help us sift through and make sense of all this data? What can we learn from analyzing natural language data at scale?

Natural Language Processing is a subfield of Artificial Intelligence that uses methods from Computer Science, Computational Linguistics, Cognitive Science, Statistics, and Machine Learning to give computers the ability to automatically analyze, categorize, understand, and generate natural language. This is challenging because, unlike other kinds of language (say, programming languages), natural language is unstructured and often ambiguous, nuanced, and subjective. In this course we will learn about:

  • the linguistic phenomena that make NLP hard for computers to approach
  • the main NLP problems and tasks, and strategies to address them
  • the role of data and machine learning in NLP systems
  • the ethical considerations and potentials for bias in NLP systems
  • how to formulate and evaluate NLP solutions to address real-world problems

The course will be very hands-on, with a strong emphasis on methods: we will spend most of our time discussing and implementing (often from scratch) typical approaches to key NLP tasks. Recent advances in neural networks and deep learning have led to remarkable breakthroughs in NLP. However, this is intended to be an introductory course, so we will not jump straight into these models. Instead, we will first become familiar with classical statistical methods, which are both important baselines for NLP tasks and the foundation for more sophisticated approaches. Then, in the second half of the semester, we will dive deep into neural networks, starting from simple Multilayer Perceptrons and building up to state-of-the-art models based on pre-trained Transformers.

Prerequisites

This class has no official prerequisites. However, modern NLP relies heavily on statistical methods, machine learning, and deep learning, so students must be comfortable with basic mathematical concepts from Linear Algebra, Probability, and Calculus. We will briefly review some of the main concepts needed for the models and algorithms covered in class, but this will be a review and NOT a thorough and rigorous exposition of these subjects. Students are thus encouraged to proactively fill any gaps in their knowledge.

We will make extensive use of Python 3, scientific computing libraries (e.g., NumPy, SciPy, Matplotlib), and Jupyter notebooks. If you have less experience working with Python and notebooks, we highly encourage you to make time to come to office hours in the first few weeks of the course.

Readings

   
Main Text
Speech and Language Processing 3rd Edition
Dan Jurafsky and James H. Martin

Occasionally, we will supplement this text with readings from research papers and other freely available sources.
   
Dive into Deep Learning
Aston Zhang, Zachary C. Lipton, Mu Li and Alexander J. Smola
   
Mathematics for Machine Learning
Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong

Grading and Assignments

  • Homework (50%)
  • Research paper presentation (CS6120 students only; this will count as an additional homework)
    1. Reading and presenting research papers in small groups
    2. Reviewing presentations from your peers
  • Quizzes (20%)
  • Final project (20%)
    1. proposal
    2. code & write-up
    3. video presentation
    4. presentation review
    5. reflection
  • Class participation (10%)
    1. Asking and answering questions in class
    2. Completing lecture notebooks

Homework is due Fridays at 11pm and quizzes are due Sundays at 11pm. Notebooks for a given week can be submitted until Sunday 11pm.

Notes: Homework assignments have a small number of additional questions for CS6120 students (which count as extra credit for CS4120 students). You can think of quizzes as mini take-home exams. The questions will typically focus on material covered in class that week, and may include a small number of questions based on readings for the upcoming week. You can make a request to retake one of the quizzes.

Late Policy

All homework should be turned in on time whenever possible. However, you can turn in your work up to 2 days late (48 hours from the deadline) with a penalty: 25% for the first day and 50% for the second day. Once a semester, you may turn in your homework up to 24 hours late without penalty or explanation; no other extensions will be granted. This policy only applies to homework; quizzes and the final project may not be completed after the deadline. If you miss a quiz, make sure to attempt the extra credit on the homework.
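
As a rough illustration of the arithmetic in this policy, here is a minimal Python sketch. It rests on assumptions the policy does not state: the penalty is taken as a fraction of the earned score, work submitted more than 48 hours late earns no credit, and the once-per-semester free late day only covers the first 24 hours. The function names `late_penalty` and `adjusted_score` are hypothetical; the course staff's actual grading procedure may differ.

```python
def late_penalty(hours_late: float, use_free_late_day: bool = False) -> float:
    """Return the fraction of the earned score deducted for lateness."""
    if hours_late <= 0:
        return 0.0  # on time: no penalty
    if hours_late > 48:
        return 1.0  # assumption: work past the 48-hour window earns no credit
    if hours_late <= 24:
        # first day: free if the once-per-semester late day is used
        return 0.0 if use_free_late_day else 0.25
    return 0.50  # second day (assumption: the free late day does not apply here)


def adjusted_score(raw_score: float, hours_late: float,
                   use_free_late_day: bool = False) -> float:
    """Apply the late penalty to a raw homework score (assumed multiplicative)."""
    return raw_score * (1.0 - late_penalty(hours_late, use_free_late_day))
```

For example, under these assumptions `adjusted_score(80, 10)` would give 60.0, while `adjusted_score(80, 10, use_free_late_day=True)` would leave the score at 80.0.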

Collaboration Policy

We encourage you to collaborate with your classmates, but remember that collaboration is different from working in pairs or as a group. A few key points to remember:

  • Strategies: You may talk with your classmates about general strategies but you may not talk about specific solutions.
  • Explaining concepts: You may talk with your classmates about how certain techniques work in general but not how to write any part (or sub-part) of the solution needed for the homework.
  • Online solutions: You are expected to use the internet as a place for online resources, such as documentation, not as a place to get solutions to your assignments. This includes posting to sources like StackExchange, Reddit, Chegg, etc.
  • Plagiarism: Assignments and code that you turn in should be written entirely on your own. You should always consult the course instructional staff if you need extra help. A good rule of thumb: do not share your assignments and do not look at your classmates' assignments (this makes it very hard to come up with your own solution afterwards); do not write code together unless the assignment explicitly states that you may work in pairs (this includes explaining your solutions).

Collaboration Policy violations will result in a 0 on the assignment in question. The university’s academic integrity policy discusses actions regarded as violations and consequences for students.

Letter Grades
A   95% - 100%
A-   90% - 94%
B+   87% - 89%
B   83% - 86%
B-   80% - 82%
C+   77% - 79%
C   73% - 76%
C-   70% - 72%
D+   67% - 69%
D   63% - 66%
D-   60% - 62%
F   < 60%
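
To make the grading arithmetic concrete, the following Python sketch combines the component weights from the Grading and Assignments section with the letter-grade table above. It is only an illustration under assumptions: each component score is a percentage in [0, 100], the lower end of each range is treated as a cutoff with no rounding (so 94.5% would map to A-), and the names `WEIGHTS`, `CUTOFFS`, `course_percentage`, and `letter_grade` are hypothetical.

```python
# Component weights as listed under "Grading and Assignments".
WEIGHTS = {
    "homework": 0.50,
    "quizzes": 0.20,
    "final_project": 0.20,
    "participation": 0.10,
}

# Lower bound of each letter-grade range, highest first.
# Assumption: no rounding is applied before comparing against a cutoff.
CUTOFFS = [
    (95, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]


def course_percentage(scores: dict) -> float:
    """Weighted average of component scores, each given as a percentage."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(percentage: float) -> str:
    """Map an overall percentage to a letter using the cutoffs above."""
    for cutoff, letter in CUTOFFS:
        if percentage >= cutoff:
            return letter
    return "F"


example = {"homework": 90, "quizzes": 80, "final_project": 85, "participation": 100}
overall = course_percentage(example)  # 45 + 16 + 17 + 10
print(round(overall, 2), letter_grade(overall))
```

How fractional percentages between ranges (for example, 89.5%) are actually handled is up to the instructor; the sketch simply treats each lower bound as a hard threshold.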


Schedule

This is a tentative schedule and is subject to change.

Date   Lecture   Readings   Due
9/8   NO CLASS   -   -
9/12   1. Syllabus, Intro to NLP   -   Register on Piazza
9/15   2. Text Normalization   SLP: 2.2, 2.4, 2.5   -
9/18   -   -   Quiz 1: syllabus, tokenization
9/19   3. Intro to Statistical Learning: Math Refresher   D2L: 2.3, 2.4, 2.6, 19.1-19.8; MML: 2, 3, 5, 6   -
9/22   4. Intro to Statistical Learning II: Models, Data, Evaluation   SLP: 4.7, 4.8, 4.10; D2L: 1.1-1.4; MML: 8   -
9/23   -   -   Homework 1.1: How many words do you know?; Homework 1.2: Test a chatbot
9/26   5. N-gram Language Models   SLP: 3.1-3.4   -
9/29   6. Text Classification and Naive Bayes   SLP: 4.1-4.6; D2L: 19.9   -
10/2   -   -   Quiz 2: Vocabularies, Normalization, Language Models, Text Classification
10/3   7. Linear Models I   SLP: 5.1, 5.2, 5.3; D2L: 3.1, 4.1   -
10/6   8. Linear Models II   SLP: 5.4, 5.5, 5.6; D2L: 3.6, 3.7   -
10/7   -   -   Homework 2: Language Models
10/10   NO CLASS: Indigenous People Day   -   -
10/13   9. Vector Space Semantics   SLP: 6.1-6.6   -
10/14   -   -   Reading Research Papers
10/16   -   -   Quiz 3: Naive Bayes
10/17   10. Word Embeddings   SLP: 6.8-6.11; D2L: 15.1-15.7   -
10/20   11. Sequence Labeling and Hidden Markov Models   SLP: 8.1, 8.2, 8.3, 8.6   -
10/23   -   -   Quiz 4: Logistic Regression and Word Embeddings
10/24   12. Guest Lecture: Ethics I   -   -
10/27   13. Guest Lecture: Ethics II   -   -
10/28   -   -   Homework 3: Text Classification
10/31   14. Conditional Random Fields and Viterbi Algorithm   SLP: 8.4, 8.5   -
11/3   15. Multilayer Perceptron, Neural Language Models   SLP: 7.1, 7.2, 7.3, 7.5; D2L: 5.1, 5.2   -
11/6   -   -   Quiz 4: Logistic Regression and Word Embeddings
11/7   16. Training Neural Networks   SLP: 7.4; D2L: 5.3   -
11/10   17. Recurrent Neural Networks I   SLP: 9.1, 9.2; D2L: 9   -
11/13   -   -   Research Paper Presentations (CS6120 only); Quiz 5: POS, HMMs and Viterbi
11/14   18. Recurrent Neural Networks II   SLP: 9.3, 11.1, 11.2, 11.4, 11.8; D2L: 10, 11.1-11.5   -
11/17   19. Transformers   SLP: 9.4, 11.5, 11.6; D2L: 11.6-11.9   -
11/18   -   -   Final Project proposal
11/20   -   -   Homework 4: Neural Networks and Word Embeddings; Value Sensitive Design (Optional for extra-credit)
11/21   20. Pretrained Language Models   D2L: 15.8-15.10   -
11/24   NO CLASS: Thanksgiving   -   -
11/28   21. Pretrained Language Models II   D2L: 16   -
12/1   22. Applications I   -   -
12/5   23. Applications II   -   -
12/9   -   -   Project Due