LING 515 – Statistical Natural Language Processing

Course Information

Instructor: Nick Pendar

Office: 355 Ross Hall

Phone: 294-3368

Email: pendar (at iastate)

Required Text: Manning, C. and H. Schütze. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT.


Introduction

Automatic processing of natural languages has always been a great challenge for researchers in linguistics, computer science, and artificial intelligence. Since its inception, computer science has been preoccupied with natural language, and has sought input from a variety of disciplines, such as linguistics, logic, philosophy, mathematics, and statistics. This course introduces students to statistical natural language processing (NLP), one of the most successful approaches to the problem. Statistical NLP is a rapidly growing field with many real-world applications and has become an integral part of computational linguistics. The course covers the fundamental ideas and problems in the field.

Course Outcomes

Students will understand the fundamental theoretical infrastructure of natural language processing and the contributions of its underlying fields: computer science, linguistics, machine learning, and statistics. They will also learn about existing, emerging, and possible real-world computer applications involving natural language interfaces. Some of the topics covered in this course are: Text & Corpora, Automatic Text Categorization, Maximum Likelihood Models of Language, N-gram Models and Statistical Smoothing, Word Prediction, Hidden Markov Models for NLP, Part-of-Speech Tagging, Word-Sense Disambiguation, Document & Text Retrieval, and Automatic Text Summarization.

Evaluation


Periodic Presentations: 50%

Final Paper/Project: 50%


Prerequisites

COM S 207 (or 227), STAT 330, or equivalent

Syllabus

Week  Date      Topic
1     M 8/20    Introduction
      W 8/22    Text & Corpora
2     M 8/27    "A Maximum Entropy Approach to Identifying Sentence Boundaries"
      W 8/29    N-gram Models over Sparse Data
3     M 9/3     Labor Day – No Class
      W 9/5     Word Sense Disambiguation
4     M 9/10
      W 9/12    Markov Models, POS tagging, HMM Tutorial (by Andrew Moore)
5     M 9/17    Nick away – No Classes
      W 9/19
6     M 9/24    Klein & Manning, Lafferty et al., Oksana's Slides
                Wang & Schuurmans, Jesse's Slides
                Oksana Yakhnenko and Jesse Lane on POS tagging
      W 9/26    Stevenson & Wilks, Ide & Véronis
                Mohammed Al Qady & Mary Still on WSD
7     M 10/1    Vector space models & Text Categorization
      W 10/3    NLP and Machine Learning Applications to Authorship Attribution
                Koppel et al. (2003), Koppel et al. (2004), Gamon (2004)
8     M 10/8    Joachims (1998); Basu et al. (2003)
                Hojun Jaygarl & Sean Chen on text categorization (classifying bug reports & identifying duplicate bug reports)
      W 10/10   Tarek Mahfouz, Hojun Jaygarl & Kevin Godby on text categorization
9     M 10/15   Latent Semantic Analysis
      W 10/17   Vlad Sukhoy & Tarek Mahfouz on LSA (Choi, 2000; Choi et al., 2001)
                LSA Links from Vlad:
                http://citeseer.ist.psu.edu/deerwester90indexing.html
                http://www.cs.brown.edu/people/th/papers/Hofmann-UAI99.pdf
                http://lsi.argreenhouse.com/lsi/LSIpapers.html
                http://lsa.colorado.edu/
10    M 10/22   Mohammed Al Qady on Legal Ontologies (Lame, 2004; Saias & Quaresma, 2004)
      W 10/24   Discourse Segmentation (Hearst, 1997)
11    M 10/29   Probabilistic Parsing
      W 10/31
12    M 11/5    Hojun Jaygarl on IR by semantic similarity (Hliaoutakis et al., 2006)
      W 11/7    Elena Cotos (Marcu & Echihabi); Oksana Yakhnenko on discourse structure (Sporleder & Lascarides, 2004)
13    M 11/12   Automatic Text Summarization
      W 11/14   Kevin Godby (Automatic Text Summarization)
14    M 11/19   Thanksgiving Break – No Classes
      W 11/21
15    M 11/26   Automated Essay Scoring
      W 11/28   Elena Cotos & Mary Still on essay scoring
16    M 12/3    Vlad Sukhoy & Jesse Lane (TBA)
      W 12/5


Sources of Data

The Linguistic Data Consortium (http://www.ldc.upenn.edu/) and the European Language Resources Association (http://www.elra.info/) are two of the most comprehensive linguistic data repositories available. ISU has a subscription to LDC. You can see what resources are available at the library by searching for “Linguistic Data Consortium” in the library catalog.

Other Resources

The Association for Computational Linguistics (http://www.aclweb.org/) and its affiliated organizations and special interest groups (SIGs) hold a number of high-quality conferences every year. ACL maintains an online archive of the proceedings of all of its conferences at http://acl.ldc.upenn.edu/.

Computational linguistics work is also presented at conferences organized by AAAI, IEEE, and ACM, and at a large number of other related conferences held worldwide every year.

Some of the best journals in the field are Computational Linguistics, Natural Language Engineering, Literary and Linguistic Computing, Research on Language and Computation, Grammars, and Information Retrieval.

LinguistList (http://www.linguistlist.org/) is an online forum for job announcements, conference calls, and other information related to linguistics and computational linguistics.

Recommended Papers for Presentation

Word Sense Disambiguation

Word Prediction & Text Input

Computational Morphology

Text Categorization & Segmentation

Summarization

Question Answering

Statistical Machine Translation

Automated Essay Scoring