LING 515 – Statistical Natural Language Processing
Instructor:
Office: 355 Ross Hall
Phone: 294-3368
Email: pendar (at iastate)
Required Text: Manning, C. and H. Schütze. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.
Automatic processing of natural languages has always been a great challenge for researchers in linguistics, computer science, and artificial intelligence. Since its inception, computer science has been preoccupied with natural language and has sought input from a variety of disciplines, such as linguistics, logic, philosophy, mathematics, and statistics. This course introduces students to statistical NLP, one of the most successful approaches to natural language processing (NLP). It is a rapidly growing field with many real-world applications and has become an integral part of computational linguistics. The course covers the fundamental ideas and problems in the field.
Students will come to understand the theoretical foundations of natural language processing and the contributions of its underlying fields: computer science, linguistics, machine learning, and statistics. They will also learn about existing, emerging, and possible real-world computer applications involving natural language interfaces. Some of the topics covered in this course include: text and corpora, automatic text categorization, maximum likelihood models of language, n-gram models and statistical smoothing, word prediction, hidden Markov models for NLP, part-of-speech tagging, word-sense disambiguation, document and text retrieval, and automatic text summarization.
Periodic Presentations: 50%
Final Paper/Project: 50%
Prerequisites: Com S 207 (or 227), Stat 330, or equivalent
Week | Date | Topic
1 | M 8/20 |
 | W 8/22 |
2 | M 8/27 | "A Maximum Entropy Approach to Identifying Sentence Boundaries"
 | W 8/29 | N-gram Models over Sparse Data
3 | M 9/3 | Labor Day – No Class
 | W 9/5 | Word Sense Disambiguation
4 | M 9/10 |
 | W 9/12 | Markov Models, POS tagging, HMM Tutorial (by Andrew Moore)
5 | M 9/17 | Nick away – No Classes
 | W 9/19 |
6 | M 9/24 | Oksana Yakhnenko and Jesse Lane on POS tagging
 | W 9/26 | Mohammed Al Qady & Mary Still on WSD
7 | M 10/1 | Vector space models & Text Categorization
 | W 10/3 | NLP and Machine Learning Applications to Authorship Attribution; Koppel et al. (2003), Koppel et al. (2004), Gamon (2004)
8 | M 10/8 | Hojun Jaygarl & Sean Chen on text categorization (classifying bug reports & identifying duplicate bug reports)
 | W 10/10 | Tarek Mahfouz, Hojun Jaygarl & Kevin Godby on text categorization
9 | M 10/15 | Latent Semantic Analysis
 | W 10/17 | Vlad Sukhoy & Tarek Mahfouz on LSA (Choi, 2000; Choi et al., 2001). LSA links from Vlad: http://citeseer.ist.psu.edu http://www.cs.brown.edu/people http://lsi.argreenhouse.com http://lsa.colorado.edu/
10 | M 10/22 | Mohammed Al Qady on Legal Ontologies (Lame, 2004; Saias & Quaresma, 2004)
 | W 10/24 | Discourse Segmentation (Hearst, 1997)
11 | M 10/29 | Probabilistic Parsing
 | W 10/31 |
12 | M 11/5 | Hojun Jaygarl on IR by semantic similarity (Hliaoutakis et al., 2006)
 | W 11/7 | Elena Cotos (Marcu & Echihabi); Oksana Yakhnenko on discourse structure (Sporleder & Lascarides, 2004)
13 | M 11/12 | Automatic Text Summarization
 | W 11/14 | Kevin Godby (Automatic Text Summarization)
14 | M 11/19 | Thanksgiving Break – No Classes
 | W 11/21 |
15 | M 11/26 | Automated Essay Scoring
 | W 11/28 | Elena Cotos & Mary Still on essay scoring
16 | M 12/3 | Vlad Sukhoy & Jesse Lane (TBA)
 | W 12/5 |
The Linguistic Data Consortium (http://www.ldc.upenn.edu/) and the European Language Resources Association (http://www.elra.info/) are two of the most comprehensive linguistic data repositories available. ISU has a subscription to LDC. You can see what resources are available at the library by searching for “Linguistic Data Consortium” in the library catalog.
The Association for Computational Linguistics (http://www.aclweb.org/) and its affiliated organizations and special interest groups (SIGs) hold a number of high-quality conferences every year. ACL maintains an online archive of the proceedings of all of its conferences at http://acl.ldc.upenn.edu/.
Computational linguistics work is also presented at conferences organized by AAAI, IEEE, and ACM. Every year there are also a large number of other related conferences worldwide.
Some of the best journals in the field are Computational Linguistics, Natural Language Engineering, Literary and Linguistic Computing, Research on Language and Computation, Grammars, and Information Retrieval.
LinguistList (http://www.linguistlist.org/) is an online forum for job announcements, conference calls, and other information related to linguistics and computational linguistics.