Spring 2016 -- CSCE 670: Information Storage and Retrieval



Meeting times and location      MWF 1:50 pm - 2:40 pm, CHEN 104

Course Description and Prerequisites:

It is the information age! We face challenges brought by large-scale and unstructured information on open systems such as the Web or social media. Through this course, we'll study the theory, design, and implementation of text-based and Web-based information retrieval systems, including an examination of web and social media mining algorithms and techniques at the core of modern search and data mining applications.

Prerequisites: This is a graduate-level course. While there are no official prerequisites, previous exposure to linear algebra and basic probability theory is beneficial. You should be able to design and develop large programs and learn new software libraries on your own.

Learning Outcomes:

The goal of this course is to develop a comprehensive understanding of the fundamental issues, techniques, applications, and future directions of information retrieval. In particular, by the end of the semester students will be able to:

  • Define and explain the key concepts and models relevant to information storage and retrieval, including efficient text indexing; Boolean, vector space, and probabilistic retrieval models; relevance feedback; document clustering and text categorization; and Web search, including crawling, indexing, and link-based algorithms like PageRank.
  • Design, implement, and evaluate the core algorithms underlying a fully functional web search / data mining system, including the indexing, retrieval, and ranking components, as well as advanced algorithms like document clustering and text categorization.
  • Identify the salient features and apply recent research results in web search and data mining, including topics such as collaborative filtering, adversarial information retrieval, location-based services, and social information management.

Instructor Information:

Name Xia "Ben" Hu
Telephone number 979-845-8873
Email address hu@cse.tamu.edu
Office hours Mon 3-5pm, or by appointment
Office location 330B H.R. Bright Building

Textbook and/or Resource Material:

Primary Textbook:

"Introduction to Information Retrieval"
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze
Cambridge University Press. 2008. Available at Cambridge University Press, at Amazon, and other fine booksellers.

Other Textbooks:
"Mining of Massive Datasets"
Anand Rajaraman and Jeffrey D. Ullman

"Data-Intensive Text Processing with MapReduce"
Jimmy Lin and Chris Dyer

"Networks, Crowds, and Markets: Reasoning About a Highly Connected World"
David Easley and Jon Kleinberg

Grading Policies:

Grading Scale: A = 90-100%, B = 80-89%, C = 70-79%, D = 60-69%, F = below 60%.

The course grading policy is as follows:

  • Class participation and quizzes - 5%
  • Homework assignments - 20%
  • Project - 30%
  • Exams - 45%
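
As a rough illustration of how the weights above combine, the following minimal sketch computes a weighted course percentage and maps it to a letter grade using the grading scale. The component names, the sample scores, and the choice to give a boundary value such as exactly 90% the higher grade are assumptions made for this example, not official policy.

```python
# Weights from the grading policy, in percentage points of the final grade.
WEIGHTS = {
    "participation_and_quizzes": 5,
    "homework": 20,
    "project": 30,
    "exams": 45,
}


def final_percentage(scores):
    """Combine per-component percentages (0-100) into a weighted total (0-100)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(percentage):
    """Map a percentage to a letter grade.

    Assumption: a value exactly on a boundary (e.g. 90) earns the higher grade.
    """
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if percentage >= cutoff:
            return letter
    return "F"


# Hypothetical component scores, for illustration only.
example_scores = {
    "participation_and_quizzes": 100,
    "homework": 90,
    "project": 80,
    "exams": 80,
}

total = final_percentage(example_scores)
print(total, letter_grade(total))
```

Because exams carry 45% of the grade, a change of ten points in the exam average moves the final percentage by 4.5 points, more than the same change in any other component.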

Attendance and Make-up Policies:

Ten quizzes will be given at random times during the semester and also serve as the attendance record. Seven completed quizzes are required for full credit. If you complete at least seven quizzes, no further documentation is needed. Otherwise, an excused absence is required; without one, one point is deducted for each quiz short of seven. The specific excused absences and rules can be found at http://student-rules.tamu.edu/rule07

Course Topics:

This course will mainly cover the following topics:

  • Statistical properties of text
  • Vector space model
  • Statistical language models
  • Learning to rank
  • Recent evaluation methods, NDCG, using clickthrough data
  • Network essentials, network measures, hubs and authorities, PageRank
  • Homophily, social influence, reciprocity
  • Classification, naive Bayes, kNN, SVM
  • Clustering, K-means, community detection
  • Recommender systems

Other Pertinent Course Information:

Homework: In addition to some regular homework exercises (assignments and quizzes), students are encouraged to participate in classroom discussions and Q&A.

Project: Students are expected to work on programming projects. We will discuss the format in our first class. The project evaluation consists of a progress report, a project presentation and/or demonstration, and a written report.

Americans with Disabilities Act (ADA):

The Americans with Disabilities Act (ADA) is a federal anti-discrimination statute that provides comprehensive civil rights protection for persons with disabilities. Among other things, this legislation requires that all students with disabilities be guaranteed a learning environment that provides for reasonable accommodation of their disabilities. If you believe you have a disability requiring an accommodation, please contact Disability Services, currently located in the Disability Services building at the Student Services at White Creek complex on west campus or call 979-845-1637. For additional information, visit http://disability.tamu.edu.

Academic Integrity:

"An Aggie does not lie, cheat, or steal, or tolerate those who do."

Upon accepting admission to Texas A&M University, a student immediately assumes a commitment to uphold the Honor Code, to accept responsibility for learning, and to follow the philosophy and rules of the Honor System. Students will be required to state their commitment on examinations, research papers, and other academic work. Ignorance of the rules does not exclude any member of the TAMU community from the requirements or the processes of the Honor System. For additional information please visit: http://www.tamu.edu/aggiehonor/

Acknowledgment:

The course materials have been copied or adapted from the previous editions of CSCE 670, taught by Professor James Caverlee.