Computational Linguistics & Phonetics, Fachrichtung 4.7, Universität des Saarlandes

Unlocking the Secrets of the Past: Text Mining for Historical Documents (WS 2008/09)


What? Project seminar (Projektseminar): Computational Linguistics (Bachelor and MSc)
Who? Caroline Sporleder   (csporled AT coli)
Martin Schreiber   (m.schreiber AT mx.uni-saarland.de)
When? Block seminar, 16.02.-28.02.2009, 8:30-16:00
Where?   Building C7 2, Conference Room 2.11
Note: Please come to the introductory meeting on 12, 2009, 18:15-19:00 (Building C7 2, Conference Room 2.11). Topics for the presentations will be assigned during this meeting. Also, please email us by December 15 to let us know that you are interested in joining the course.


Course Information
This course offers hands-on experience with specific text mining tasks, such as named entity recognition and disambiguation, relation extraction and template filling, segmentation of semi-structured text, automatic link detection between documents, and error detection and correction. The text mining techniques will be implemented and tested on real-world examples from the cultural heritage domain, such as historical documents. The cultural heritage domain is a good testbed for NLP methods because a wealth of information in this domain is contained in raw, unprocessed and often relatively unstructured texts (in contrast to the biomedical domain, where a lot of data is already in a fairly structured form). Text mining can make such documents more accessible to researchers and laypersons alike. Moreover, language change over time, unorthodox orthography, and errors introduced during digitisation (e.g. OCR errors) make this domain particularly challenging (and thus interesting!) for natural language processing.


Course Structure
This is an interdisciplinary course that is open to students from both Computational Linguistics and History. The aim is to design, implement and test practical NLP and text mining solutions to make historical documents more accessible. Possible topics include: detecting and correcting (OCR) errors, information extraction from historical manuscripts, finding links between documents, converting unstructured documents into searchable databases, and knowledge discovery from historical documents.

The course consists of a theoretical and a practical part. In the theoretical part, students give presentations on topics relevant to the course. In the practical part, small interdisciplinary groups will work on implementing a system that solves a real problem relevant to the documents discussed in the seminar.

Course Objectives
  • obtain hands-on experience with text mining techniques (design, implementation, testing)
  • learn about specific problems and challenges that arise when developing NLP tools for the cultural heritage domain
  • work in interdisciplinary teams (finding out what users of NLP technology want, communicating with non-experts, developing solutions according to user specifications)
Course credit (Coli)
  • Project seminar (MSc/BSc): class presentation and practical work, including a short report (an additional oral exam can be arranged)
Place in the curriculum (Coli)
  • as a project seminar in the B.Sc. programme: standard period of study, 5th/6th semester
  • as a project seminar in the M.Sc. programme
Credit points (Coli)
  • as a project seminar (MSc/BSc): 5 CP