This is the README file accompanying release 1.0 of the Training data for
the FrameNet version of SEMEVAL-2010 Task 10.

This document was created on 11 December 2009 by Josef Ruppenhofer.
It was last modified on 13 December 2009 by Josef Ruppenhofer.

-------------------
Table of Contents

1. Training data contents
2. Formats
3. Further info
4. Citing this dataset
-------------------

= 1. What the training data release contains =

I. The full-text annotation serving as training data
II. A special, Semeval-specific release of FrameNet (r1.4alpha)

== I. Training data ==

The text that serves as training data is taken from Arthur Conan Doyle's
"The Adventure of Wisteria Lodge". Out of this lengthy, two-part story we
annotated the second part, titled "The Tiger of San Pedro". This text is
not subject to copyright and was taken directly from the web. In what
follows we sometimes refer to the training data by the shorthand "Tiger"
or "Tiger annotations".

The annotation was carried out on top of a constituency-parse tree
generated by the Shalmaneser tool. Shalmaneser internally calls the
Collins parser. We accepted the automatic parses and performed no
corrections. The tool used to carry out the annotation is Salto.

Shalmaneser and Salto are available from this site:
http://www.coli.uni-saarland.de/projects/salsa/page.php?id=software

Salto is described in this paper by Erk et al.:
http://www.coli.uni-saarland.de/projects/salsa/papers/lrec06-tool.pdf

The annotations were carried out by two experienced FrameNet annotators
(Josef Ruppenhofer, Russell Lee-Goldman) as follows. A first pass of both
frame-semantic and coreference annotation was carried out by one
annotator. This first pass was then checked by the second annotator, and
all divergences were adjudicated by both annotators. In the final step,
both annotators jointly performed the null-instantiation resolution.

The frame inventory used for the training data is that of FN release 1.4
alpha, rather than that of the last official release, FN r1.3. The
annotation schemes for the frame-semantic, coreference and NI-resolution
annotations are documented in the file annotation_guidelines.pdf.

== II. FrameNet release 1.4alpha ==

To support the teams participating in the task, the FrameNet team has
kindly provided us with a special version of its database. This
Semeval-specific version of the database is not a complete, official
release. It does not contain the lexical annotations as human-viewable
html files, and the checking and error correction procedures carried out
for regular FN releases were not performed for this special version.

FN special release 1.4 alpha contains more frames and lexical units than
r1.3. Though most of the information is the same as before, there have
been changes, e.g. to frame or frame element definitions and to the
assignment of lexical units to frames.

This FN release also contains some lexical units that are not (yet)
officially included by FrameNet but which we wanted to be able to
annotate in our texts and which we felt confident we could reasonably
assign to frames. These LUs have ids above 20000 to make them clearly
identifiable to Semeval participants. They are also included in their
frames in the frames.xml file.

What FN release 1.4alpha contains:

* inside the directory frXML, the usual frame and relations information
  (frames.xml, frRelation.xml, semtypes.xml)
* inside the directory luXML, annotation files for lexical units
* full-text corpus annotation files (American National Corpus; Nuclear
  Threat Initiative)

Additionally, we generated a special file listing incorporated frame
elements (incorporatedfes.txt). This file allows participants to find
information on FE incorporations in a simple way rather than having to
look it up in the LU annotation files.
The info in the file is provided as follows: name of lemma, name of
frame, name of incorporated frame element, all separated by tabs.
For example:

gag.v	Placing	Theme

NB: since FN release 1.4 alpha is a special-purpose release, we didn't
get a new version of all the documentation files that come with a FN
release. We therefore included the documents from the official r1.3b
release (in the "fn14alpha_FN/r1.3docs" sub-directory). The dtd files for
version r1.3 should work with the xml files of r1.4alpha.

= Formats =

* The Tiger training data from Conan Doyle are available only as
  Tiger/SALSA XML. They reside in the subdirectory "tiger". (The format
  name "Tiger/SALSA XML" has nothing to do with the title of our
  annotated data.)
* The annotations of FN release 1.4 alpha are available in two formats,
  the original FN format (in the directory "fn14alpha_FN") and
  Tiger/SALSA XML (in the directory "fn14alpha_Salsa"). The Tiger/SALSA
  XML format was generated by using the built-in conversion facility of
  the Shalmaneser tool.
* The files that have to do with the structure of frames and their
  relations exist only in their original FN-specified format (in the
  directory "fn14StructureFiles").

== Additional remarks on the Tiger/SALSA XML format of the FN-release ==

Shalmaneser reads in a directory of lexical unit files and then outputs
files of 2000 annotated sentences. That is, in the Tiger/SALSA XML
format, annotations for different lexical units are combined in the same
file. Note that when a lexical unit file contains only header information
but no annotated instances, Shalmaneser generates no output for this LU.

We also post-processed the Shalmaneser output somewhat, for reasons that
include later deriving a PropBank version of the annotations.

* We renumbered the sentences across all the output files with
  consecutive numbers.
  The original FN-format sentence ids have a more complex format that
  allows hyphens, and two annotations on the same sentence for different
  targets would refer to the same sentence id. In our cleanup, such cases
  now have distinct sentence ids.
* Since the LU annotations are grouped in files of 2000 instances, the
  annotation instances for some lexical units will be split across two
  such files.

= Further info =

* The evaluation script for the full task and the NI resolution task is
  NOT included here. At the time of this release, the script is still
  being tested. We will announce its availability on the Google group
  that we created for task participants:
  http://groups.google.com/group/semeval2010-task10?pli=1
  We also maintain a page on Task 10 at Saarland University:
  http://www.coli.uni-saarland.de/projects/semeval2010_FG/
* If you find any errors or find files to be missing, please get in touch
  via the Google group or by writing directly to
  {josefr}_A@T_coli.uni-sb.de.
* The Tiger/SALSA XML-format annotation files include information on the
  heads of non-terminals (phrases). Be aware that if you open these files
  in the SALTO tool and then _save_ them, SALTO throws out the head info.
  So if for some reason you do want to make changes to these files and
  keep the head info, please do not use SALTO for those edits.
* Conan Doyle's text is British English whereas most of the FrameNet data
  is American English. In working with the British data, you may need to
  take into account the spelling differences between the varieties (e.g.
  colour versus color). Also, Doyle uses some now obsolete spellings such
  as to-night for "tonight".

= Citing this dataset =

If you make use of these data for purposes other than participation in
the SemEval 2010 shared task "Linking Events and their Participants in
Discourse", we would kindly ask you to refer to the following paper:

Josef Ruppenhofer, Caroline Sporleder, Roser Morante, Collin Baker and
Martha Palmer. "SemEval-2010 Task 10: Linking Events and Their
Participants in Discourse". The NAACL-HLT 2009 Workshop on Semantic
Evaluations: Recent Achievements and Future Directions (SEW-09),
Boulder, Colorado, USA, June 4, 2009.