Word sketches, sense disambiguation and parallel corpora

Adam Kilgarriff - University of Brighton, UK  

Nancy Ide - Vassar College, USA

Dan Tufiş - Institute for Artificial Intelligence, Romania

 

Word sketches are a response to the issue of how we summarise corpus data. They provide a list of the grammatical contexts a word commonly occurs in, and the statistically salient collocations for each grammatical relation. They give an easily readable account of a word's behaviour, suitable for students of the language, lexicographers, and language technologists who need to understand the behaviour of the words that are critical to an application. The first part of the tutorial will describe requirements for corpus data, parsing and grammatical relations, salience statistics, and uses in lexicography and for Word Sense Disambiguation (WSD), with reference to their implementation for English (see http://wasps.itri.brighton.ac.uk). The goal of this part of the tutorial is that students take away the skills, understanding, and confidence to develop word sketches for their own language and their own corpora.
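To make the idea concrete, here is a minimal sketch of how a word sketch might be assembled from parsed corpus data. It assumes the corpus has already been reduced to (headword, relation, collocate) triples; the function name `word_sketch`, the relation names, and the toy data are all hypothetical, and the salience score (pointwise mutual information weighted by log frequency) is one plausible choice rather than necessarily the statistic used in the implementation referred to above.

```python
import math
from collections import Counter, defaultdict

def word_sketch(triples, headword, top_n=5):
    """Return {relation: [(collocate, frequency, salience), ...]} for headword.

    `triples` is an iterable of (headword, relation, collocate) tuples,
    assumed to come from a parsed corpus (hypothetical input format).
    """
    triples = list(triples)
    triple_freq = Counter(triples)                             # f(head, rel, colloc)
    head_rel_freq = Counter((h, r) for h, r, _ in triples)     # f(head, rel, *)
    rel_colloc_freq = Counter((r, c) for _, r, c in triples)   # f(*, rel, colloc)
    rel_freq = Counter(r for _, r, _ in triples)               # f(*, rel, *)

    sketch = defaultdict(list)
    for (h, r, c), f in triple_freq.items():
        if h != headword:
            continue
        # Pointwise mutual information of head and collocate within relation r.
        pmi = math.log2(f * rel_freq[r] / (head_rel_freq[(h, r)] * rel_colloc_freq[(r, c)]))
        # Assumed salience score: PMI weighted by log frequency,
        # so that very rare pairs do not dominate the list.
        salience = pmi * math.log(f + 1)
        sketch[r].append((c, f, round(salience, 2)))

    # Keep the top_n most salient collocates for each relation.
    return {
        r: sorted(rows, key=lambda row: row[2], reverse=True)[:top_n]
        for r, rows in sketch.items()
    }

# Toy triples, invented for illustration only.
toy_triples = (
    [("bank", "object_of", "rob")] * 3
    + [("bank", "object_of", "see")]
    + [("shop", "object_of", "rob")]
    + [("dog", "object_of", "see")] * 5
    + [("bank", "modifier", "central")] * 2
    + [("park", "modifier", "central")]
    + [("bank", "modifier", "big")]
    + [("dog", "modifier", "big")] * 4
)

for relation, rows in word_sketch(toy_triples, "bank").items():
    print(relation, rows)
```

On the toy data, "rob" ranks above "see" as a verb taking "bank" as object, because "see" is frequent with other nouns as well. A real word sketch would of course need a much larger corpus and a parser or pattern grammar to produce the triples.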

The second part of the tutorial goes into further detail on WSD, providing an overview of the most successful approaches. Most of these deal with word sense disambiguation in a strictly monolingual environment. However, building a sense-disambiguated parallel corpus raises issues beyond disambiguating each monolingual section of the corpus separately. Consistency of the sense assignments across the languages is one of the most difficult to handle. Yet parallel texts contain a great deal of information, implicitly encoded by the human translators, which is under-exploited. In the last part of the tutorial we show how parallel texts can be used in word sense disambiguation and suggest ways of building multilingual sense-tagged corpora with a monolingually distributed workload, to the benefit of all languages represented in a parallel corpus.

The tutorial will be supported by online demos of the automatic extraction of translation equivalents for various language pairs and of the effective use of the resulting translation dictionaries in sense clustering. To complete a WSD task, one should be able to label sense clusters with specific identifiers selected from a predefined sense inventory. One possible way of labelling the sense clusters will be discussed and demonstrated using WordNet. The benefits of interlingually aligned wordnets (à la EuroWordNet or BalkaNet) for the WSD labelling problem will be argued for.
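As a rough illustration of the two steps just described, the sketch below first clusters the occurrences of a source word by their translation equivalents and then labels each cluster with the sense identifiers shared by the source word and its translation in two interlingually aligned inventories. Everything here is an assumption made for illustration: the function names, the input format, the identifiers such as `ILI-finance`, and the tiny English and Romanian inventories are invented, and are not taken from WordNet, EuroWordNet or BalkaNet.

```python
from collections import defaultdict

def cluster_by_translation(aligned_occurrences):
    """Group occurrence ids of one source word by its translation equivalent.

    `aligned_occurrences` is a list of (occurrence_id, translation) pairs,
    assumed to come from a word-aligned parallel corpus.
    """
    clusters = defaultdict(list)
    for occurrence_id, translation in aligned_occurrences:
        clusters[translation].append(occurrence_id)
    return dict(clusters)

def label_clusters(source_word, clusters, source_senses, target_senses):
    """Label each cluster with the interlingual ids shared by both words.

    An empty list means the two inventories offer no common sense;
    more than one id means the translation does not fully disambiguate.
    """
    labels = {}
    for translation in clusters:
        shared = source_senses.get(source_word, set()) & target_senses.get(translation, set())
        labels[translation] = sorted(shared)
    return labels

# Toy aligned inventories with invented interlingual identifiers.
english_senses = {"bank": {"ILI-finance", "ILI-river"}}
romanian_senses = {
    "bancă": {"ILI-finance", "ILI-bench"},
    "mal": {"ILI-river"},
}

# Occurrences of English "bank" with their aligned Romanian translations.
occurrences = [(1, "bancă"), (2, "mal"), (3, "bancă")]

clusters = cluster_by_translation(occurrences)
labels = label_clusters("bank", clusters, english_senses, romanian_senses)
print(clusters)
print(labels)
```

The design point is that the translator's lexical choice does the disambiguation: each language only needs its own words mapped to the shared identifiers, and the intersection selects the sense label for both sides of the alignment.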