
Distributional Semantics and Compositionality

A Workshop at ACL/HLT 2011, in Portland, Oregon.
Supported by: Theseus Program, FZI Forschungszentrum Informatik, Google

The full dataset of the shared task is available here.

The list of accepted papers and the program are now online.

Please use the ACL HLT 2011 instructions for preparing camera-ready papers. The instructions are available at http://www.acl2011.org/authors.shtml.

The test data set for the shared task is available HERE.

We are pleased to announce Dominic Widdows as the invited speaker at DiSCo'2011.

Workshop Description

Any NLP system that does semantic processing relies on the assumption of semantic compositionality: the meaning of a phrase is determined by the meanings of its parts and their combination. However, this assumption does not hold for lexicalized phrases such as idiomatic expressions, which cause pain points not only for semantic but also for syntactic processing (see Sag et al. 2001). Distributional methods in semantics have proved very effective in tackling a wide range of tasks in natural language processing, e.g., document retrieval, clustering and classification, question answering, query expansion, word similarity, synonym extraction, relation extraction, and textual advertisement matching in search engines (see Turney and Pantel 2010 for a detailed overview), but they are still strongly limited by being inherently word-based. Dictionaries and other lexical resources do contain multiword entries, but these are expensive to obtain and not available for all languages to a sufficient extent; moreover, the definition of a multiword varies across resources, and non-compositional phrases are merely a subclass of multiwords.

The workshop brings together researchers who are interested in extracting non-compositional phrases from large corpora by applying distributional models that assign a graded compositionality score to a phrase, as well as researchers interested in expressing compositional meaning with such models. The score denotes the extent to which the compositionality assumption holds for a given expression and can be used, for example, to decide whether the phrase should be treated as a single unit in applications. We emphasize that the focus is on automatically acquiring semantic compositionality. Approaches that employ prefabricated lists of non-compositional phrases should consider a different venue.

This event consists of a main session and a shared task.

References:

Sag, I. A., T. Baldwin, F. Bond, A. Copestake and D. Flickinger. (2001). Multi-word Expressions: A Pain in the Neck for NLP. In Proc. of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002), Mexico City, Mexico.

Turney, P. and P. Pantel. (2010). From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37, 141-188.

Accepted Papers

(LINEAR) MAPS OF THE IMPOSSIBLE: CAPTURING SEMANTIC ANOMALIES IN DISTRIBUTIONAL SPACE
Eva Maria Vecchi, Marco Baroni and Roberto Zamparelli

DETECTING COMPOSITIONALITY USING SEMANTIC VECTOR SPACE MODELS BASED ON SYNTACTIC CONTEXT. SHARED TASK SYSTEM DESCRIPTION
Guillermo Garrido and Anselmo Peñas

DISTRIBUTED STRUCTURES AND DISTRIBUTIONAL MEANING
Fabio Massimo Zanzotto and Lorenzo Dell'Arciprete

EXEMPLAR-BASED WORD-SPACE MODEL FOR COMPOSITIONALITY DETECTION: SHARED TASK SYSTEM DESCRIPTION
Siva Reddy, Diana McCarthy, Suresh Manandhar and Spandana Gella

IDENTIFYING COLLOCATIONS TO MEASURE COMPOSITIONALITY: SHARED TASK SYSTEM DESCRIPTION
Ted Pedersen

MEASURING THE COMPOSITIONALITY OF COLLOCATIONS VIA WORD CO-OCCURRENCE VECTORS: SHARED TASK SYSTEM DESCRIPTION
Alfredo Maldonado-Guerra and Martin Emms

SHARED TASK SYSTEM DESCRIPTION: FRUSTRATINGLY HARD COMPOSITIONALITY PREDICTION
Anders Johannsen, Hector Martinez Alonso, Christian Rishøj and Anders Søgaard

SHARED TASK SYSTEM DESCRIPTION: MEASURING THE COMPOSITIONALITY OF BIGRAMS USING STATISTICAL METHODOLOGIES
Tanmoy Chakraborty, Santanu Pal, Tapabrata Mondal, Tanik Saikh and Sivaju Bandyopadhyay

TWO MULTIVARIATE GENERALIZATIONS OF POINTWISE MUTUAL INFORMATION
Tim Van de Cruys

Workshop Program

Friday June 24, 2011

09:20–09:30 Opening
09:30–10:30 Invited Talk by Dominic Widdows
10:30–11:00 Morning break
11:00–11:40 (Linear) Maps of the Impossible: Capturing Semantic Anomalies in Distributional Space
Eva Maria Vecchi, Marco Baroni and Roberto Zamparelli
11:40–12:05 Distributed Structures and Distributional Meaning
Fabio Massimo Zanzotto and Lorenzo Dell'Arciprete
12:05–12:30 Two Multivariate Generalizations of Pointwise Mutual Information
Tim Van de Cruys
12:30–14:00 Lunch break
14:00–14:30 Distributional Semantics and Compositionality 2011: Shared Task Description and Results
Chris Biemann and Eugenie Giesbrecht
14:30–14:50 Exemplar-Based Word-Space Model for Compositionality Detection: Shared Task System Description
Siva Reddy, Diana McCarthy, Suresh Manandhar and Spandana Gella
14:50–15:10 Identifying Collocations to Measure Compositionality: Shared Task System Description
Ted Pedersen
15:10–15:30 Shared Task System Description: Measuring the Compositionality of Bigrams using Statistical Methodologies
Tanmoy Chakraborty, Santanu Pal, Tapabrata Mondal, Tanik Saikh and Sivaju Bandyopadhyay
15:30–16:00 Afternoon break
16:00–16:20 Detecting Compositionality Using Semantic Vector Space Models Based on Syntactic Context. Shared Task System Description
Guillermo Garrido and Anselmo Peñas
16:20–16:40 Measuring the Compositionality of Collocations via Word Co-occurrence Vectors: Shared Task System Description
Alfredo Maldonado-Guerra and Martin Emms
16:40–17:00 Shared Task System Description: Frustratingly Hard Compositionality Prediction
Anders Johannsen, Hector Martinez Alonso, Christian Rishøj and Anders Søgaard
17:00–17:30 Wrap-Up Discussion

Call for Papers

TOPICS

For the main session, we invite submission of papers on the topic of automatically acquiring a model for semantic compositionality. This includes, but is not limited to:

Shared Task

The organizers extracted candidate phrases from two large-scale, freely available web corpora, ukWaC and deWaC (cf. http://wacky.sslmit.unibo.it/), containing English and German POS-tagged text, respectively. These data were manually evaluated for compositionality using Amazon Mechanical Turk. Workers were shown a sentence with the target phrase in bold and were asked to score how literal the phrase was on a scale from 0 to 10. Four to five different, randomly sampled sentences from the WaCky corpora for UK English and German were presented to four workers each.

Phrases consist of two lemmas and come in three grammatical relations:

Phrases were extracted semi-automatically. The relations were assigned by patterns and manually checked for validity. Phrases were selected so as to balance the data set while controlling for frequency.

Judgments: numerical and coarse

Scores were averaged over valid (non-spam) judgments per phrase and normalized between 0 and 100. These give the numerical scores that are used for the Average Point Difference score. For coarse-grained scoring, phrases with numerical judgments between 0 and 25 were assigned the label "low", numerical judgments between 38 and 62 received the label "medium", and judgments over 75 received the label "high". All other phrases were removed from the coarse judgment data.
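The averaging and coarse labeling described above can be sketched in a few lines of Python. This is only an illustration, not the organizers' script: the function names are hypothetical, the normalization is assumed to be a linear rescaling of the 0 to 10 judgments, and the interval boundaries are assumed to be inclusive where the text says "between".

```python
def normalized_score(judgments, scale_max=10):
    """Average the valid judgments for one phrase and rescale to 0..100.

    Assumption: normalization is a linear rescaling of the 0..scale_max range;
    the text does not say exactly how the scores were normalized.
    """
    if not judgments:
        raise ValueError("at least one valid judgment is required")
    return sum(judgments) / len(judgments) * 100.0 / scale_max


def coarse_label(score):
    """Map a 0..100 score to a coarse label, or None if the phrase is dropped.

    Assumption: 'between' is read as inclusive on both ends.
    """
    if 0 <= score <= 25:
        return "low"
    if 38 <= score <= 62:
        return "medium"
    if score > 75:
        return "high"
    return None  # phrases in the gaps are removed from the coarse data


if __name__ == "__main__":
    score = normalized_score([2, 3, 1, 2])
    print(score, coarse_label(score))
```

Phrases whose score falls into one of the gaps (for example 30 or 70) get no coarse label, which is why the coarse data set is smaller than the numerical one.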

Data

Format
Judgments are provided in a three-column-format:

Volume
The total numbers of items by language and relation are as follows (items with coarse scores in parentheses):
English:

German:

The complete data was split into 40% training, 10% validation and 50% test.
Updated training and validation data can be downloaded HERE.
The archive contains data sets for compositionality judgments for English and German as well as the official scoring scripts.

Test data can be downloaded HERE.
This archive contains the unlabeled test sets as well as random baselines as sample data.

Participants in the task are free to choose whatever method and data resources they use in their submission. Prefabricated lists of multiwords are not allowed. Since the data set is derived from the WaCky corpora, participants are strongly encouraged to use these freely available text collections to build their models of compositionality, thus ensuring the highest possible comparability of results. Furthermore, since the WaCky corpora are provided already POS-tagged and lemmatized, the workload on the participants' side is considerably reduced. This information (POS tags and lemmatization) may or may not be used by the participants. If needed, additional linguistic annotations or processing may also be added to the corpora. To minimize load on the WaCky organizers, please email us (disco2011workshop @ gmail.com) for instructions on obtaining the WaCky corpora. Of course, you can also directly contact the WaCky community at http://wacky.sslmit.unibo.it/doku.php?id=start.

For the challenge, participants submit their system's output on the test set to the task organizers, who score the systems and provide the official scores. Two scoring scripts are provided with the training set:

  1. measuring the mean error by computing the average difference of system score and test data score;
  2. binning the test data scores into three grades of compositionality (non-compositional, somewhat compositional, compositional), ordering the system output by score and optimally mapping the system output to the three bins.
The motivation for (1) is to reproduce the training data scores; the motivation for (2) is to give credit to systems that order the phrases correctly by compositionality but scale the scores differently, something that is easily 'fixed' in applications by appropriate thresholds.
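A rough Python sketch of the two measures may help when sanity-checking a system before submission. It is not the official scoring script distributed with the training data: the function names are hypothetical, the "average difference" is assumed to be the mean absolute difference, and the "optimal mapping" is assumed to mean choosing the two cut points in the score-ordered system output that maximize agreement with the gold coarse labels.

```python
def average_point_difference(system_scores, gold_scores):
    """Mean absolute difference between system and gold scores.

    Assumption: 'difference' is taken as an absolute value.
    """
    pairs = list(zip(system_scores, gold_scores))
    return sum(abs(s - g) for s, g in pairs) / len(pairs)


def best_coarse_accuracy(system_scores, gold_labels):
    """Order items by system score and pick the two cut points that
    maximize agreement with the gold labels 'low', 'medium', 'high'."""
    order = sorted(range(len(system_scores)), key=lambda k: system_scores[k])
    labels = [gold_labels[k] for k in order]
    n = len(labels)

    def prefix_counts(label):
        # counts[i] = number of items with this label among the first i items
        counts = [0]
        for item in labels:
            counts.append(counts[-1] + (item == label))
        return counts

    low = prefix_counts("low")
    med = prefix_counts("medium")
    high = prefix_counts("high")

    best = 0
    for i in range(n + 1):          # items [0, i) are mapped to "low"
        for j in range(i, n + 1):   # items [i, j) to "medium", [j, n) to "high"
            correct = low[i] + (med[j] - med[i]) + (high[n] - high[j])
            best = max(best, correct)
    return best / n


if __name__ == "__main__":
    print(average_point_difference([10, 50], [20, 30]))
    print(best_coarse_accuracy([0.1, 0.5, 0.9], ["low", "medium", "high"]))
```

Because the second measure depends only on the ordering of the system output, a system that scores phrases on a different scale (here 0 to 1 instead of 0 to 100) is not penalized for it.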
For questions, comments, discussions, please check our Google group for the shared task HERE

Important Dates

Test data release: March 31, 2011
Regular paper submission deadline: April 8, 2011 (Timezone: UTC-12); previously April 1, 2011
Test data submission and system description deadline: April 8, 2011 (Timezone: UTC-12)
Notification of acceptance: April 25, 2011
Camera-ready deadline: May 6, 2011
Post-ACL Workshop: June 24, 2011

Program Committee

Contact

Workshop Chairs: Email: disco2011workshop AT gmail.com