Lexico-semantic resources play an essential role in natural language processing and related applications such as information retrieval. Unfortunately, their construction is extremely costly and rarely guided by practical considerations, which poses a problem especially for less-resourced languages. One possible solution is to rely on crowdsourcing of lexico-semantic resources. Although crowdsourcing has proven to be a viable option for reducing the overall costs, there still does not exist a comprehensive crowdsourcing methodology for the incremental construction of large-scale lexico-semantic resources.
This project aims to fill this gap by investigating computational models and methods for incremental and cost-efficient crowdsourcing of lexico-semantic resources. The research will combine dynamic crowdsourcing, corpus-based models of semantics (distributional semantics and topic models), and active machine learning methods into a comprehensible and language-independent crowdsourcing framework, the SenseHive.
The SenseHive consists of a flexible, graph-based representation of senses and lexico-semantic relations (SenseGraph), coupled with an incremental construction methodology. In SenseGraph, senses are dynamically split up and merged based on the analysis of human judgments on corpus-extracted data. In the first phase, we will implement a prototype of the SenseHive framework and use it for focused evaluation experiments on Croatian, Slovene, and English data to answer the relevant research questions. As a proof of concept, in the second phase we will use SenseHive to construct a medium-sized lexico-semantic resource for Croatian by enlarging and enriching existing lexico-semantic resources. The proposed research will advance the state of the art in computational lexical semantics and semi-automated construction of linguistic resources, and yield a lexico-semantic resource for Croatian of great practical value.
Principal investigator: Assoc. Prof. Jan Šnajder