Participants in this challenge will be given dialogue data between humans and machines and will develop algorithms that can detect dialogue breakdowns (points in a dialogue where users cannot continue the conversation smoothly). You can use the development data (dialogue data with dialogue breakdown annotations) to develop your own dialogue breakdown detection algorithms.
At the time of the formal-run, the challenge participants are provided with dialogue data without dialogue breakdown annotations. The participants then submit the dialogue breakdown detection results obtained with their own algorithms.
If you are interested, you can have a look at this paper for an overview of the dialogue breakdown detection challenge.
The dialogues in the development data are annotated with dialogue breakdown labels. There are three types of labels: O (not a breakdown), T (possible breakdown), and X (breakdown). For most of the dialogues, each utterance is annotated by 30 annotators. The task of dialogue breakdown detection is to decide a single dialogue breakdown label (e.g., X) for each utterance as well as the distribution of O, T, and X (e.g., [0.1, 0.2, 0.7]) for that utterance.
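For example, one straightforward way to obtain both outputs from the 30 annotations of an utterance is majority voting for the single label and relative frequencies for the distribution. The following is a minimal Python sketch; the annotation counts are made up for illustration:

```python
from collections import Counter

# 30 hypothetical annotations for a single system utterance
annotations = ["O"] * 3 + ["T"] * 6 + ["X"] * 21

counts = Counter(annotations)
total = float(len(annotations))

# Single label: majority vote over O, T, X
majority_label = counts.most_common(1)[0][0]                         # "X"

# Distribution: relative frequency of each label
distribution = [counts[label] / total for label in ("O", "T", "X")]  # [0.1, 0.2, 0.7]

print(majority_label, distribution)
```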
This section shows how to run a baseline detector that the organizers prepared.
To run the baseline program and the evaluation script, the following must be installed.
JavaSE-1.7+
Python 2.7.x
Execute the program on the command prompt (Windows) / Terminal (Mac, Linux).
You can check whether each program is correctly installed by executing the following commands.
`$java -version`
`$python -V`
The baseline program uses words included in each utterance as features (Bag-of-Words) and detects dialogue breakdowns by using Conditional Random Fields (CRFs). This program outputs three kinds of labels, O (not a breakdown), T (possible breakdown), and X (breakdown), together with probability distributions; however, since this is a simple baseline, the output distribution is deterministic, i.e., 1.0 for the output label and 0.0 for the others.
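The baseline itself is a Java program, but the general idea (Bag-of-Words features fed to a CRF sequence labeler) can be sketched in a few lines of Python using, for example, the third-party sklearn-crfsuite package (not required for the challenge). The utterances, labels, and feature names below are made up for illustration, and unlike the baseline, the CRF marginals here are not rounded to a deterministic distribution:

```python
import sklearn_crfsuite  # third-party package, not part of the challenge tools

def bow_features(utterance):
    # Bag-of-Words: every word in the utterance becomes a binary feature
    return {"word=" + w: 1 for w in utterance.lower().split()}

# Toy training data: one dialogue = one sequence of system utterances
train_utterances = ["hello how can i help you",
                    "i do not understand what you said"]
train_labels = ["O", "X"]

X_train = [[bow_features(u) for u in train_utterances]]
y_train = [train_labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X_train, y_train)

X_test = [[bow_features("i do not understand")]]
print(crf.predict(X_test))            # predicted labels, e.g., [['X']]
print(crf.predict_marginals(X_test))  # per-label probability estimates
```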
By following the steps below, you can train a simple dialogue breakdown detector and evaluate its performance. This example uses the IRIS dataset in the development data.
Place the evaluation script eval.py in the same directory as the baseline program. Then divide the development data into two sets and place them under baseline/train/ and baseline/test/. Here, we sorted the json files alphabetically and put the initial 50 files (from iris_00014.log.json to iris_00105.log.json) under train/ and the remaining 50 files (from iris_00106.log.json to iris_00161.log.json) under test/ (a short script sketching this split is shown after the directory listing below). The resulting directory structure is as follows.
baseline/
| DBDBaseline.jar
| eval.py
├ train/
| | iris_00014.log.json
| | iris_00016.log.json
| | ・・・
| └ iris_00105.log.json
└ test/
  | iris_00106.log.json
  | iris_00107.log.json
  | ・・・
  └ iris_00161.log.json
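If you prefer to script the split, the following sketch copies the alphabetically sorted files into the two directories. The source path is an assumption; replace it with the location of your unpacked development data:

```python
import os
import shutil

# Hypothetical path to the unpacked IRIS development data; adjust as needed
src_dir = "./iris_dev_data/"

# Sort the json files alphabetically; the first 50 go to train/, the rest to test/
files = sorted(f for f in os.listdir(src_dir) if f.endswith(".log.json"))
splits = {"./train/": files[:50], "./test/": files[50:]}

for dst, names in splits.items():
    if not os.path.isdir(dst):
        os.makedirs(dst)
    for name in names:
        shutil.copy(os.path.join(src_dir, name), dst)
```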
Run the baseline program with the following command. It trains a detector on the dialogues under train/, applies it to the dialogues under test/, and writes the detection results under out/.
`$java -jar DBDBaseline.jar -l ./train/ -p ./test/ -o ./out/ -t 0.5`
The evaluation script eval.py evaluates the output of a dialogue breakdown detector. It can be executed with the following command.
`$python eval.py -p ./test/ -o ./out/ -t 0.5`
Running the command produces output like the following.
######### Data Stats #########
File Num : 50
System Utterance Num : 500
O Label Num : 315
T Label Num : 17
X Label Num : 168
######### Results #########
Accuracy : 0.440000 (220/500)
Precision (X) : 0.329004 (76/231)
Recall (X) : 0.452381 (76/168)
F-measure (X) : 0.380952
Precision (T+X) : 0.790441 (215/272)
Recall (T+X) : 0.555556 (215/387)
F-measure (T+X) : 0.652504
JS divergence (O,T,X) : 0.468422
JS divergence (O,T+X) : 0.339554
JS divergence (O+T,X) : 0.333161
Mean squared error (O,T,X) : 0.241287
Mean squared error (O,T+X) : 0.306044
Mean squared error (O+T,X) : 0.293160
###########################
The meaning of the evaluation metrics can be found in the evaluation metrics section.
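As a rough illustration of the distribution-based metrics (the exact definitions used by eval.py are the ones given in the evaluation metrics section, which may differ in details such as the logarithm base), Jensen-Shannon divergence and mean squared error between a gold and a predicted label distribution can be computed as follows:

```python
import math

def js_divergence(p, q):
    # Jensen-Shannon divergence between two label distributions over (O, T, X),
    # here computed with base-2 logarithms
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi, 2) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mean_squared_error(p, q):
    # Mean squared error between the two distributions
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) / float(len(p))

gold = [0.1, 0.2, 0.7]       # distribution over O, T, X from the annotators
predicted = [0.0, 0.0, 1.0]  # deterministic output of the baseline
print(js_divergence(gold, predicted))
print(mean_squared_error(gold, predicted))
```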
After this getting-started section, we hope you have a good idea of what to do. We, the organizers, look forward to many good dialogue breakdown detection algorithms being submitted at the formal-run.
See this section for information on how to submit your runs at the formal-run.
Please also refer to the output format when preparing files for the formal-run; note that the baseline program produces its output in this same format.