dbdc3

Dataset for DBDC3

This page provides the dataset for DBDC3. The dataset includes both development and evaluation data. After the DBDC3 workshop, we revised some of the annotations. The dataset below includes the revised annotations.

Download

Main data (DBDC3.zip)

This dataset was made by DBDC3 Task Organizers. The data can be used for both profit and nonprofit purposes under the MIT license. The data contain both English and Japanese dialogues with dialogue breakdown annotations for each system utterance.

English data

The following four datasets were used as development data in DBDC3.

CIC_115: 115 dialogues collected in the human evaluation round of the Conversational Intelligence Challenge (CIC). We randomly selected a subset of dialogues and used the initial part of the dialogues for annotation.
IRIS_100: 100 dialogues selected from the WOCHAT dataset.
TKTK_100: 100 dialogues selected from the WOCHAT dataset.
YI_100: 100 dialogues collected by using a chatbot developed at the Moscow Institute of Physics and Technology.

The following four datasets were used as evaluation data in DBDC3.

CIC_50: 50 dialogues collected by CIC.
IRIS_50: 50 dialogues provided by the developer of IRIS.
TKTK_50: 50 dialogues we (the organizers) collected by using TickTock.
YI_50: 50 dialogues we collected in the same way as YI_100.

For each dialogue in CIC_115 and CIC_50, the user was shown a context (a short paragraph) before the dialogue began.

Japanese data

The following three datasets were used as evaluation data in DBDC3.

DCM: 50 dialogues we collected by using NTT DOCOMO’s chat API.
DIT: 50 dialogues we collected by using DIT (Denso IT Laboratory’s system).
IRS: 50 dialogues we collected by using IRS (IR-status based system).

Additional Japanese datasets used in DBDC1 and DBDC2 are also available.

Revision after DBDC3

There are two folders, “dbdc3” and “dbdc3_revised”, in the data folder. “dbdc3” is the one used for the DBDC3 workshop and “dbdc3_revised” is the one we revised after the workshop.

Four datasets were re-annotated and are stored under the dbdc3_revised folder: CIC_115 and YI_100 from dbdc3/en/dev/, and CIC_50 and YI_50 from dbdc3/en/test/. In the original data, an annotator was allowed to annotate only part of a dialogue; in the revised data, each annotator was required to annotate all utterances of a dialogue consecutively. This revision slightly increased inter-annotator agreement.
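The folder layout described above can be sketched as a small lookup. This is only an illustration: the dataset names, splits, and dialogue counts come from this page, but the helper names (dataset_dir, total_dialogues) are hypothetical, and the code assumes that dbdc3_revised mirrors the en/dev and en/test layout of dbdc3 and that each dataset sits in a subfolder named after it. Check the unpacked archive before relying on these paths.

```python
from pathlib import PurePosixPath

# (split, number of dialogues) for each English dataset, as listed on this page.
ENGLISH_DATASETS = {
    "CIC_115": ("dev", 115),
    "IRIS_100": ("dev", 100),
    "TKTK_100": ("dev", 100),
    "YI_100": ("dev", 100),
    "CIC_50": ("test", 50),
    "IRIS_50": ("test", 50),
    "TKTK_50": ("test", 50),
    "YI_50": ("test", 50),
}

# Datasets that were re-annotated after the workshop.
REVISED = {"CIC_115", "YI_100", "CIC_50", "YI_50"}


def dataset_dir(name: str, prefer_revised: bool = True) -> PurePosixPath:
    """Return the relative folder expected to hold an English dataset."""
    split, _ = ENGLISH_DATASETS[name]
    # Assumption: dbdc3_revised mirrors dbdc3's en/<split>/ layout, and each
    # dataset lives in a subfolder named after it.
    use_revised = prefer_revised and name in REVISED
    root = "dbdc3_revised" if use_revised else "dbdc3"
    return PurePosixPath(root) / "en" / split / name


def total_dialogues(split: str) -> int:
    """Sum the dialogue counts of all English datasets in one split."""
    return sum(n for s, n in ENGLISH_DATASETS.values() if s == split)
```

For example, dataset_dir("CIC_115") resolves to the revised folder, while dataset_dir("IRIS_50") falls back to dbdc3 because IRIS_50 was not re-annotated. Passing prefer_revised=False selects the original workshop annotations, which may be useful when comparing against results reported at the workshop.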

Context data (DBDC3_context.zip)

This dataset was made by DBDC3 Task Organizers using the Stanford Question Answering Dataset (SQuAD). The data can be used for both profit and nonprofit purposes under the CC BY-SA 4.0 license. The data contain the short paragraphs used as context for DBDC3.

Contact

If you have any questions or comments, please contact us at dbdc3-organizers@googlegroups.com.