Zeals TECH BLOG

This blog shares the technology and engineering culture of Zeals Inc., the company behind the chat commerce service "Zeals," which uses chatbots to bring an omotenashi (hospitality) revolution to the web.

NTCIR-15 Dialogue Evaluation Task DialEval-1


A brief introduction and a simple approach

DialEval-1 consists of two subtasks, Nugget Detection (ND) and Dialogue Quality (DQ), which aim to evaluate customer-helpdesk dialogues automatically. The ND subtask classifies whether a customer or helpdesk turn is a nugget, i.e., a turn that helps towards problem solving; the DQ subtask assigns quality scores to each dialogue in terms of three criteria: task accomplishment, customer satisfaction, and efficiency. The organizers received 18 official runs from 7 teams. IMTKU (Dept. of Information Management, Tamkang University, Taiwan) and I, on behalf of ZEALS, fine-tuned XLM-RoBERTa via HuggingFace Transformers, fastai, and blurr, and achieved top-1 or top-2 results on various evaluation metrics for both ND and DQ.

NTCIR? DialEval? ND? DQ?

NII (National Institute of Informatics) Testbeds and Community for Information Access Research, a.k.a. NTCIR, is a series of evaluation workshops designed to enhance research in information access technologies, including information retrieval, question answering, text summarization, extraction, etc. It can be seen as an Asia-Pacific counterpart of TREC (Text REtrieval Conference; America), CLEF (Conference and Labs of the Evaluation Forum; Europe), and FIRE (Forum for Information Retrieval Evaluation; South Asia).

DialEval-1 belongs to a family of initiatives, such as DSTC (Dialog System Technology Challenges), that study the complexity of the dialogue phenomenon and various dialogue-related problems. Unlike typical DSTC tasks, however, DialEval-1 focuses on reducing the cost of corpus curation and data analytics. Although WOZ (Wizard-of-Oz), a common approach nowadays, can distribute the workloads of data collection, annotation, and evaluation via crowd-sourcing, the time spent on reaching agreement on an annotation/evaluation standard still grows at least linearly. In other words, human annotation and evaluation have two types of issues:

  • Scalability: costly and hard to decentralize;
  • Measurability: likely unrepeatable/inconsistent even for the same system.

To overcome the issues, DialEval-1 proposes to assess dialogues automatically. Given a customer-helpdesk dialogue (Figure 1), for example, can a system predict which turn of the dialogue is helpful, and by how much?

Figure 1. An example of a dialogue between Customer (C) and Helpdesk (H). The original dialogue on Weibo in Simplified Chinese is on the right, and its English translation is on the left. [1][2]

Therefore, DialEval-1 continues ND and DQ subtasks of Short Text Conversation (STC-3) at NTCIR-14 [1], and further constructs a new test collection, such that the performance evaluation would be fair and realistic.

What to predict, exactly?

Allow me to briefly describe what the expected outcomes are, and how to evaluate them. There will be no math formulae. If the reader is interested in rigorous definitions, please kindly refer to the overview papers [1][2].

Nugget Detection

Figure 2. Nugget state transitions [1][2]

A nugget is a turn by either Helpdesk or Customer. It helps Customer transition from Current State (including Initial State) towards Target State (i.e., when the problem is solved). STC-3 and DialEval-1 organizers define the following 7 nugget types:

  • CNaN / HNaN: Customer or Helpdesk's non-nuggets that are irrelevant to the problem-solving situation;
  • CNUG / HNUG: Customer or Helpdesk's regular nuggets that are relevant to the problem-solving situation;
  • CNUG* / HNUG*: Customer or Helpdesk's goal nuggets that confirm and provide solutions, respectively;
  • CNUG0: Customer's trigger nuggets that initiate the dialogues with certain problem descriptions.

Dialogue Quality

Subjective scores quantify the quality of a dialogue as a whole. The organizers define 3 score types:

  • A-score: Accomplishment; has the problem solved? To what extent?
  • S-score: Satisfaction; how satisfied Customer is with the dialogue;
  • E-score: Effectiveness; how effective and efficient the dialogue is.

Each score is on a 5-point scale of ranks, ranging from -2 to 2.

Evaluation Metrics

Since the issues stem from inconsistent human assessment, the gold standard of the datasets is not a single class or rank but a distribution. Roughly speaking, each distribution collects the annotators' votes for the classes/ranks, so the organizers evaluate a system by comparing how similar its predicted distribution is to the gold-standard distribution. More specifically, the metrics for ND are Root Normalized Sum of Squares (RNSS) and Jensen-Shannon Divergence (JSD), and the ones for DQ are Normalized Match Distance (NMD) and Root Symmetric Normalized Order-aware Divergence (RSNOD). Again, I will spare the readers the math formulae; please see the papers if interested.
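To make the distribution-matching idea concrete, here is a minimal NumPy sketch of the vote-to-distribution step plus two of the metrics (JSD for ND, NMD for DQ). The votes below are made up, and this is not the official evaluation script; refer to the overview papers for the authoritative definitions.

```python
import numpy as np

def votes_to_distribution(votes, classes):
    """Turn raw annotator votes into a gold probability distribution."""
    counts = np.array([votes.count(c) for c in classes], dtype=float)
    return counts / counts.sum()

def jsd(p, q, base=2.0):
    """Jensen-Shannon Divergence between two distributions (an ND metric)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0  # 0 * log 0 is treated as 0
        return np.sum(a[mask] * (np.log(a[mask] / b[mask]) / np.log(base)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def nmd(p, q):
    """Normalized Match Distance: L1 distance between the two
    cumulative distributions, normalized by (#classes - 1) (a DQ metric)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum() / (len(p) - 1)

# Nine hypothetical annotators vote on the A-score (-2..2) of one dialogue
gold = votes_to_distribution([0, 1, 1, 1, 2, 1, 0, 1, 1], classes=[-2, -1, 0, 1, 2])
pred = np.array([0.0, 0.1, 0.2, 0.5, 0.2])  # a system's predicted distribution
```

Note that NMD is order-aware: predicting rank 1 when the gold mass sits on rank 2 is penalized less than predicting rank -2, whereas plain JSD treats all mismatched classes alike.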

How do we approach the tasks?

Despite differences in architecture, almost all participants of STC-3 modeled ND and DQ as classification tasks. We adopt the same tactic for DialEval-1, but pay more attention to tokenization and optimization, because STC-3 showed that none of the participants' architectures outperformed the BiLSTM-with-GloVe baselines [1][2]. Therefore, our approach is simply to fine-tune state-of-the-art pre-trained models with well-established tokenization and optimization tricks.

Tokenization

Based on our preliminary trials, we find that XLM-RoBERTa works well for both Simplified Chinese and English. Although the rationale hasn't been fully examined, we speculate that its sentencepiece-based unigram subwords may be helpful. Besides that, in order to simulate the turn structure of a dialogue, we not only utilize XLM-RoBERTa's special tokens, namely BOS (beginning of sentence; <s>), EOS (end of sentence; </s>), and SEP (separator of sentences; </s>), but also customize several tokens in fastai style to provide some minimal context. For example, consider a tokenized turn below:

xxlen ▁3 <s> xxtrn ▁1 xxsdr ▁customer ▁@ ▁China ▁Uni com ▁Customer ▁Service ▁in ▁Gu ang dong ▁Shi t ! ▁What ▁is ▁your ▁staff ▁service ▁doing ▁on ▁earth ? ▁I ▁have ▁called ▁the ▁staff ▁service ▁for ▁3 ▁hours , ▁but ▁no ▁one ▁answer ▁my ▁phone ▁call . ▁It ▁is ▁no ▁wonder ▁that ▁customer ▁evaluation ▁is ▁so ▁bad . ▁Shi t ! ▁I ▁am ▁at ▁Kang le ▁Middle ▁Road . </s>

Here, xxlen and xxtrn stand for the length of the dialogue in turns and the position of each turn within the dialogue, respectively; the numbers right next to them supply those features of the turns. The same trick goes for xxsdr, which indicates whether the sender is Customer or Helpdesk. When a turn's context says xxtrn ▁1 xxsdr ▁customer, the nugget type is almost definitely CNUG0.
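A tiny sketch of how a raw utterance might be prefixed with these context tokens before the subword tokenizer runs (the helper name is hypothetical, not from our actual pipeline; the tokenizer itself adds the <s>/</s> special tokens afterwards):

```python
def with_turn_context(utterance, turn_index, sender, dialogue_length):
    """Prefix a raw utterance with the fastai-style context tokens
    xxlen / xxtrn / xxsdr described above, for the ND task."""
    return (f"xxlen {dialogue_length} "
            f"xxtrn {turn_index} xxsdr {sender} "
            f"{utterance}")

turn = with_turn_context("The Unicom Internet access can't be connected.",
                         turn_index=1, sender="customer", dialogue_length=3)
```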

As for DQ, a whole dialogue can be tokenized in a similar fashion, where xxlen could be useful for certain quality scores:

xxlen ▁3 <s> xxtrn ▁1 xxsdr ▁customer ▁@ ▁China ▁Uni com ▁Customer ▁Service ▁in ▁Gu ang dong ▁Shi t ! ▁What ▁is ▁your ▁staff ▁service ▁doing ▁on ▁earth ? ▁I ▁have ▁called ▁the ▁staff ▁service ▁for ▁3 ▁hours , ▁but ▁no ▁one ▁answer ▁my ▁phone ▁call . ▁It ▁is ▁no ▁wonder ▁that ▁customer ▁evaluation ▁is ▁so ▁bad . ▁Shi t ! ▁I ▁am ▁at ▁Kang le ▁Middle ▁Road . </s> </s> xxtrn ▁2 xxsdr ▁help desk ▁Hello ! ▁We ▁are ▁sorry ▁for ▁the ▁in con veni ence . ▁100 10 ▁is ▁our ▁service ▁hot ▁line . ▁We ▁may ▁not ▁answer ▁your ▁phone ▁call ▁during ▁the ▁busy ▁hour ▁of ▁tele traf fic . ▁We ▁sincer ely ▁ap ologi ze ▁for ▁that ! ▁What ▁can ▁I ▁do ▁for ▁you ? ▁Thank ▁you ! </s> </s> xxtrn ▁3 xxsdr ▁customer ▁The ▁Uni com ▁Internet ▁access ▁in ▁Z ha oq ing ▁Nur sing ▁School ▁can ' t ▁be ▁connected . ▁What ▁is ▁wrong ▁with ▁it ? ▁You ▁have ▁repair ed ▁it ▁for ▁the ▁whole ▁afternoon ▁in ▁the ▁area . ▁What ▁are ▁you ▁doing ▁on ▁earth ? ▁Shi t ! ▁Why ▁can ▁the ▁China ▁Mobile ▁service ▁hot line ▁be ▁got ▁through ? ▁Shi t ! ▁The ▁service ▁hot line ▁can ' t ▁be ▁got ▁through ▁the ▁whole ▁morning . ▁ 651 ▁I ▁bought ▁a ▁watch ▁last ▁year ▁and ▁the ▁service ▁hot line ▁can ' t ▁be ▁got ▁through ▁within ▁24 ▁hours . ▁I ▁won ' t ▁for give ▁you ! ▁No ▁phone ▁call ▁is ▁answered ! </s>
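The dialogue-level concatenation can be sketched as follows; encode_dialogue is a hypothetical helper that mirrors the separator pattern in the example above, where a pair of </s> tokens sits between consecutive turns (in practice the tokenizer inserts the special tokens itself):

```python
def encode_dialogue(turns):
    """Join context-tagged turns into one sequence for the DQ task.
    Each turn is a dict with 'sender' and 'utterance' keys; consecutive
    turns are separated by the XLM-RoBERTa </s> </s> pair."""
    n = len(turns)
    body = " </s> </s> ".join(
        f"xxtrn {i} xxsdr {t['sender']} {t['utterance']}"
        for i, t in enumerate(turns, start=1))
    return f"xxlen {n} <s> {body} </s>"

seq = encode_dialogue([
    {"sender": "customer", "utterance": "No signal!"},
    {"sender": "helpdesk", "utterance": "Sorry, we are checking now."},
    {"sender": "customer", "utterance": "Still down."},
])
```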

Optimization

Thanks to the great works of HuggingFace, fastai, and blurr, a stable fine-tuning scheme enables us to rapidly trial-and-error for a sufficiently good combination of hyper-parameters. For instance, the core steps for fine-tuning a model of ND can be as short as this:

dls = ... # fastai's Dataloaders
lrnr = Learner(
  dls,
  HF_BaseModelWrapper(hf_model), # blurr's HuggingFace model wrapper 
  opt_func=partial(SOME_OPTIMIZER, decouple_wd=True),
  loss_func=LabelSmoothingCrossEntropyFlat(),
  metrics=[
    accuracy,
    partial(top_k_accuracy, k=2),
    F1Score(average='weighted'),
    MatthewsCorrCoef(),
    ...
  ],
  cbs=[HF_BaseModelCallback],
  splitter=hf_splitter,
  path=DATA_DIR,
).to_fp16()
lrnr.create_opt()
for ...:
  # iteratively decrease base_lr and/or factor
  lrnr.fit_one_cycle(n_epoch, lr_max=slice(base_lr/factor, base_lr))

Admittedly, there are many moving parts in this fine-tuning scheme. After all, the most time-consuming step of fine-tuning is the Grad Student Algorithm (a.k.a. Grad Student Descent), i.e., figuring out a nice combination of magic numbers, a stable optimizer, a reasonable loss function, and other techniques such as discriminative learning rates and mixed precision. Fortunately, with the help of slanted triangular learning rates, each fit_one_cycle(…) takes only minutes to finish.
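For intuition, here is a conceptual sketch of such a warm-up-then-anneal schedule in plain Python. It is not fastai's exact implementation (fastai composes its schedule from combine_scheds internally); the shape is what matters: a short climb from lr_max/div up to lr_max, then a cosine decay back down.

```python
import math

def one_cycle_lr(step, total_steps, lr_max, pct_start=0.25, div=25.0):
    """Approximate one-cycle schedule: linear warm-up from lr_max/div
    to lr_max over the first pct_start of training, then cosine anneal.
    A conceptual sketch only, not fastai's exact scheduler."""
    lr_min = lr_max / div
    warmup = int(total_steps * pct_start)
    if step < warmup:
        t = step / max(1, warmup)                      # warm-up phase
        return lr_min + t * (lr_max - lr_min)
    t = (step - warmup) / max(1, total_steps - warmup)  # anneal phase
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

# e.g., 100 optimizer steps peaking at 3e-5
schedule = [one_cycle_lr(s, 100, lr_max=3e-5) for s in range(100)]
```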

One particular realization is that there is no need for gradual unfreezing. It works pretty well with AWD-LSTM, but makes little difference when fine-tuning Transformers.

That's it?

Yes, mostly. In our experience, although there are more techniques to explore, the bottom line is that, unless we discover a substantially better architecture (for classification) and/or an alternative modeling perspective (beyond simple classification), a good beginning (of tokenization and optimization) almost assures success.

While the official report and datasets won't be published until the end of 2020 [2], the STC-3 datasets are available for anyone who wants to give them a shot: https://sakai-lab.github.io/stc3-dataset/ [1].


REFERENCES
  1. Zeng, Kato, and Sakai: Overview of the NTCIR-14 Short Text Conversation Task: Dialogue Quality and Nugget Detection Subtasks, Proceedings of NTCIR-14, 2019.
  2. Zeng et al. (TBA): Overview of the NTCIR-15 Dialogue Evaluation Task, Proceedings of NTCIR-15, to appear, 2020.