A brief introduction and a simple approach
DialEval-1 consists of two subtasks, Nugget Detection (ND) and Dialogue Quality (DQ), which aim to evaluate customer-helpdesk dialogues automatically. ND classifies whether a customer or helpdesk turn is a nugget, i.e., a turn that helps towards problem solving; DQ assigns quality scores to each dialogue in terms of three criteria: task accomplishment, customer satisfaction, and efficiency. The organizers received 18 runs from 7 teams for the official evaluation. IMTKU (Dept. of Information Management, Tamkang University, Taiwan) and I (on behalf of ZEALS), by fine-tuning XLM-RoBERTa via HuggingFace Transformers, fastai, and blurr, achieved top-1 or top-2 results on various evaluation metrics for both ND and DQ.
- NTCIR? DialEval? ND? DQ?
- What to predict, exactly?
- How do we approach the tasks?
- That's it?
NTCIR? DialEval? ND? DQ?
NII (National Institute of Informatics) Testbeds and Community for Information Access Research, a.k.a. NTCIR, is a series of evaluation workshops designed to enhance research in information access technologies, including information retrieval, question answering, text summarization, extraction, etc. It can be seen as an Asia-Pacific counterpart of TREC (Text REtrieval Conference; America), CLEF (Conference and Labs of the Evaluation Forum; Europe), and FIRE (Forum for Information Retrieval Evaluation; Southern Asia).
DialEval-1 is one of several initiatives, alongside DSTC (Dialog System Technology Challenges), that study the complexity of the dialogue phenomenon and various dialogue-related problems. Unlike typical DSTC tasks, however, DialEval-1 focuses on reducing the cost of corpus curation and data analytics. Although WOZ (Wizard-of-Oz), a common approach nowadays, can distribute the workloads of data collection, annotation, and evaluation via crowd-sourcing, the time spent on reaching agreement on an annotation/evaluation standard still grows linearly, if not worse. In other words, human annotation and evaluation have two types of issues:
- Scalability: costly and hard to decentralize;
- Measurability: likely unrepeatable/inconsistent even for the same system.
To overcome the issues, DialEval-1 proposes to assess dialogues automatically. Given a customer-helpdesk dialogue (Figure 1), for example, can a system predict which turn of the dialogue is helpful, and by how much?
Therefore, DialEval-1 continues the ND and DQ subtasks of Short Text Conversation (STC-3) at NTCIR-14, and further constructs a new test collection, so that the performance evaluation is fair and realistic.
What to predict, exactly?
Allow me to briefly describe the expected outcomes and how they are evaluated. There will be no math formulae; if the reader is interested in rigorous definitions, please kindly refer to the overview papers.
A nugget is a turn, by either Helpdesk or Customer, that helps Customer transition from the Current State (including the Initial State) towards the Target State (i.e., when the problem is solved). The STC-3 and DialEval-1 organizers define the following 7 nugget types:
- CNaN / HNaN: Customer or Helpdesk's non-nuggets that are irrelevant to the problem-solving situation;
- CNUG / HNUG: Customer or Helpdesk's regular nuggets that are relevant to the problem-solving situation;
- CNUG* / HNUG*: Customer or Helpdesk's goal nuggets that confirm and provide solutions, respectively;
- CNUG0: Customer's trigger nuggets that initiate the dialogues with certain problem descriptions.
DQ uses subjective scores to quantify the quality of a dialogue as a whole. The organizers define 3 score types:
- A-score: Accomplishment; has the problem been solved? To what extent?
- S-score: Satisfaction; how satisfied Customer is with the dialogue;
- E-score: Effectiveness; how effective and efficient the dialogue is.
Each score is on a 5-point scale of ranks, ranging from -2 to 2.
Since the issues stem from inconsistent human assessment, the gold standard of the datasets is not plain classes or ranks but distributions. Roughly speaking, the distributions are annotators' votes for each class/rank, and the organizers evaluate a system's prediction by measuring how similar the predicted distribution is to the gold standard's. More specifically, the metrics for ND are Root Normalized Sum of Squares (RNSS) and Jensen-Shannon Divergence (JSD), and the ones for DQ are Normalized Match Distance (NMD) and Root Symmetric Normalized Order-aware Divergence (RSNOD). Again, I will spare the readers the math formulae; please see the papers if interested.
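To make the distribution-matching idea concrete, here is a minimal pure-Python sketch of JSD, not the official evaluation script; the example vote distributions and label order are hypothetical:

```python
import math

def jsd(p, q, base=2.0):
    """Jensen-Shannon divergence between two discrete distributions.

    p and q are same-length sequences of probabilities summing to 1;
    0 means identical distributions, 1 (with base 2) means disjoint ones.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):  # Kullback-Leibler divergence, skipping zero-mass terms
        return sum(ai * math.log(ai / bi, base) for ai, bi in zip(a, b) if ai > 0)

    return (kl(p, m) + kl(q, m)) / 2

# e.g., hypothetical annotator votes for one customer turn,
# normalized over (CNaN, CNUG, CNUG*, CNUG0)
gold = [0.0, 0.75, 0.0, 0.25]
pred = [0.1, 0.60, 0.1, 0.20]
score = jsd(gold, pred)  # smaller is better; 0 would be a perfect match
```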
How do we approach the tasks?
Despite architectural differences, almost all participants of STC-3 modeled ND and DQ as classification tasks. We adopt the same tactic for DialEval-1, but pay more attention to tokenization and optimization, because STC-3 showed that none of the participants' architectures outperformed the BiLSTM-with-GloVe baselines. Our approach is therefore simply fine-tuning state-of-the-art pre-trained models with well-established tricks of tokenization and optimization.
Based on our preliminary trials, we find that XLM-RoBERTa works well for both Simplified Chinese and English. Although the rationale hasn't been fully examined, we speculate that its sentencepiece-based unigram subwords may be helpful. Besides that, in order to simulate the turn structure of a dialogue, we not only utilize XLM-RoBERTa's special tokens, namely BOS (beginning of sentence; <s>), EOS (end of sentence; </s>), and SEP (separator of sentences; </s>), but also customize several tokens in fastai style to provide some minimal context.
For example, consider a tokenized turn below:
```
xxlen ▁3 <s> xxtrn ▁1 xxsdr ▁customer ▁@ ▁China ▁Uni com ▁Customer ▁Service ▁in ▁Gu ang dong ▁Shi t ! ▁What ▁is ▁your ▁staff ▁service ▁doing ▁on ▁earth ? ▁I ▁have ▁called ▁the ▁staff ▁service ▁for ▁3 ▁hours , ▁but ▁no ▁one ▁answer ▁my ▁phone ▁call . ▁It ▁is ▁no ▁wonder ▁that ▁customer ▁evaluation ▁is ▁so ▁bad . ▁Shi t ! ▁I ▁am ▁at ▁Kang le ▁Middle ▁Road . </s>
```
xxlen and xxtrn stand for the length of the dialogue in turns and the position of each turn within the dialogue, respectively; the numbers right next to them provide those features of the turn. The same trick goes with xxsdr, which indicates whether the sender is Customer or Helpdesk. When a turn's context says xxtrn ▁1 xxsdr ▁customer, the nugget type is almost definitely CNUG0.
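The context prefix can be produced with a trivial helper before tokenization. A sketch, assuming the helper name is mine and the real pipeline additionally registers xxlen/xxtrn/xxsdr as special tokens with the tokenizer:

```python
def with_context(n_turns, turn_idx, sender, utterance):
    """Prefix a turn with fastai-style context tokens before tokenization."""
    return f"xxlen {n_turns} xxtrn {turn_idx} xxsdr {sender} {utterance}"

# the sentencepiece tokenizer then splits this string into the
# ▁-prefixed subwords shown above
text = with_context(3, 1, "customer", "What is your staff service doing on earth?")
```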
As for DQ, a whole dialogue can be tokenized in a similar fashion, where xxlen could be useful for certain quality scores:
```
xxlen ▁3 <s> xxtrn ▁1 xxsdr ▁customer ▁@ ▁China ▁Uni com ▁Customer ▁Service ▁in ▁Gu ang dong ▁Shi t ! ▁What ▁is ▁your ▁staff ▁service ▁doing ▁on ▁earth ? ▁I ▁have ▁called ▁the ▁staff ▁service ▁for ▁3 ▁hours , ▁but ▁no ▁one ▁answer ▁my ▁phone ▁call . ▁It ▁is ▁no ▁wonder ▁that ▁customer ▁evaluation ▁is ▁so ▁bad . ▁Shi t ! ▁I ▁am ▁at ▁Kang le ▁Middle ▁Road . </s> </s> xxtrn ▁2 xxsdr ▁help desk ▁Hello ! ▁We ▁are ▁sorry ▁for ▁the ▁in con veni ence . ▁100 10 ▁is ▁our ▁service ▁hot ▁line . ▁We ▁may ▁not ▁answer ▁your ▁phone ▁call ▁during ▁the ▁busy ▁hour ▁of ▁tele traf fic . ▁We ▁sincer ely ▁ap ologi ze ▁for ▁that ! ▁What ▁can ▁I ▁do ▁for ▁you ? ▁Thank ▁you ! </s> </s> xxtrn ▁3 xxsdr ▁customer ▁The ▁Uni com ▁Internet ▁access ▁in ▁Z ha oq ing ▁Nur sing ▁School ▁can ' t ▁be ▁connected . ▁What ▁is ▁wrong ▁with ▁it ? ▁You ▁have ▁repair ed ▁it ▁for ▁the ▁whole ▁afternoon ▁in ▁the ▁area . ▁What ▁are ▁you ▁doing ▁on ▁earth ? ▁Shi t ! ▁Why ▁can ▁the ▁China ▁Mobile ▁service ▁hot line ▁be ▁got ▁through ? ▁Shi t ! ▁The ▁service ▁hot line ▁can ' t ▁be ▁got ▁through ▁the ▁whole ▁morning . ▁ 651 ▁I ▁bought ▁a ▁watch ▁last ▁year ▁and ▁the ▁service ▁hot line ▁can ' t ▁be ▁got ▁through ▁within ▁24 ▁hours . ▁I ▁won ' t ▁for give ▁you ! ▁No ▁phone ▁call ▁is ▁answered ! </s>
```
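Assembling the whole-dialogue input can be sketched like this. This is a simplified illustration under my own naming, assuming the tokenizer itself wraps and joins the segments with XLM-RoBERTa's <s> / </s> </s> markers as in the example above:

```python
def dialogue_text(turns):
    """Build the DQ input string from a list of (sender, utterance) pairs.

    The leading xxlen token carries the dialogue length; each segment
    carries its position (xxtrn) and sender (xxsdr).
    """
    parts = [
        f"xxtrn {i + 1} xxsdr {sender} {utt}"
        for i, (sender, utt) in enumerate(turns)
    ]
    return f"xxlen {len(turns)} " + " ".join(parts)

dlg = dialogue_text([
    ("customer", "The Unicom Internet access can't be connected."),
    ("helpdesk", "Hello! We are sorry for the inconvenience."),
])
```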
Thanks to the great work of HuggingFace, fastai, and blurr, a stable fine-tuning scheme enables us to rapidly trial-and-error towards a sufficiently good combination of hyper-parameters. For instance, the core steps for fine-tuning an ND model can be as short as this:
```python
dls = ...  # fastai's DataLoaders
lrnr = Learner(
    dls,
    HF_BaseModelWrapper(hf_model),  # blurr's HuggingFace model wrapper
    opt_func=partial(SOME_OPTIMIZER, decouple_wd=True),
    loss_func=LabelSmoothingCrossEntropyFlat(),
    metrics=[
        accuracy,
        partial(top_k_accuracy, k=2),
        F1Score(average='weighted'),
        MatthewsCorrCoef(),
        ...
    ],
    cbs=[HF_BaseModelCallback],
    splitter=hf_splitter,
    path=DATA_DIR,
).to_fp16()
lrnr.create_opt()
for ...:  # iteratively decrease base_lr and/or factor
    lrnr.fit_one_cycle(n_epoch, lr_max=slice(base_lr/factor, base_lr))
```
Admittedly, there are many moving parts in this fine-tuning scheme. After all, the most time-consuming step of fine-tuning is the Grad Student Algorithm (a.k.a. Grad Student Descent), i.e., figuring out a nice combination of magic numbers, a stable optimizer, a reasonable loss function, and other techniques such as discriminative training and mixed precision. Fortunately, with the help of slanted triangular learning rates, it only takes minutes to finish each run.
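The shape of that schedule can be sketched as follows. This is the ULMFiT formulation of slanted triangular learning rates; fastai's fit_one_cycle actually uses its own one-cycle variant, so treat this as illustrative rather than the exact code path:

```python
def slanted_triangular(t, T, lr_max, cut_frac=0.1, ratio=32):
    """Learning rate at step t of T: a short linear warm-up to lr_max,
    then a long linear decay back down to lr_max / ratio."""
    cut = int(T * cut_frac)                      # warm-up ends here
    p = t / cut if t < cut else 1 - (t - cut) / (T - cut)
    return lr_max * (1 + p * (ratio - 1)) / ratio

# warms up over the first 10 of 100 steps, peaks at 1e-4, then decays
lrs = [slanted_triangular(t, 100, 1e-4) for t in range(100)]
```

The aggressive warm-up gets the model to a good region quickly, and the long decay refines it, which is why each run converges in minutes rather than hours.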
One particular choice worth noting is that there is no need for gradual unfreezing. It works pretty well with AWD-LSTM, but its effect is insignificant when fine-tuning Transformers.
That's it?
Yes, mostly. In our experience, although there are more techniques to explore, the bottom line is that, unless we discover a substantially better architecture (for classification) and/or an alternative modeling perspective (one that is not simply classification), a good beginning (of tokenization and optimization) almost assures success.
While the official report and datasets won't be published until the end of 2020, the STC-3 datasets are available for anyone who wants to give it a shot: https://sakai-lab.github.io/stc3-dataset/
- Zeng, Kato, and Sakai: Overview of the NTCIR-14 Short Text Conversation Task: Dialogue Quality and Nugget Detection Subtasks, Proceedings of NTCIR-14, 2019.
- Zeng et al. (TBA): Overview of the NTCIR-15 Dialogue Evaluation Task, Proceedings of NTCIR-15, to appear, 2020.