Cross Validation Testing of Your Virtual Agent
Updated: Jun 14, 2020
Cross validation of a virtual agent essentially tests how good the 'knowledge' of your bot is. In other words, it tests how cleanly your intents are defined and whether there are any major overlaps between their definitions.
This type of testing can be mistaken for the performance testing that we discussed in another blog post (Performance testing of your Virtual Agent), but the two are very different. Here we will look at the difference between the K-Fold and Monte Carlo methods.
There are two major types of cross validation testing used in this field:
K-Fold Cross validation
Monte Carlo testing
Monte Carlo Cross validation
The test sets may overlap in some folds
Also known as repeated random sub-sampling validation, this method creates multiple random splits of the dataset into training and validation data. For each split, the model is fit to the training data, and predictive accuracy is assessed using the validation data. The results are then averaged over the splits. The advantage of this method (over k-fold cross validation) is that the proportion of the training/validation split does not depend on the number of iterations (i.e., the number of partitions). The disadvantage is that some observations may never be selected in the validation subsample, whereas others may be selected more than once. In other words, validation subsets may overlap.
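To make the idea concrete, here is a minimal sketch of repeated random sub-sampling in plain Python. It is not taken from the notebook: the function name monte_carlo_splits and its parameters (n_splits, test_fraction, seed) are assumptions chosen for illustration. It only produces the index splits; fitting and scoring the model on each split is left out.

```python
import random

def monte_carlo_splits(n_samples, n_splits=5, test_fraction=0.2, seed=0):
    """Yield (train_indices, test_indices) pairs for repeated random sub-sampling.

    Hypothetical helper for illustration; the names are not from the original post.
    """
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    indices = list(range(n_samples))
    for _ in range(n_splits):
        # Each iteration reshuffles from scratch, so validation sets are
        # drawn independently and may overlap between iterations.
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield sorted(shuffled[n_test:]), sorted(shuffled[:n_test])

# 10 utterances, 4 random splits, 30% held out for validation each time.
splits = list(monte_carlo_splits(n_samples=10, n_splits=4, test_fraction=0.3))
for train, test in splits:
    print("train:", train, "validation:", test)
```

Note that the size of the validation set (test_fraction) and the number of iterations (n_splits) are chosen independently, which is the advantage described above.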
K-fold Cross validation
The test sets never overlap across folds
K-Fold cross validation is one of several similar model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.
In k-fold cross-validation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data.
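The partitioning described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the notebook's code: the function name k_fold_splits and its parameters are assumptions. When the number of samples is not divisible by k, the folds here differ in size by at most one.

```python
import random

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation.

    Hypothetical helper for illustration; the names are not from the original post.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Deal the shuffled indices into k folds, like dealing cards.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # Training data is everything outside the current fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(test)

# 10 utterances split into 5 folds of 2 validation samples each.
kfold_splits = list(k_fold_splits(n_samples=10, k=5))
for train, test in kfold_splits:
    print("train:", train, "validation:", test)
```

Because the data is partitioned once and each fold is used as validation data exactly once, every sample is tested exactly one time and the validation sets never overlap.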
I have created a Jupyter Notebook that guides you step by step through K-fold testing of your virtual agent (Watson Assistant based).
I talked about the life cycle of a chatbot at an event in London. Take a look at the video presentation.
Example of the output of the notebook
Confusion Matrix: this is a way to show the results of your testing in one visualization. The darker the diagonal, the better! A dark diagonal means that each test phrase triggers the intent it was supposed to trigger, i.e. Actual intent = Predicted intent.
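As a rough illustration of what sits behind such a visualization, the sketch below counts (actual, predicted) intent pairs in plain Python. The intent names and the sample results are made up for the example, and the helper confusion_matrix is an assumption, not the notebook's implementation.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Return a matrix where rows are actual intents and columns are predicted intents."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

# Made-up test results: the intent each phrase should trigger vs. the intent returned.
labels = ["greeting", "billing", "goodbye"]
actual = ["greeting", "greeting", "billing", "billing", "billing", "goodbye"]
predicted = ["greeting", "billing", "billing", "billing", "goodbye", "goodbye"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>10}", row)

# The diagonal holds the correct predictions (Actual intent = Predicted intent).
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(actual)
print("accuracy:", round(accuracy, 2))
```

Off-diagonal cells show which intents get confused with each other, which is where overlapping intent definitions tend to show up.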