
Step 4: Identify NLP-related Issues and Solutions

The Botium Coach Dashboard visualizes the NLP performance metrics and suggests steps for improving them. It highlights any test data that did not return the expected intent, returned the expected intent with a low confidence score, or returned the expected intent with a confidence score close to that of another intent.


By default, Botium Coach penalizes every user example that was not predicted as expected by assigning it a confidence score of 0.0. To begin with, you should skip this penalty by activating the corresponding switch in the Botium Coach Dashboard Settings.
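To illustrate why the penalty matters, the following minimal sketch compares an average confidence score with and without it. The data, the function name and the use of a plain average are assumptions made for illustration; how Botium Coach aggregates scores internally is not described here.

```python
from statistics import mean

# Hypothetical test results: (expected intent, predicted intent, confidence)
results = [
    ("greet", "greet", 0.9),
    ("greet", "goodbye", 0.8),   # mismatch: predicted intent differs from expected
    ("goodbye", "goodbye", 0.7),
]

def average_confidence(results, penalize_mismatches=True):
    """Average the confidence scores, optionally scoring mismatches as 0.0."""
    scores = []
    for expected, predicted, confidence in results:
        if penalize_mismatches and expected != predicted:
            scores.append(0.0)         # the penalty described above
        else:
            scores.append(confidence)  # keep the score returned by the engine
    return mean(scores)

print(average_confidence(results, penalize_mismatches=True))
print(average_confidence(results, penalize_mismatches=False))
```

With the penalty active, a single mismatch pulls the average down noticeably, which can hide how confident the engine actually was about its (wrong) prediction.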

1st Glance: The Attention Box

The Attention Box shows any alarming events Botium Coach was able to identify:

  • Predicted intent doesn’t match the expected intent

  • Entities have not been recognized

  • Test data is not suitable for analyzing with Botium Coach

Clicking on the message shows the detailed records Botium Coach identified as the source of the problems.


Issues with the CORRECTNESS of the test results will be visualized here.

In this case, there are 10 user examples for which Rasa predicted an intent other than the expected one. This usually means that the training data for the NLU engine has to be refined further by adding more user examples to the expected intent, and possibly removing similar user examples from the incorrectly predicted intent (see Step 5: Annotate and Augment Training Dataset).

2nd Glance: The Intent Confidence Distribution Chart

This histogram tells us that we have some poor-performing user examples in our test session - meaning the NLP engine returned a low confidence score for them.


Issues with the CONFIDENCE of the test results will be visualized here.

Click on the leftmost bar of the chart to see the poor-performing user examples, the expected and the predicted intent, and the confidence score.


A low confidence score usually means that the NLP engine was not able to properly predict the intent for a user example, often resulting in the infamous “Sorry, I don’t understand” response.
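The idea behind such a histogram can be sketched as follows. This is a minimal illustration that assumes ten equal-width buckets over the range 0.0 to 1.0; the bucket layout of the actual chart and the function name are assumptions.

```python
from collections import Counter

def confidence_histogram(confidences, bins=10):
    """Count confidence scores per equal-width bucket over [0, 1]."""
    counts = Counter()
    for confidence in confidences:
        # A score of exactly 1.0 is placed in the last bucket.
        index = min(int(confidence * bins), bins - 1)
        counts[index] += 1
    return [counts[i] for i in range(bins)]

# Hypothetical confidence scores of one test session
scores = [0.05, 0.12, 0.55, 0.91, 0.97, 1.0]
print(confidence_histogram(scores))
```

The first entries of the returned list correspond to the leftmost bars of the chart, which is where the poor-performing user examples end up.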

Read here to know more: Intent Confidence Distribution

3rd Glance: The Top 10 Intent Confidence Risks

The radar chart tells us we have several poor-performing intents - the average confidence score is rather low. Click on one of the intents in the chart to see why it performs so poorly.

Issues with the CONFIDENCE of the test results will be visualized here.

You can now see a list of user examples and the (poor) confidence score returned by the NLP engine.

Read here to know more: Intent Confidence Risks

4th Glance: Confusion Matrix and Confidence Threshold Chart

A Confusion Matrix shows an overview of the predicted intent vs the expected intent. It answers questions like “When sending user example X, I expect the NLU to predict intent Y. What did it actually predict?”
The expected intents are shown as rows, the predicted intents are shown as columns. Each user example is sent to the NLU engine, and the cell at the expected intent row and the predicted intent column is increased by 1. Whenever the predicted and expected intents match, a cell on the diagonal is increased; these are our successful test cases. All cells off the diagonal count our failed test cases.
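The counting procedure described above can be sketched in a few lines. The intent names and the nested-dictionary representation are assumptions chosen for illustration only.

```python
from collections import defaultdict

def confusion_matrix(pairs):
    """Build matrix[expected][predicted] = number of user examples."""
    matrix = defaultdict(lambda: defaultdict(int))
    for expected, predicted in pairs:
        matrix[expected][predicted] += 1  # row = expected, column = predicted
    return matrix

# Hypothetical (expected intent, predicted intent) pairs
pairs = [
    ("greet", "greet"),
    ("greet", "goodbye"),
    ("goodbye", "goodbye"),
    ("goodbye", "goodbye"),
]

matrix = confusion_matrix(pairs)
for expected, row in matrix.items():
    print(expected, dict(row))
```

Cells where the row and column name are equal lie on the diagonal (successful test cases); every other non-zero cell is a failed test case.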

The most used statistical measures of NLU performance are precision and recall:

  • The question answered by the precision score is: How many of the predictions of an intent are correct?

  • The question answered by the recall rate is: How many of the user examples belonging to an intent are correctly predicted as that intent?
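As a minimal sketch, both measures can be computed per intent from pairs of expected and predicted intents. The function name and sample data are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not necessarily what Botium Coach does.

```python
def precision_recall(pairs, intent):
    """Return (precision, recall) for one intent from (expected, predicted) pairs."""
    true_positives = sum(1 for e, p in pairs if e == intent and p == intent)
    predicted_count = sum(1 for _, p in pairs if p == intent)  # column total
    expected_count = sum(1 for e, _ in pairs if e == intent)   # row total
    precision = true_positives / predicted_count if predicted_count else 0.0
    recall = true_positives / expected_count if expected_count else 0.0
    return precision, recall

# Hypothetical (expected intent, predicted intent) pairs
pairs = [
    ("greet", "greet"),
    ("greet", "goodbye"),
    ("goodbye", "goodbye"),
    ("goodbye", "goodbye"),
]

print(precision_recall(pairs, "greet"))
print(precision_recall(pairs, "goodbye"))
```

In this sample, every prediction of greet is correct (high precision), but only half of the greet user examples were recognized as greet (lower recall).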

Read here to know more: Confusion Matrix / Precision / Recall / F1-Score

The confidence threshold is the lowest accepted confidence score. If the NLP engine is not sure enough when classifying an intent (its confidence score is below the confidence threshold), it answers with an incomprehension intent to show that it doesn’t understand. This chart helps in finding the best confidence threshold for your use case: it visualizes the balance between the precision and recall scores, and depending on your use case one or the other may take priority.
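The trade-off can be sketched by evaluating several thresholds over the same test results. This is a simplified, overall (not per-intent) view under stated assumptions: predictions below the threshold are treated as "not understood" and therefore count as missed, and the function name and data are hypothetical.

```python
def precision_recall_at_threshold(results, threshold):
    """results: (expected, predicted, confidence) tuples.

    Predictions with a confidence below the threshold are rejected.
    """
    accepted = [(e, p) for e, p, c in results if c >= threshold]
    correct = sum(1 for e, p in accepted if e == p)
    # With no accepted predictions, precision is reported as 0.0 by convention here.
    precision = correct / len(accepted) if accepted else 0.0
    recall = correct / len(results) if results else 0.0
    return precision, recall

# Hypothetical test results
results = [
    ("greet", "greet", 0.9),
    ("greet", "goodbye", 0.6),
    ("goodbye", "goodbye", 0.7),
    ("goodbye", "greet", 0.4),
]

for threshold in (0.0, 0.5, 0.8):
    print(threshold, precision_recall_at_threshold(results, threshold))
```

Raising the threshold filters out low-confidence mistakes, which tends to increase precision, but it also rejects correct low-confidence predictions, which lowers recall.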

Read here to know more: Confidence Threshold

5th Glance: The Botium Coach Suggestions

Botium Coach will detect any issues with the test results and suggest actions which will improve the overall NLU performance. It will tell you which intents require more training data, and whether any test data is unsuitable for NLU testing.

What else?

Botium Coach visualizes some more useful metrics. Explore on your own!

Mismatch Probability Risks

This section shows some charts visualizing the risk that some intents will be mismatched, meaning that the NLU engine predicts the correct intent, but with a confidence score very close to that of another intent. In real life, a chatbot in this situation often responds with something like “I am not sure what you mean. Do you mean X or Y?” (in IBM Watson, this is called disambiguation).


Issues with the CLARITY of the test results will be visualized here.

Clicking in the radar chart shows the list of intents with the confidence score predicted by the NLU engine - this only works if the NLU engine actually returns an alternate intents list.
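One simple way to think about this risk is the gap between the best and the second-best confidence score in the alternate intents list. The sketch below is an illustration only; the function name, the data layout and the use of a plain difference are assumptions, not the formula used by Botium Coach.

```python
def confidence_margin(ranked_intents):
    """ranked_intents: list of (intent, confidence), predicted intent plus alternates.

    Returns the gap between the two highest confidence scores, or None if the
    engine returned no alternate intents.
    """
    ordered = sorted(ranked_intents, key=lambda item: item[1], reverse=True)
    if len(ordered) < 2:
        return None
    return ordered[0][1] - ordered[1][1]

# Hypothetical engine response for a single user example
response = [("order_pizza", 0.52), ("order_pasta", 0.48), ("greet", 0.10)]
print(confidence_margin(response))
```

A small margin means the two top intents are hard to tell apart for this user example, even though the predicted intent may be the expected one.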

There are also charts showing the similarity of two intents based on the alternate intents lists returned by all user examples.

Read here to know more: Intent Mismatch Probability

Confidence Deviation Risks

The confidence deviation is a measure of the spread of the predicted confidence scores across all user examples of an intent. It is calculated as the standard deviation of the confidence scores.
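A minimal sketch of this calculation, grouping the scores by intent, could look like this. It assumes the population standard deviation (statistics.pstdev); whether Botium Coach uses the population or the sample variant is not stated here, and the function name and data are hypothetical.

```python
from collections import defaultdict
from statistics import pstdev

def deviation_per_intent(results):
    """results: (intent, confidence) tuples. Returns {intent: standard deviation}."""
    by_intent = defaultdict(list)
    for intent, confidence in results:
        by_intent[intent].append(confidence)
    # Assumption: population standard deviation over all scores of an intent.
    return {intent: pstdev(scores) for intent, scores in by_intent.items()}

# Hypothetical confidence scores per predicted user example
results = [
    ("greet", 0.9), ("greet", 0.9), ("greet", 0.9),
    ("goodbye", 0.2), ("goodbye", 0.8),
]
print(deviation_per_intent(results))
```

An intent whose user examples all receive similar scores has a deviation near zero, while widely scattered scores produce a high deviation.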


Issues with the CLARITY of the test results will be visualized here.

Read here to know more: Intent Confidence Deviation

Utterance Distribution

This histogram shows the number of utterances per predicted intent.

Read here to know more: Intent Utterance Distribution


You can download the test results as CSV, JSON and Excel for further processing. The list contains:

  • user examples

  • predicted intent and confidence score

  • extracted entities and confidence score

  • expected intent and entities