Unfortunately, human judgement is often the final word on whether a dialogue system is accepted, and it is not fully determined by objective measures of the system. Even where a relation with objective measures exists, it can be irrational. For example, a system that makes frequent speech recognition errors because it uses a freer, user-initiative dialogue strategy might be perceived as being of poor quality, even though that strategy improves the execution time of the system. As a result, human judgement, which is comparatively expensive to obtain, must often be used in combination with objective measures.
One attempt to relate objective measures to human judgement is the PARADISE framework. This framework uses a set of objective measures of the dialogue, such as task completion, execution time, response time, and number of errors. In user trials, a quantitative measure of performance is obtained from human judges. It is then supposed that this performance quantity is a weighted sum of the objective measures. For each user trial, a weighted sum of the objective measures is equated to the judge's rating, and the set of weights that minimises the error over the trial data is then obtained. The weights obtained in this way were examined for two dialogue systems. It was found that task completion, response time, and elapsed time for the dialogue generally obtained the greatest weights. These experiments also showed that users irrationally value recognition accuracy over elapsed time, even though they ought to value only task completion and elapsed time.
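The weight-fitting step described above can be illustrated with an ordinary least-squares fit. The following is a minimal sketch, not the PARADISE procedure itself: the function names, the choice of three measures, and the trial data are all invented for illustration, and the measures are used unscaled. Rescaling them to a common range first would make the resulting weights directly comparable.

```python
from typing import List, Sequence


def fit_weights(measures: Sequence[Sequence[float]],
                ratings: Sequence[float]) -> List[float]:
    """Return weights w minimising sum_i (ratings[i] - w . measures[i])^2."""
    n = len(measures[0])
    # Build the normal equations (X^T X) w = X^T y.
    a = [[sum(row[j] * row[k] for row in measures) for k in range(n)]
         for j in range(n)]
    b = [sum(row[j] * y for row, y in zip(measures, ratings))
         for j in range(n)]
    # Solve by Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("measures are linearly dependent")
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for k in range(col, n):
                a[r][k] -= factor * a[col][k]
            b[r] -= factor * b[col]
    # Back substitution.
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(a[r][k] * w[k] for k in range(r + 1, n))
        w[r] = (b[r] - tail) / a[r][r]
    return w


def predict(weights: Sequence[float], row: Sequence[float]) -> float:
    """Predicted performance: the weighted sum of the objective measures."""
    return sum(w * m for w, m in zip(weights, row))


# Hypothetical trials: (task completed 0/1, elapsed seconds, error count).
TRIALS = [
    [1, 60, 0],
    [1, 120, 2],
    [0, 90, 3],
    [1, 45, 1],
    [0, 150, 1],
]
# Hypothetical judge ratings, constructed from the weights (4.0, -0.01, -0.5)
# so that the fit can be checked against a known answer.
RATINGS = [3.4, 1.8, -2.4, 3.05, -2.0]

weights = fit_weights(TRIALS, RATINGS)
```

With real trial data the fit would not be exact, and the residual error indicates how much of the judges' ratings the chosen objective measures fail to explain.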
Where suitable objective measures can be found for a dialogue system, automatic training of the system becomes possible without the need for human judgements. For example, a reinforcement learning system would be able to train on dialogues with users by calculating the measures at the end of each example dialogue. For a planned approach to dialogue, there is a further requirement that the objective measures be compositional over the plan structure, in that the measure for a plan is equal to the sum of the measures for the acts in the plan. This is because a planner, in contrast to reinforcement learning, searches for plans rather than reinforcing plans that appear in the training data, and so the planner must be able to predict the value of the plans in the search space. This requirement could be a disadvantage, but the measures given in the previous paragraph are compositional in that they are additive over the acts in the plan structure. Task completion is a function of the plan structure that is obtained at the close of the dialogue. Response time is a function of the act that the system chooses at its turn. Elapsed time is the sum, over the system and user acts, of the time taken to execute each dialogue act. The time for an act might be easy to predict from measures such as the number of words in the utterance that corresponds to the act, or from recorded examples of the act's use in real dialogues.
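The compositionality requirement can be made concrete with a small sketch in which the elapsed time of a plan is the sum of the elapsed times of its acts and subplans, so that a planner can score a candidate plan before it is ever executed. Everything here is an assumption made for illustration: the `Act` and `Plan` classes, the `cheapest` helper, the example utterances, and the fixed rate of 0.5 seconds per word used to predict the duration of an act.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed speaking rate, chosen only for illustration.
SECONDS_PER_WORD = 0.5


@dataclass
class Act:
    speaker: str      # "system" or "user"
    utterance: str    # predicted wording of the dialogue act

    def elapsed_time(self) -> float:
        # Predict the duration of the act from its word count.
        return len(self.utterance.split()) * SECONDS_PER_WORD


@dataclass
class Plan:
    acts: List[Act] = field(default_factory=list)
    subplans: List["Plan"] = field(default_factory=list)

    def elapsed_time(self) -> float:
        # Compositional measure: the plan's value is the sum over its parts.
        return (sum(act.elapsed_time() for act in self.acts)
                + sum(plan.elapsed_time() for plan in self.subplans))


def cheapest(plans: List[Plan]) -> Plan:
    """Pick the candidate plan with the lowest predicted elapsed time."""
    return min(plans, key=Plan.elapsed_time)


# Two hypothetical candidate plans for the same task.
directed = Plan(acts=[
    Act("system", "Which city are you leaving from?"),
    Act("user", "London"),
    Act("system", "And where are you going to?"),
    Act("user", "Paris"),
])
open_ended = Plan(acts=[
    Act("system", "How can I help?"),
    Act("user", "I want to go from London to Paris"),
])
# A larger plan built from the two smaller ones.
combined = Plan(subplans=[directed, open_ended])

best = cheapest([directed, open_ended])
```

Because the measure is additive, the value of `combined` is simply the sum of the values of its two subplans, which is what allows a planner to evaluate partial plans during search. A measure that depended on the dialogue as a whole could not be predicted in this incremental way.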