Using a policy has the advantage that the true reward of dialogues can be used to train the system; when the dialogue is planned instead, the reward is estimated from the value of the goal and the costs accrued by each action in the plan. The downside of using a policy, however, is the cost of the exploration needed to obtain training data. In reinforcement learning, an agent must strike a balance between exploration and exploitation. Using softmax selection, an agent occasionally tries actions that are not currently optimal, so that examples can be collected to reinforce those actions; as more and more examples are collected, the agent increasingly exploits the optimal action rather than exploring. Softmax selection is also useful for cutting down the complexity of the state space, which might otherwise be the combinatorial composition of the states of many beliefs. A good example of using dialogue policies is that of Walker et al., who use the PARADISE evaluation framework to compute the utility of the dialogues.
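The exploration–exploitation trade-off described above can be sketched as follows. This is a minimal illustration of softmax (Boltzmann) action selection, not an implementation of any system discussed here; the function names and value estimates are assumptions for the example.

```python
import math
import random

def softmax_probs(q_values, temperature=1.0):
    """Boltzmann (softmax) distribution over actions: higher-valued
    actions are more likely, but no action's probability is zero, so
    currently non-optimal actions are still explored."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_select(q_values, temperature=1.0):
    """Sample an action index from the softmax distribution."""
    probs = softmax_probs(q_values, temperature)
    return random.choices(range(len(q_values)), weights=probs, k=1)[0]
```

Lowering the temperature as more training examples accumulate shifts the agent from exploration towards exploitation, matching the behaviour described above.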
Reinforcement learning is an attractive approach to deciding dialogue strategies, and it is important to contrast it with the planned approach taken in this thesis, since both can adapt to users by training on their dialogues. For the planner to be competitive with reinforcement learning, two qualities are important. First, the planner should be as easy to use as a reinforcement learning system, and the next chapter will show this to be the case. The second quality is performance, that is, the quality of the dialogues produced given a certain amount of training material. While reinforcement learning is very useful for problems with limited numbers of states that are well covered by training data, planning is useful where there are many more states, leading to the problem of making a good decision in novel situations. The arguments for and against planning and reinforcement learning in robot planning carry over to dialogue planning, especially where the dialogue is non-routine, such as a meta-level negotiation over a robot plan.
The model used for basic reinforcement learning is the Markov Decision Process (MDP), in which the agent is assumed to observe its current state exactly. Just as in robot planning, where actions and observations are uncertain, dialogue planning must accommodate uncertainty, since errors occur in the speech recognition process. For this reason, several researchers have addressed the use of Partially Observable Markov Decision Processes (POMDPs) in dialogue planning. In a POMDP, actions have a probability distribution over effects, and states give rise to a probability distribution over observations. Since the agent does not know which state it is in, reinforcement is more difficult, and POMDPs can be difficult to train. Roy et al. used a POMDP to deal with speech recogniser uncertainty in a speech-controlled robot, showing a significant improvement in performance when uncertainty in the belief state of the robot is accommodated. Zhang et al. address the state complexity issue by using a Bayesian network to map several state variables into one.
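The core of a POMDP agent is the belief-state update: after acting and observing, the agent revises its probability distribution over hidden states. The sketch below illustrates one Bayes-filter step under assumed transition and observation tables; the two-state dialogue scenario (did the user say "yes" or "no"?) is a hypothetical example, not drawn from the systems cited above.

```python
def belief_update(belief, action, observation, T, O):
    """One POMDP belief-tracking step: b'(s') is proportional to
    O[a][s'][o] * sum over s of T[a][s][s'] * b(s).
    T[a][s][s2] gives transition probabilities, O[a][s2][o] gives
    observation likelihoods (e.g. noisy speech-recogniser output)."""
    n = len(belief)
    new_belief = []
    for s2 in range(n):
        # Predict the chance of landing in s2, then weight by how well
        # s2 explains the observation we actually received.
        predicted = sum(T[action][s][s2] * belief[s] for s in range(n))
        new_belief.append(O[action][s2][observation] * predicted)
    total = sum(new_belief)
    # Normalise; if the observation was impossible, keep the old belief.
    return [b / total for b in new_belief] if total > 0 else belief
```

Starting from complete uncertainty between two user intents, an observation that is more likely under one state shifts probability mass towards that state, which is exactly the accommodation of recogniser uncertainty described above.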