An Alternative to Expected Reward Maximization

Name
Andre Litvin
Abstract
Reinforcement learning problems can generally be described as follows. The user quantifies how good each state of some system would be according to their preferences, and some agent, e.g. a robot, must choose actions that lead to the states the user defined as good. More formally, for each state and action the user picks a real-valued reward, and the goal of reinforcement learning is to automatically find a strategy, called a policy, that leads to a high reward sum. However, actions often do not determine states, but only make some states likelier than others. In this case, the policy is usually chosen by maximizing the expected reward. In this thesis, I prove that for every probability p < 1 and constant c > 0, there exists a reinforcement learning problem where the policy maximizing expected reward gives reward sum Z°, but another policy would give reward sum Z, where P[Z > Z° + c] > p. In other words, the policy maximizing expected reward can obtain an arbitrarily smaller reward sum with arbitrarily high probability (short of 1) compared to another policy. This might not be a desirable property for a policy to have. In this thesis, I define the smoothened median of a random variable and prove that any policy maximizing the smoothened median of the reward sum (instead of its expectation) does not have this property.
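To make the kind of problem the theorem asserts concrete, here is a minimal illustrative construction of my own (not taken from the thesis, whose actual construction and parameters may differ): a one-step problem with two actions. Action A pays a very large reward M with small probability 1 - q and 0 otherwise, so its expectation M(1 - q) can be made arbitrarily large; action B pays c + 1 deterministically. Maximizing expected reward selects A, yet B exceeds A by more than c with probability q, which can be chosen above any p < 1. The short Python sketch below simulates this; all names and values are hypothetical.

```python
import random

# Hypothetical one-step problem illustrating the abstract's claim.
# Action A: reward M with probability (1 - q), else 0  -> maximizes expectation
# Action B: reward c + 1 deterministically             -> beats A by > c with probability q
p, c = 0.99, 10.0    # target probability bound p < 1 and margin c > 0
q, M = 0.999, 1e6    # choose q > p and M large enough that E[A] = M * (1 - q) > c + 1

def reward_A():
    return M if random.random() < 1 - q else 0.0

def reward_B():
    return c + 1

trials = 100_000
wins = sum(reward_B() > reward_A() + c for _ in range(trials))
print(f"E[A] = {M * (1 - q):.1f}  >  E[B] = {c + 1:.1f}")
print(f"Empirical P[B > A + c] = {wins / trials:.4f}  (> p = {p})")
```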
Graduation Thesis language
Estonian
Graduation Thesis type
Bachelor - Computer Science
Supervisor(s)
Raul Vicente, PhD
Defence year
2023