Wasserstein Policy Learning for Distributional Outcomes
Offline policy learning is studied for distribution-valued outcomes, where each potential outcome is a probability measure on R. The reward is defined by applying a utility functional, which is built on the Wasserstein distance, to each potential outcome. The goal is to learn a policy that maximizes the empirical welfare, defined as the mean of the resulting scalar-valued rewards.
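The setup above can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration rather than the method of the work itself: the utility is taken to be the negative squared 2-Wasserstein distance to a hypothetical reference measure, each distributional outcome is represented by an equal-sized sample, and welfare is estimated by inverse propensity weighting with known propensities. The function names `wasserstein2_1d`, `utility`, and `ipw_welfare` are likewise hypothetical.

```python
import numpy as np


def wasserstein2_1d(x, y):
    """2-Wasserstein distance between two empirical measures on R.

    Assumes equal sample sizes and uniform weights, in which case the
    optimal coupling on the real line matches sorted samples in order.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.sqrt(np.mean((x - y) ** 2)))


def utility(outcome_samples, reference_samples):
    """Hypothetical utility functional: negative squared W2 distance
    from the outcome distribution to a reference distribution."""
    return -wasserstein2_1d(outcome_samples, reference_samples) ** 2


def ipw_welfare(policy_actions, observed_actions, rewards, propensities):
    """Inverse-propensity-weighted estimate of empirical welfare.

    Assumed estimator (not taken from the source): a unit contributes its
    scalar reward, reweighted by 1 / propensity, only when the policy's
    action agrees with the action logged in the offline data.
    """
    policy_actions = np.asarray(policy_actions)
    observed_actions = np.asarray(observed_actions)
    rewards = np.asarray(rewards, dtype=float)
    propensities = np.asarray(propensities, dtype=float)
    match = (policy_actions == observed_actions).astype(float)
    return float(np.mean(match * rewards / propensities))
```

In use, each logged unit's outcome sample would first be mapped to a scalar with `utility`, and candidate policies would then be compared by their `ipw_welfare` values on the same logged data.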
Key takeaways
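- Offline policy learning is studied for distribution-valued outcomes, where each potential outcome is a probability measure on R.
- The Wasserstein distance is used to define the reward.
- A utility functional applied to the potential outcomes yields the scalar reward whose mean is the empirical welfare.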