Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation

Source

Core Claim

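Open Bandit Dataset provides logged bandit feedback collected on ZOZOTOWN, a fashion e-commerce platform. Each logged round records the action taken, the observed reward, and the logging policy's propensity for that action, which together are the inputs off-policy evaluation requires.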

Action-Time-Series Notes

  • It has explicit actions and propensities, but its temporal dynamics are weaker than those of full trajectory datasets.
  • It is best viewed as contextual action-response data rather than a rich world-model dataset.
  • It is useful for testing causal/off-policy pieces of an action-conditioned modeling stack.
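
To make the off-policy point in these notes concrete, here is a minimal sketch of inverse propensity weighting (IPW) and its self-normalized variant, computed from logged rewards and propensities. It does not use the Open Bandit Pipeline API; the function names and the four-round toy log are hypothetical, and the target policy's propensities for the logged actions are assumed to be known.

```python
import numpy as np


def importance_weights(target_propensities, logging_propensities):
    """Ratio of target-policy to logging-policy probability for each logged action."""
    target = np.asarray(target_propensities, dtype=float)
    logging = np.asarray(logging_propensities, dtype=float)
    if np.any(logging <= 0):
        raise ValueError("logging propensities must be strictly positive")
    return target / logging


def ipw_estimate(rewards, target_propensities, logging_propensities):
    """IPW estimate of the target policy's value: mean of weight * reward."""
    w = importance_weights(target_propensities, logging_propensities)
    r = np.asarray(rewards, dtype=float)
    return float(np.mean(w * r))


def snipw_estimate(rewards, target_propensities, logging_propensities):
    """Self-normalized IPW: divides by the sum of weights instead of the sample size."""
    w = importance_weights(target_propensities, logging_propensities)
    r = np.asarray(rewards, dtype=float)
    return float(np.sum(w * r) / np.sum(w))


# Hypothetical toy log (not real Open Bandit Dataset records):
# one entry per logged round, all values refer to the action actually taken.
rewards = [1, 0, 1, 0]                         # e.g. click / no click
logging_propensities = [0.5, 0.5, 0.25, 0.25]  # logging policy's probability
target_propensities = [0.25, 0.5, 0.5, 0.25]   # assumed target policy's probability

ipw_value = ipw_estimate(rewards, target_propensities, logging_propensities)
snipw_value = snipw_estimate(rewards, target_propensities, logging_propensities)
print(ipw_value, snipw_value)
```

Both estimators need only the per-round action, reward, and propensity, and no state transitions. That is why contextual action-response data of this kind is sufficient for exercising the off-policy component in isolation.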