The paper here says that perplexity, as an automated metric, is not reliable for open-domain text generation tasks, and instead uses lm-score, a model-based metric that produces perplexity-like values. What additional benefits does lm-score offer over the perplexity metric?
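For context, here is a minimal sketch of how a perplexity value and a perplexity-like, model-based score could each be computed from per-token probabilities. The function names, the probability values, and the exact form of the lm-score are my assumptions for illustration, not definitions taken from the paper:

```python
import math

def perplexity(token_probs):
    # Standard perplexity: exponential of the average negative
    # log-probability the model assigns to each token.
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

def lm_score(token_probs):
    # A perplexity-like value: average log-probability of the text
    # under an external language model (assumed form; the paper's
    # exact definition may differ).
    return sum(math.log(p) for p in token_probs) / len(token_probs)

probs = [0.5, 0.25, 0.125]  # hypothetical next-token probabilities
print(perplexity(probs))  # 4.0 for these probabilities
print(lm_score(probs))
```

The key difference in practice is that a score like this can come from a separate, fixed evaluator model, so it can rate text from any generator on a common scale, whereas perplexity is tied to the probabilities of the model being evaluated.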