Chain-of-thought (CoT) reasoning in vision language
models (VLMs) is crucial for improving
interpretability and trustworthiness. However,
current training recipes often relying on
datasets dominated by short annotations with
minimal rationales. In this work, we show that
training VLM on short answers leads to poor
generalization on reasoning tasks that require
more detailed explanations. To address this limitation,
we propose a two-stage post-training
strategy that extends the usage of short answer
data for enhanced CoT reasoning. First, we
augment short answers with CoT reasoning
generated by…
Source: Read MoreÂ