TACO

Learning Multi-modal Action Models
with Synthetic Chains-of-Thought-and-Action

1 University of Washington  2 Salesforce AI Research

Abstract

We present TACO, a family of multi-modal large action models designed to improve performance on complex, multi-step, multi-modal tasks. During inference, TACO produces chains-of-thought-and-action (CoTA), executes intermediate steps by invoking external tools such as OCR, depth estimation, and a calculator, and then integrates both the thoughts and the action outputs to produce coherent responses. To train TACO, we create a large dataset of over 1M synthetic CoTA traces generated with GPT-4o and Python programs. We then experiment with various data filtering and mixing techniques and obtain a final subset of 293K high-quality CoTA examples. This dataset enables TACO to learn complex reasoning and action paths, surpassing existing models trained on instruction-tuning data with only direct answers. TACO outperforms the instruction-tuned baseline across 8 benchmarks, achieving a 3.6% improvement on average, with gains of up to 15% on MMVet tasks involving OCR, mathematical reasoning, and spatial reasoning. Training on high-quality CoTA traces sets a new standard for complex multi-modal reasoning, highlighting the need for structured, multi-step instruction tuning to advance open-source multi-modal models' capabilities.

Figure 1. TACO outputs a Chain-of-Thought-and-Action (CoTA) and answers challenging questions based on the thoughts and action outputs, whereas existing multi-modal large language models can only output direct answers and often fail to reach the correct answers.
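To make the inference procedure concrete, here is a minimal Python sketch of a CoTA-style loop. The trace schema, the model.generate interface, and the tool/action names (including Terminate) are illustrative assumptions rather than the released TACO implementation: the model alternates thoughts and actions, each tool call's output is fed back as an observation, and a final Terminate action carries the answer.

    # Minimal sketch of a chain-of-thought-and-action loop (assumed interface,
    # not the released TACO code).
    def run_cota(model, image, question, tools, max_steps=8):
        """Alternate thoughts and actions until the model emits Terminate."""
        trace = [{"role": "user", "image": image, "question": question}]
        for _ in range(max_steps):
            # Hypothetical model API: returns {"thought": ..., "action": ..., "args": ...}
            step = model.generate(trace)
            trace.append({"role": "assistant", **step})
            if step["action"] == "Terminate":          # final answer reached
                return step["args"]["answer"], trace
            # Execute the requested tool (e.g. OCR, depth estimation, calculator)
            # and append its output as an observation for the next step.
            observation = tools[step["action"]](image, **step["args"])
            trace.append({"role": "observation", "content": observation})
        return None, trace                             # no answer within budget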

Additional Examples

CoTA Dataset

CoTA Data Generation


Figure 2. We illustrate our model-based data generation (top) and programmatic generation (bottom) pipelines.

In model-based generation, we take existing image and QA pairs as inputs and prompt GPT-4o to Generate either a chain-of-thought-and-action (CoTA) or a chain-of-thought (CoT) without actions to answer the questions. Then, we Verify that the chains lead to the correct final answers and Parse successfully; if not, we convert them into the direct-answer (Direct) format with the ground-truth answers.
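A minimal sketch of this Generate-Verify-Parse loop is shown below; call_gpt4o, parse_cota, and the returned fields are hypothetical placeholders, and the actual prompts and parsing rules are those described in the paper rather than this sketch.

    # Sketch of model-based CoTA generation with a Direct-answer fallback
    # (assumed helpers, not the paper's exact pipeline code).
    def make_training_example(image, question, gt_answer, prompt_template):
        raw = call_gpt4o(prompt_template.format(question=question), image)   # Generate
        try:
            chain = parse_cota(raw)                                          # Parse
        except ValueError:
            chain = None
        if chain is not None and chain.final_answer == gt_answer:            # Verify
            fmt = "CoTA" if chain.has_actions else "CoT"
            return {"format": fmt, "target": chain.to_text()}
        # Otherwise convert to the direct-answer (Direct) format
        # with the ground-truth answer as the target.
        return {"format": "Direct", "target": gt_answer}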

In programmatic generation, we first Annotate images with human labelers or models, and then use the dense annotations to fill in manually written templates and Generate QA pairs and the corresponding CoTA with Python programs, as in the sketch below.
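The following illustrative program builds one counting example from object annotations; the template wording and the GetObjects/Terminate action names are assumptions chosen to mirror Figure 2, not the paper's exact templates.

    # Illustrative programmatic QA + CoTA generation from dense annotations
    # (template text and action names are assumptions).
    QA_TEMPLATE = "How many {category} are in the image?"

    def generate_counting_example(annotations, category):
        """Build a QA pair plus a matching CoTA target from object annotations."""
        boxes = [a for a in annotations if a["label"] == category]
        question = QA_TEMPLATE.format(category=category)
        answer = str(len(boxes))
        cota = (
            f"Thought: I need to locate all {category} in the image.\n"
            f"Action: GetObjects(category='{category}')\n"
            f"Observation: {len(boxes)} instances found.\n"
            f"Thought: The count is {len(boxes)}.\n"
            f"Action: Terminate(answer='{answer}')"
        )
        return {"question": question, "answer": answer, "target": cota}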

CoTA Data Distribution


Figure 3. We visualize the frequency of data formats (i.e. CoTA-pos/neg and CoT-pos/neg) in the original GPT-4o-generated data and of the final formats (i.e. CoTA, CoT, or Direct) in our final training data for each dataset across all data sources. We also highlight the Action-useless datasets (i.e. those where % CoT-pos - % CoTA-pos > 10 or % CoTA-neg - % CoTA-pos > 10) vs. the Action-useful ones.
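As a concrete reading of the Action-useless rule above, the small helper below flags a dataset given its per-format accuracy percentages; the dictionary keys are illustrative and the percentages are assumed to be on a 0-100 scale.

    # Apply the Action-useless criterion from the Figure 3 caption:
    # CoT-pos exceeds CoTA-pos by more than 10 percentage points, or
    # CoTA-neg exceeds CoTA-pos by more than 10 percentage points.
    def is_action_useless(stats, margin=10.0):
        return (stats["cot_pos"] - stats["cota_pos"] > margin or
                stats["cota_neg"] - stats["cota_pos"] > margin)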

Experimental Results

We perform extensive experiments with 3 open-source multi-modal models and 9 data recipes on 8 benchmarks to study the effectiveness of CoTA data compared to instruction-tuning data with only direct answers, and to investigate whether data filtering and programmatic data can lead to further performance gains. We highlight three main takeaways below:

Takeaway 1: Fine-tuning with CoTA data elicits multi-modal language models' reasoning and action-calling abilities and significantly boosts their performance (i.e. CoTA finetuned vs. CoTA in Figure 4).

Figure 4. Model's average accuracy before and after being finetuned with Direct Answers vs. CoTA data

Takeaway 2: Compared to the instruction-tuned baseline trained with only direct answers, TACO improves by 1-4% on average across all benchmarks (i.e. CoTA finetuned vs. Direct finetuned in Figure 4).


Table 1. Models' accuracy on all benchmarks after being finetuned with the best data recipe


Figure 5. Model's accuracy on MMVet after being finetuned with Direct Answers vs. CoTA data

Takeaway 3: Takeaway 2 holds regardless of model backbones and checkpoints (Table 1), with significant gains of up to 20% on MMVet (Figure 5).

Citation

@misc{ma2024tacolearningmultimodalaction,
      title={TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action}, 
      author={Zixian Ma and Jianguo Zhang and Zhiwei Liu and Jieyu Zhang and Juntao Tan and Manli Shu and Juan Carlos Niebles and Shelby Heinecke and Huan Wang and Caiming Xiong and Ranjay Krishna and Silvio Savarese},
      year={2024},
      eprint={2412.05479},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.05479}, 
}