TACO

Learning Multi-modal Action Models
with Synthetic Chains-of-Thought-and-Action

1 University of Washington    2 Salesforce AI Research

Figure 1. TACO outputs a Chain-of-Thought-and-Action (CoTA) and answers challenging questions based on the thoughts and action outputs, whereas existing multi-modal large language models output only direct answers and often fail to reach the correct answer.

Abstract

We present TACO, a family of multi-modal large action models designed to improve performance on complex, multi-step, multi-modal tasks. During inference, TACO produces chains-of-thought-and-action (CoTA), executes intermediate steps by invoking external tools such as OCR, depth estimation, and a calculator, and then integrates both the thoughts and action outputs to produce coherent responses. To train TACO, we create a large dataset of over 1M synthetic CoTA traces generated with GPT-4o and Python programs. We then experiment with various data filtering and mixing techniques and obtain a final subset of 293K high-quality CoTA examples. This dataset enables TACO to learn complex reasoning and action paths, surpassing existing models trained on instruction-tuning data with only direct answers. Our model TACO outperforms the instruction-tuned baseline across 8 benchmarks, achieving an average improvement of 3.6%, with gains of up to 15% on MMVet tasks involving OCR, mathematical reasoning, and spatial reasoning. Training on high-quality CoTA traces sets a new standard for complex multi-modal reasoning, highlighting the need for structured, multi-step instruction tuning in advancing open-source multi-modal models' capabilities.
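To make the CoTA inference loop concrete, below is a minimal Python sketch of the thought-action-observation cycle. The tool stubs, the action names (including a Terminate-style final action), and the model interface are illustrative assumptions, not TACO's exact action schema.

```python
# Minimal sketch of a Chain-of-Thought-and-Action (CoTA) inference loop.
# Tool stubs, action names, and the model interface are assumptions for
# illustration; TACO's actual action schema is defined in the paper.

def ocr(image):
    """Stub for an OCR tool: return text detected in the image."""
    return "TOTAL: $42.50"

def calculate(expression):
    """Stub for a calculator tool: evaluate a simple arithmetic string."""
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"OCR": ocr, "Calculate": calculate}

def cota_inference(model, image, question, max_steps=5):
    """Alternate model-generated thought/action steps with tool calls."""
    history = [{"role": "user", "image": image, "question": question}]
    for _ in range(max_steps):
        step = model.generate(history)  # dict with thought, action, arguments
        history.append(step)
        if step["action"] == "Terminate":
            return step["answer"]  # final answer grounded in the full trace
        observation = TOOLS[step["action"]](**step["arguments"])
        history.append({"role": "observation", "content": observation})
    return None  # give up after max_steps
```

The key design point is that the model conditions on the accumulated history, so each thought can reference earlier tool outputs before committing to a final answer.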

TACO Successes and Failures

CoTA Dataset

CoTA Data Generation


Figure 2. We illustrate our model-based data generation (top) and programmatic generation (bottom) pipelines.

In model-based generation, we take existing image and QA pairs as inputs and prompt GPT-4o to Generate either a chain-of-thought-and-action (CoTA) or a chain-of-thought (CoT) without actions to answer the question. Then, we Verify that the chains lead to correct final answers and Parse successfully; if not, we convert them into the direct-answer (Direct) format with the ground-truth answers.
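A minimal sketch of this Generate, Parse, Verify, and fallback flow is below; the prompt wording, the JSON trace format, and the `llm` interface are assumptions for illustration.

```python
import json

def cota_prompt(question):
    # Hypothetical prompt; the actual GPT-4o prompt is given in the paper.
    return ("Answer the question with a JSON list of thought/action steps, "
            f"ending in a step with an 'answer' field.\nQuestion: {question}")

def try_parse(raw):
    """Parse step: return the list of steps if the output is valid, else None."""
    try:
        steps = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(steps, list) and steps and all(isinstance(s, dict) for s in steps):
        return steps
    return None

def build_example(llm, image, question, gt_answer):
    raw = llm.generate(image, cota_prompt(question))    # Generate
    steps = try_parse(raw)                              # Parse
    if steps and steps[-1].get("answer") == gt_answer:  # Verify
        return {"question": question, "target": steps}  # keep the CoTA/CoT
    # Unparseable or incorrect chains fall back to the direct-answer
    # (Direct) format with the ground-truth answer.
    return {"question": question, "target": gt_answer}
```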

In programmatic generation, we first Annotate images with human labelers or models, and then use the dense annotations to fill in manually written templates and Generate QA pairs and the corresponding CoTA with Python programs.
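For instance, a depth-comparison template could be instantiated from object annotations as in the sketch below; the annotation fields, template wording, and action names here are invented for illustration.

```python
# Sketch of programmatic QA + CoTA generation from dense annotations.
# Annotation fields, template wording, and action names are invented here.

TEMPLATE = "Which object is closer to the camera, the {a} or the {b}?"

def make_depth_example(ann):
    """ann holds two (name, depth) pairs; smaller depth means closer."""
    (a, da), (b, db) = ann["objects"]
    answer = a if da < db else b
    cota = [
        {"thought": f"I need the depths of the {a} and the {b}.",
         "action": "GetDepth", "arguments": {"objects": [a, b]}},
        {"observation": {a: da, b: db}},
        {"thought": f"The {answer} has the smaller depth, so it is closer.",
         "action": "Terminate", "answer": answer},
    ]
    return {"question": TEMPLATE.format(a=a, b=b), "target": cota}

example = make_depth_example({"objects": [("cup", 1.2), ("chair", 3.4)]})
# example["target"][-1]["answer"] == "cup"
```

Because both the question and the trace are derived from the same annotations, the generated CoTA is correct by construction, with no verification step needed.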

CoTA Data Distribution


Figure 3. We visualize the frequency of data formats (i.e., CoTA-pos/neg and CoT-pos/neg) in the original GPT-4o-generated data and of the final training formats (i.e., CoTA, CoT, or Direct) in each dataset across all data sources. We also highlight Action-useless datasets (i.e., those where % of CoT-pos - % of CoTA-pos > 10 or % of CoTA-neg - % of CoTA-pos > 10) vs. Action-useful datasets.
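The Action-useless criterion above amounts to a simple per-dataset filter; a direct transcription is below, with hypothetical field names for the per-format percentages.

```python
# Per-dataset Action-useless filter from Figure 3. Inputs are the
# percentages (0-100) of each generated format in a dataset; the
# field names are hypothetical.

def is_action_useless(pct):
    """True if CoT beats CoTA, or CoTA is wrong much more often than right."""
    return (pct["cot_pos"] - pct["cota_pos"] > 10
            or pct["cota_neg"] - pct["cota_pos"] > 10)

# Actions clearly help here, so this dataset counts as Action-useful.
assert not is_action_useless({"cota_pos": 60, "cota_neg": 20, "cot_pos": 55})
```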

Experimental Results

We perform extensive experiments with 3 open-source multi-modal models and 9 data recipes on 8 benchmarks to study the effectiveness of CoTA data compared to instruction-tuning data with only direct answers, and to investigate whether data filtering and programmatic data can lead to further performance gains. We highlight four main takeaways below:


Table 1. CoTA Inference Before vs. After Fine-tuning

While GPT-4o performs well with either a direct answer (Direct) or chain-of-thought-and-action (CoTA) prompt, open-source multi-modal models lag behind and fail to generate CoTA with few-shot prompting.

Takeaway 1: We show that fine-tuning with CoTA data elicits multi-modal language models' reasoning and action-calling abilities and significantly boosts their performance, which few-shot prompting fails to achieve.


Table 2. Best CoTA Recipe

Takeaway 2: Our best CoTA data recipe results in a strong multi-modal action model TACO that consistently beats instruction-tuned baselines by 1-4% on average across 8 benchmarks, with significant gains of up to 15% on MMVet.


Table 3. Model-generated Data Ablations

Takeaway 3: Quality matters more than quantity: the smallest dataset, containing only CoTA examples, yields better average performance and larger gains than bigger datasets mixing CoTA, CoT, and/or Direct examples; filtering out Action-useless datasets also leads to performance gains.


Table 4. Model-generated + Program-generated CoTA Mixtures

Takeaway 4: Adding programmatically generated data can lead to further gains on some benchmarks but does not improve the average performance.

Citation

@misc{ma2024tacolearningmultimodalaction,
      title={TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action}, 
      author={Zixian Ma and Jianguo Zhang and Zhiwei Liu and Jieyu Zhang and Juntao Tan and Manli Shu and Juan Carlos Niebles and Shelby Heinecke and Huan Wang and Caiming Xiong and Ranjay Krishna and Silvio Savarese},
      year={2024},
      eprint={2412.05479},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.05479}, 
}