While end-to-end video-to-audio generation has greatly improved, producing high-fidelity audio that authentically captures the nuances of visual content remains challenging. As with the work of professionals in the creative industries, such generation requires sophisticated reasoning about visual dynamics, acoustic environments, and temporal relationships. We present ThinkSound, a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: foundational foley generation that creates semantically coherent soundscapes, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. At each stage, a multimodal large language model generates contextually aligned CoT reasoning that guides a unified audio foundation model. Furthermore, we introduce AudioCoT, a comprehensive dataset with structured reasoning annotations that establishes connections between visual content, textual descriptions, and sound synthesis. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation on both audio metrics and CoT metrics, and that it excels on the out-of-distribution Movie Gen Audio benchmark. The demo page is available at https://ThinkSound-Project.github.io.
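The three-stage flow described above can be pictured as a loop in which a reasoning model produces CoT text for each stage and a single audio model consumes it. The following is only a schematic sketch under stated assumptions, not the ThinkSound implementation or API: `Stage`, `run_pipeline`, `stub_reasoner`, and `stub_generator` are hypothetical names, videos and audio are stood in for by strings, and skipping an interactive stage when no user input is supplied is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical placeholder types: a real system would pass frames and waveforms.
Video = str
Audio = str


@dataclass(frozen=True)
class Stage:
    name: str
    needs_user_input: bool


# The three stages named in the abstract, in order.
STAGES: List[Stage] = [
    Stage("foley_generation", needs_user_input=False),
    Stage("object_centric_refinement", needs_user_input=True),  # e.g. a click on an object
    Stage("instruction_editing", needs_user_input=True),  # e.g. a natural-language edit
]

Reasoner = Callable[[str, Video, Optional[Audio], Optional[str]], str]
Generator = Callable[[Video, str, Optional[Audio]], Audio]


def run_pipeline(
    video: Video,
    reasoner: Reasoner,
    generator: Generator,
    user_inputs: Dict[str, str],
) -> Tuple[Optional[Audio], List[Tuple[str, str]]]:
    """Run the stages in order, feeding each stage's CoT text to one shared generator."""
    audio: Optional[Audio] = None
    trace: List[Tuple[str, str]] = []
    for stage in STAGES:
        user_input = user_inputs.get(stage.name)
        if stage.needs_user_input and user_input is None:
            continue  # assumption: interactive stages are optional
        cot = reasoner(stage.name, video, audio, user_input)  # stage-specific reasoning
        audio = generator(video, cot, audio)  # same audio model at every stage
        trace.append((stage.name, cot))
    return audio, trace


# Stand-ins for the multimodal LLM and the audio foundation model (hypothetical).
def stub_reasoner(stage: str, video: Video, audio: Optional[Audio], user_input: Optional[str]) -> str:
    return f"[{stage}] reasoning about {video} given {user_input!r}"


def stub_generator(video: Video, cot: str, previous_audio: Optional[Audio]) -> Audio:
    return f"audio<{cot}>"


if __name__ == "__main__":
    final_audio, steps = run_pipeline(
        "clip.mp4", stub_reasoner, stub_generator, {"instruction_editing": "add rain"}
    )
    for name, cot in steps:
        print(name, "->", cot)
```

The point of the sketch is the separation of roles: the reasoner changes its output per stage, while the generator is the same callable throughout, mirroring the "unified audio foundation model" wording in the abstract.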
Model Overview
Citation
If you find our work useful, please cite our paper:
@misc{liu2025thinksoundchainofthoughtreasoningmultimodal,
title={ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing},
author={Huadai Liu and Jialei Wang and Kaicheng Luo and Wen Wang and Qian Chen and Zhou Zhao and Wei Xue},
year={2025},
eprint={2506.21448},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2506.21448},
}