A free, open-source multimodal model that can "zoom in on images to think", with only 3B activated parameters. Baidu has open-sourced its Wenxin (ERNIE) multimodal thinking model, ERNIE-4.5-VL-28B-A3B-Thinking. Notably, the release uses the Apache-2.0 license and ships the complete weights and inference code, so commercial use is allowed.

Over the past few years, the large-model industry has resembled an arms race: parameter counts keep growing and compute keeps getting more expensive. Small models, however, have their own advantages: low deployment cost, fast inference, and more usage scenarios (such as running on mobile phones). Here, the 28B model activates only about 3B parameters per token, which is what the "A3B" in the name refers to.

The biggest highlight is the model's ability to "think with images": it can actively zoom in and out on an image, focus on details, and carry out multi-step reasoning. And because it is multimodal, it also supports video analysis, text extraction, and other capabilities. It is reported to perform stably on image, text, video, and document understanding and reasoning tasks, and the official demo cases look quite good.
The model has been uploaded to HuggingFace, GitHub, and Baidu's PaddlePaddle Galaxy Community (AI Studio):
- HuggingFace: huggingface.co/baidu/ERNIE-4.…
- GitHub: github.com/PaddlePaddle/E…
- GitHub: github.com/PaddlePaddle/F…
- PaddlePaddle Galaxy Community: aistudio.baidu.com/modelsdetail/3…
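
Since the weights are on HuggingFace, a quick way to try the model is through the transformers library. Below is a minimal sketch, assuming the repo works with the standard AutoProcessor/AutoModelForCausalLM interface via trust_remote_code and the usual multimodal chat-template convention; the image URL is a placeholder, and the model card has the officially supported usage.

```python
# Minimal sketch: load ERNIE-4.5-VL-28B-A3B-Thinking with transformers.
# Assumption (not confirmed by this post): the repo supports the standard
# Auto* classes with trust_remote_code=True; see the model card for details.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "baidu/ERNIE-4.5-VL-28B-A3B-Thinking"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~3B params active per token, but all 28B must fit in memory
    device_map="auto",
    trust_remote_code=True,
)

# One image + text turn, using the common multimodal chat-template format.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chart.png"},  # hypothetical image
            {"type": "text", "text": "What does this chart show? Zoom in on the details if needed."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

One caveat worth noting: "3B activation parameters" describes a mixture-of-experts design, so the figure speeds up inference per token rather than shrinking the download; the full 28B weights still need to be loaded.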





