In-depth analysis: why Meta's new model is distilling from Alibaba's Qwen

I came across some surprising news: Bloomberg reports that Meta's new model, Avocado, is being distilled from open-weight models such as Alibaba's Qwen, Google's Gemma, and OpenAI's gpt-oss. Moreover, Avocado will be closed-source and commercial, Llama is most likely dead (Zuckerberg has reportedly abandoned the project), and Avocado is expected to be released in January next year. Let me offer some analysis from a professional perspective.

Why use three teacher models instead of just one? The decision is actually quite practical. During distillation, several teachers can be queried on the same prompt, and whichever gives the best answer is used to guide the student. The Qwen family, for example, offers a wide range of open-weight models, and at comparable scales its Chinese proficiency and coding ability are both strong, so Qwen is the natural teacher for multimodal, coding, and Chinese-language domains, while the other two models cover the remaining domains.

We can also glean a great deal from this report. The original text mentions "distilling from rival models including Google's Gemma, OpenAI's gpt-oss, and Qwen," which strongly suggests that Avocado has already entered post-training. Distillation roughly splits into black-box distillation (training on a teacher's outputs) and intermediate-layer distillation (matching a teacher's hidden representations). Intermediate-layer distillation requires dimensional projection, which in practice forces the student's architecture to imitate the teacher's; doing that would essentially be cloning the model. The report says three open-weight models with different architectures are being used, so intermediate-layer distillation is not feasible. This is therefore most likely a black-box strategy applied at the post-training stage, which in turn implies that Avocado's base model is already ready.
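To make the multi-teacher idea above concrete, here is a minimal Python sketch of black-box, best-response distillation data collection. Nothing here comes from the report; the teacher and judge callables are hypothetical stand-ins for whatever wrappers would actually sit around Qwen, Gemma, and gpt-oss.

```python
# Minimal sketch of black-box multi-teacher distillation via best-response selection.
# The teacher/judge callables below are hypothetical stand-ins, not any real API.

from typing import Callable, Dict, List


def build_distillation_set(
    prompts: List[str],
    teachers: Dict[str, Callable[[str], str]],   # name -> generate(prompt) -> answer
    judge: Callable[[str, str], float],          # (prompt, answer) -> quality score
) -> List[dict]:
    """For each prompt, query every teacher and keep only the highest-scoring answer."""
    sft_records = []
    for prompt in prompts:
        candidates = [(name, gen(prompt)) for name, gen in teachers.items()]
        best_name, best_answer = max(candidates, key=lambda c: judge(prompt, c[1]))
        sft_records.append({
            "prompt": prompt,
            "response": best_answer,
            "teacher": best_name,   # handy for per-domain routing analysis later
        })
    return sft_records


if __name__ == "__main__":
    # Toy teachers and a length-based "judge", purely for illustration.
    teachers = {
        "qwen": lambda p: f"[qwen] detailed answer to: {p}",
        "gemma": lambda p: f"[gemma] answer to: {p}",
        "gpt-oss": lambda p: f"[gpt-oss] reply: {p}",
    }
    judge = lambda prompt, answer: float(len(answer))  # placeholder scorer
    data = build_distillation_set(["Write a Python quicksort."], teachers, judge)
    print(data[0]["teacher"], "->", data[0]["response"])
```

The key point is that only the teachers' final text outputs are needed, which is why mixing teachers with completely different architectures is not a problem.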
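For contrast, here is a minimal PyTorch sketch of the intermediate-layer distillation that the argument above rules out: it needs access to the teacher's hidden states plus a learned projection whenever the hidden sizes differ, which is exactly why mismatched architectures make it impractical. All dimensions are made up for illustration.

```python
# Minimal sketch of intermediate-layer (white-box) distillation with a learned
# projection between mismatched hidden sizes. Dimensions are illustrative only.

import torch
import torch.nn as nn
import torch.nn.functional as F

STUDENT_DIM, TEACHER_DIM = 2048, 4096   # hypothetical hidden sizes

# Projection mapping student hidden states into the teacher's feature space.
proj = nn.Linear(STUDENT_DIM, TEACHER_DIM)


def hidden_state_loss(student_hidden: torch.Tensor, teacher_hidden: torch.Tensor) -> torch.Tensor:
    """MSE between projected student features and (detached) teacher features."""
    return F.mse_loss(proj(student_hidden), teacher_hidden.detach())


# Toy tensors standing in for one aligned layer's activations: (batch, seq_len, dim).
student_h = torch.randn(2, 16, STUDENT_DIM, requires_grad=True)
teacher_h = torch.randn(2, 16, TEACHER_DIM)

loss = hidden_state_loss(student_h, teacher_h)
loss.backward()   # gradients flow into the student (and projection), not the teacher
print(f"feature-matching loss: {loss.item():.4f}")
```

Because the loss is computed on internal activations, the student effectively inherits the teacher's representational layout, which is the "cloning" referred to above.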
The fact that Meta is distilling from open-weight models rather than synthesizing its own post-training data means it severely lacks high-quality, domain-specific data (especially for logical reasoning, code, and complex instruction following). Meta likely holds one of the largest datasets in the world (billions of chat logs and posts), yet this is precisely its weakness: the data on Facebook and Instagram is full of colloquialisms, abbreviations, emotional outbursts, and short texts. That data is extremely useful for teaching a model to "speak like a human," but it does almost nothing for teaching a model to "think like an engineer" (reasoning and coding); for those purposes it is essentially noise. One might even recall the paper from this October, "LLMs Can Get 'Brain Rot'!", which argued that training large models on social media data can degrade them into something "brain-dead."

Also, the TBD (product) team's role differs from that of FAIR (research): it desperately needs to prove itself commercially. For that team, face (being seen distilling from competitors' models) is unimportant; usability and rapid deployment are paramount, and so is having results to show Zuckerberg.

In summary, the report downplays this part, but the information it reveals includes:
1. Avocado has entered post-training. The base model architecture is uncertain, but it is definitely different from Qwen, Gemma, and gpt-oss; it is Meta's own architecture.
2. Meta is severely lacking in high-quality, domain-specific data (especially for logical reasoning, code, and complex instruction following).
3. The team is under so much pressure that it resorted to this method for post-training. It did not even use these models to synthesize training data; it went straight to "copying the answers" via distillation.
4. Meta is relying on distilling the Qwen ("Thousand Questions") series to improve its logic and coding skills. Isn't that a reverse "official certification" of the value of Alibaba's Qwen? Hahaha

#meta #AliQianwen #qwen #Avocado #llama
