@bookwormengr: https://x.com/bookwormengr/status/2072421710692028900

X AI KOLs Following 07/01/26, 08:46 PM News

huawei meituan longcat-2.0 ai-ecosystem china-ai ai-factory export-controls

Summary

An analysis of how Meituan's LongCat Lab trained the LongCat 2.0 model on Huawei's 910C AI chips and CloudMatrix superpods, showcasing China's AI ecosystem overcoming US export controls. The article highlights Meituan's strategic pivot to an AI-driven lifestyle super app and Huawei's innovations in AI factory infrastructure.

https://t.co/4vkjrPsyPw


Cached at: 07/02/26, 04:18 AM

Huawei’s AI Factory & the stupendous rise of a food delivery company’s AI lab

This is a story of how China’s AI ecosystem is working together to overcome the technological limitations imposed by US export controls. It is also a story of how a food delivery giant ended up building a leading frontier lab destined for greatness. The story comes with a tech deep dive as well.

While the X community has taken due note of the LongCat 2.0 model, it has largely missed the ingenuity of LongCat Lab, as well as that of the Huawei team supporting them. This article is an attempt to bridge the gap.

Background:

Meituan’s LongCat Lab’s LongCat 2.0 has been the talk of the town since it was released this week. It is the first model publicly known to have been trained on Huawei 910Cs, most likely using CloudMatrix 384 Superpods that have 384 ASICs/NPUs each (compared to 72 GPUs in Nvidia’s NVL72 racks). LongCat’s blog claims to have used 50K ASICs in total. Two things stand out from their literature, as well as from Huawei’s:

The LongCat team has shown it is willing to make bold architectural bets and come out on top. **1. They have mastered sparse MoE models, which are hard to train. 2. They have invented better transformer attention. 3. They have developed a new axis of sparsity with an n-gram based approach that trades ASIC FLOPs and HBM for cheap CPU DRAM.** This is intriguing because LongCat Lab belongs to China’s food delivery giant Meituan. Imagine Uber, DoorDash or Instacart pulling off such feats! How outstanding that would be!
Huawei has shown it can find its way around export controls and innovate. It has also shown how well it understands LLM pre-training and inference, having designed its Superpod architecture **with UB-Mesh, memory pooling and many other innovations that keep training stable against hardware failures.** Huawei doesn’t just provide AI servers, but “AI Factory” scale solutions. It is prudent to review them at this stage, as Huawei is going to be a globally important player.

First the Story of Meituan - why build a foundation model lab at all?

Meituan - the super app

Imagine taking Uber Eats, DoorDash, Yelp, Groupon, TripAdvisor, and Lime, and melting them down into a single smartphone application. That is the everyday reality of Meituan, China’s leading lifestyle “super app”. While Western consumers hop between dozens of fragmented apps to handle separate errands, Meituan operates as a unified digital infrastructure for daily living. Within its single interface, a user can order a 30-minute lunch delivery, unlock a commuter bike, buy movie tickets, book a hotel stay, and secure a group discount for a hair salon. By leveraging ultra-high-frequency interactions like food delivery to anchor user habits, the tech giant cross-sells higher-margin lifestyle services seamlessly. It handles millions of daily transactions and acts as a primary digital gatekeeper for physical commerce.

What truly secures Meituan’s dominance as a super app is its aggressive transition into an AI-driven “lifestyle gateway,” powered by its massive, proprietary LongCat large language model. While Western tech companies use chatbots primarily for text generation or coding, Meituan leverages conversational AI agents, like its native assistant **Xiaomei**, to fundamentally reshape how consumers interact with the physical world. Instead of manually typing “ramen” and scrolling through pages of listings, a user simply tells the AI agent their budget, location, and plan. The AI then instantly cross-references the request against Meituan’s database of over 700 million real-time merchant inventories and 1.3 billion user reviews to book tables or order deliveries autonomously. Faced with narrowing profit margins due to fierce domestic e-commerce competition, Meituan is investing billions into applied commerce AI to act as an indispensable infrastructure layer, not just processing clicks, but dynamically executing daily real-world decisions.

AI is not optional for Meituan; it is essential for its survival. They follow exactly the same logic as Meta in owning their LLM layer: it is too important to be left to other companies to provide. Expect them to keep releasing kick-ass models and research papers.

‘Light Years Beyond’ to ‘LongCat Lab’

Meituan’s LongCat Lab (LongCat AI) is the ambitious corporate research initiative established by Meituan’s co-founder and CEO Wang Xing. The lab was largely formed through Meituan’s mid-2023 acquisition of Light Years Beyond (Guangnian Zhi Wai), an AI startup originally founded by Meituan co-founder Wang Huiwen.

There is a bit of a heartwarming story there (quoting @kevinsxu): “In early 2023, soon after ChatGPT launched Wang Huiwen (Meituan co-founder) launched his own lab to build **“China’s OpenAI”. He threw in $50 mil of his own money**, attracted investments from other Meituan co-founders including CEO Wang Xing, and built a decent team. You can say he started the wave of Chinese AI labs looking to compete with OpenAI, Anthropic, that now includes DeepSeek, Moonshot, Z, etc. The lab was called **‘Light Year Beyond’**. It didn’t last long unfortunately. By June 2023, Wang was suffering from mental health issues, the pressure was too much, and Meituan bailed him out and acquired it.” (Sources: Kevin’s tweet, SCMP)

Today Wang Huiwen would be pretty pleased with how his foundation turned out. Sure, the team working on LongCat now is not the exact same team, but he laid the foundation that led to the building of LongCat 2.0! A true visionary!

Source: SCMP

Does Meituan have money to sustain?

Meituan is a publicly traded company listed on the Hong Kong Stock Exchange under the ticker symbol 3690. It operates with a market capitalization of approximately **$53.5 billion to $65 billion USD**, and generates a trailing twelve-month (TTM) revenue of roughly **$50 billion USD**.

While the revenue seems huge, the company is currently unprofitable on a trailing twelve-month net basis. Its core food delivery and in-store operations are established and generate high transaction volumes, but heavy spending on subsidies to fend off competitors (such as Alibaba and JD.com) and losses from expanding into new markets (such as the Middle East under the brand Keeta) have squeezed its net margins. That said, this is a company with massive cash flows and technology skills, and it knows how to raise money. These are the big boys of China’s tech industry.

The era of Trillion parameter models

LongCat 2.0 is a 1.6 trillion parameter model. For comparison, GLM 5.2 is less than half its size at nearly 700 billion. Training large models gets increasingly harder, and this lab from the food delivery company Meituan has shown that it is more than capable of doing so. Not just that: the model is also pretty good based on the benchmark numbers, including a 70.8% score on Terminal Bench 2!

Source: LongCat

To be sure, the LongCat team is not the first in China to reach the 1T mark. DeepSeek V4 Pro is 1.6T; the Moonshot team is also rumoured to have a 1T parameter model based on linear attention. So China-based labs know how to train massive models. **What is new is that we now have a model fully trained from scratch on China’s domestic Ascend 910C chips, and in doing so both the lab and the hardware vendor have shown a lot of innovative thinking.** The LongCat team has also taken bold architectural bets like n-gram embeddings and come out on top; we will discuss more about it.

Geo-Technological Implications

As per SemiAnalysis, Huawei can make up to 1.6 million Ascend 910C chips before they run out of stockpiled components. Quite a few to help reach AGI given Longcat 2.0 was trained with only 50K of these chips!
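To put those chip counts in perspective, here is a rough back-of-the-envelope sketch. It assumes (as the article only says is "most likely") that all 50K NPUs sat in full CloudMatrix 384 Superpods, and it treats the SemiAnalysis stockpile estimate as exact; both are simplifications.

```python
import math

NPUS_PER_SUPERPOD = 384          # CloudMatrix 384, as described in this article
LONGCAT_TRAINING_NPUS = 50_000   # figure claimed in LongCat's blog
STOCKPILE_NPUS = 1_600_000       # SemiAnalysis estimate quoted above


def superpods_needed(npus: int, per_pod: int = NPUS_PER_SUPERPOD) -> int:
    """Smallest number of whole superpods that covers `npus` chips."""
    return math.ceil(npus / per_pod)


def clusters_supported(stockpile: int, cluster_size: int) -> int:
    """How many clusters of `cluster_size` chips fit into the stockpile."""
    return stockpile // cluster_size


pods = superpods_needed(LONGCAT_TRAINING_NPUS)
clusters = clusters_supported(STOCKPILE_NPUS, LONGCAT_TRAINING_NPUS)
print(f"Superpods for a LongCat-scale run: {pods}")
print(f"LongCat-scale clusters in the stockpile: {clusters}")
```

Under these assumptions a LongCat-scale run needs on the order of 130 Superpods, and the stockpile covers about 32 such clusters.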

**Also, SMIC of China has been manufacturing new dies for Huawei on a 7nm process node, so they may never run out of logic chips. HBM is still a bottleneck, but 1.6 million chips are quite a lot to reach AGI.** CXMT is also working aggressively to manufacture at least HBM3.

**At the same time, labs like DeepSeek are singularly focussed on reducing HBM capacity and HBM bandwidth requirements.** Their techniques like MLA, DSA, CSA and HCA are adopted by the wider Chinese lab ecosystem, including LongCat Lab. In the past I have covered those techniques in depth in this essay: DeepSeek’s 10 trillion USD strategy

LongCat Lab uses DeepSeek’s MLA and a variant of DSA that is arguably more efficient. Likely inspired by DeepSeek’s Engram approach **(which spends CPU memory to save ASIC compute and HBM)**, they also add an N-gram Embedding module that expands the embedding space by roughly 100× through N-gram token combinations, capturing richer local context and strengthening token-level representations. **This spends more CPU-attached memory but saves on ASIC compute and HBM bandwidth, both of which are at a premium for the Chinese ecosystem.** To be fair to the LongCat team, their LongCat Flash paper was published within 2 weeks of DeepSeek’s Engram, and their approach has key variations, so they probably invented it independently. But the idea is the same: trade a bit of DRAM for savings on more expensive resources, namely ASIC FLOPs and HBM bandwidth. I highly recommend reviewing the LongCat Flash paper as well as DeepSeek’s Engram paper. I have included key diagrams from LongCat and DeepSeek below.

Source: Longcat

Source: DeepSeek, showing how Engram embeddings can be stored and used. Compute-communication overlap can be used smartly for the same.

On their end, **Huawei also implements memory pooling to make DRAM usage efficient for critical use cases like model loading and KV caching.**

What is more, Huawei has designed the **system to be fault resistant. Huawei’s approach is basically a large number of lower-performance components working in parallel.** Huawei is well aware that having a large number of components for a given system capacity can interrupt training more frequently, as some of the components are bound to fail, and its system has far more components than Nvidia’s equivalent.
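The general idea of an n-gram embedding lookup can be illustrated with a toy sketch. This is not LongCat’s or DeepSeek’s actual design: the rolling hash, the bucket count, the zero vector for positions without enough left context, and the additive combination with token embeddings are all assumptions made for illustration. The point is only that the large table is read by cheap lookups from ordinary host memory instead of being computed on the accelerator.

```python
import random


class NGramEmbedding:
    """Toy hashed n-gram embedding table kept in ordinary host (CPU) memory."""

    def __init__(self, num_buckets: int, dim: int, n: int = 2, seed: int = 0):
        rng = random.Random(seed)
        self.n = n
        self.dim = dim
        self.num_buckets = num_buckets
        # One row per hash bucket; a real table would be far larger.
        self.table = [
            [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(num_buckets)
        ]

    def bucket(self, ngram: tuple) -> int:
        """Deterministic polynomial rolling hash of the token ids (hypothetical choice)."""
        h = 0
        for tok in ngram:
            h = (h * 1_000_003 + tok + 1) % self.num_buckets
        return h

    def lookup(self, token_ids: list) -> list:
        """Return one vector per position, keyed by the n-gram ending at that position."""
        out = []
        for i in range(len(token_ids)):
            start = i - self.n + 1
            if start < 0:
                out.append([0.0] * self.dim)  # not enough left context yet
            else:
                out.append(self.table[self.bucket(tuple(token_ids[start:i + 1]))])
        return out


def add_ngram_features(token_embs: list, ngram_embs: list) -> list:
    """Combine per-token and per-n-gram vectors by element-wise addition."""
    return [[a + b for a, b in zip(t, g)] for t, g in zip(token_embs, ngram_embs)]


emb = NGramEmbedding(num_buckets=1000, dim=4, n=2)
vectors = emb.lookup([5, 7, 5, 7])
print(len(vectors), vectors[1] == vectors[3])
```

Positions 1 and 3 both end the bigram (5, 7), so they retrieve the same row: the table encodes local context that a per-token embedding cannot.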

Nvidia Blackwell Quick Review

Before we look deeper into the Ascend 910C, it is useful to review Nvidia’s Blackwell. Blackwell provides most of the training and inference FLOPs across the world. It comes in two variants, B200 and B300 (Ultra). The main differences between B200 and B300 are HBM capacity (192GB per GPU vs 288GB), FP4 FLOPs (10PF vs 15PF) and scale-out bandwidth (400 Gbps vs 800 Gbps). Everything else is pretty much the same. Note that there are two dies per GPU.

The diagram below lists various critical bandwidth metrics. Note them carefully, as we will compare them with Ascend 910C.

Source: Nvidia

Huawei Ascend 910C in comparison to Blackwell

Huawei 910C consists of two dies as well, just like Blackwell. It is the next version of 910B which had only one die. Nothing special / different here.

Source: Huawei Central

The major difference lies in the various bandwidths, as can be seen from the diagram below (Source: Huawei: Serving Large Language Models on Huawei CloudMatrix384). Nvidia typically quotes bidirectional bandwidth when the industry standard is unidirectional bandwidth reporting (“Jensen Math”). Hence I converted the Nvidia numbers to unidirectional. Also, Nvidia reports bandwidth for both dies together; I have split it for a one-to-one comparison with Ascend dies. The numbers in red are for Blackwell GPUs.

Source: Huawei (the red color numbers are added by me that represent numbers for Blackwell)
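The normalization described above is simple arithmetic, sketched below. The even split across dies is an assumption for illustration; the example inputs are the 900 GB/s and 1800 GB/s bidirectional figures this article quotes for Nvidia links.

```python
def to_unidirectional(bidirectional_gb_s: float) -> float:
    """Halve a bidirectional figure to get the per-direction bandwidth."""
    return bidirectional_gb_s / 2


def per_die(gpu_level_gb_s: float, dies_per_gpu: int = 2) -> float:
    """Split a GPU-level figure evenly across its dies (assumed even split)."""
    return gpu_level_gb_s / dies_per_gpu


for quoted in (900.0, 1800.0):
    uni = to_unidirectional(quoted)
    print(f"{quoted:.0f} GB/s quoted -> {uni:.0f} GB/s unidirectional "
          f"-> {per_die(uni):.0f} GB/s per die")
```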

**Die-to-die bandwidth:** While the other bandwidths are of a similar order of magnitude, I can’t help but notice that the die-to-die bandwidth for 910C is pretty low (270GB/s compared to 5TB/s for Blackwell).

**CPU-GPU bandwidth:** Also, the 910C has no dedicated NVLink-C2C style link for communication between accelerators and CPUs. Nvidia has had this type of connectivity since the Grace-Hopper generation: it is 900GB/s (450GB/s unidirectional) for both Grace-Hopper and Grace-Blackwell, and 1800GB/s (900GB/s unidirectional) for Vera-Rubin. Below is a diagram showing it for Vera Rubin (sorry, I could not find a similar diagram for Grace Blackwell).

Source: Nvidia.

In Huawei servers, CPUs and NPUs are connected using UB switches and the UB protocol to provide high bandwidth between them. Huawei uses its UnifiedBus (UB) and UB-Mesh protocols as a universal fabric to scale up and connect CPUs, GPUs, and NPUs. Designed to replace legacy interconnects like PCIe, NVLink, and TCP/IP, the protocol pools massive arrays of processors into a single logical “SuperNode” memory space. In the diagram below, Ascend 910Cs are referred to as ‘NPUs’. It shows a single node of CloudMatrix 384: there are 48 such nodes with 8 NPUs each, for a total of 384. There are 4 CPUs per node, so there are 192 CPUs in total in a Superpod.

Source: Huawei

Nvidia’s innovation: Rack-scale systems

NVL72 is a rack-scale system that has 18 compute trays, each with 2 CPUs and 4 GPUs, for a total of 72 GPUs. These 72 GPUs are connected to one another using NVLink and NVSwitches (the switches are placed at the center of the rack). This allows the 72 GPUs to work like a single massive GPU. **Any GPU can access any other GPU’s HBM.** The NVLink interconnect is copper-based (cheap) and high bandwidth. This is the brilliance of Nvidia’s design:

Source: HPE
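The two scale-up domains can be compared with a small sketch that uses only the node and tray counts given in this article. The class and field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleUpDomain:
    name: str
    units: int                  # nodes (CloudMatrix 384) or compute trays (NVL72)
    accelerators_per_unit: int
    cpus_per_unit: int

    @property
    def accelerators(self) -> int:
        return self.units * self.accelerators_per_unit

    @property
    def cpus(self) -> int:
        return self.units * self.cpus_per_unit


cloudmatrix = ScaleUpDomain("CloudMatrix 384", units=48,
                            accelerators_per_unit=8, cpus_per_unit=4)
nvl72 = ScaleUpDomain("NVL72", units=18,
                      accelerators_per_unit=4, cpus_per_unit=2)

for d in (cloudmatrix, nvl72):
    print(f"{d.name}: {d.accelerators} accelerators, {d.cpus} CPUs")
print(f"Accelerator ratio: {cloudmatrix.accelerators / nvl72.accelerators:.2f}x")
```

The counts alone say nothing about per-chip performance; they only show that Huawei's scale-up domain holds a bit more than five times as many accelerators.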

Huawei’s answer: Multi-rack system CloudMatrix 384

Nvidia’s Grace Blackwell NVL72 racks put 72 GPUs in a single scale-up domain where any GPU can talk to any other GPU and access its HBM with ease at extremely high bandwidth (unidirectional: 900GB/s for Grace Blackwell and 1800GB/s for Vera Rubin). Each Huawei 910C has lower FLOPs, HBM and bandwidths. Despite that, Huawei’s multi-rack system CloudMatrix 384 manages to beat NVL72 on capacity: FLOPs, memory and various bandwidths. How?

By putting 384 NPUs in a single scale-up domain. It is a gigantic feat. This is where Huawei’s expertise as a gigantic networking company helps.

Quote from Huawei: “A defining feature of CloudMatrix384 is its peer-to-peer, fully interconnected, ultra-high-bandwidth network that links all NPUs and CPUs via the UB protocol. CloudMatrix384’s UB design is a precursor to the UB-Mesh. Each of the 384 NPUs and 192 CPUs connects through UB switches, enabling inter-node communication performance that closely approximates intranode levels. **The inter-node bandwidth degradation is under 3%, and inter-node latency increase is less than 1 µs. Given that modern AI workloads are predominantly bandwidth-intensive rather than latency-sensitive, this marginal latency overhead has a negligible impact on the end-to-end performance of AI tasks.** Overall, this design allows CloudMatrix384 to function as a tightly-coupled, large-scale logical node with globally addressable compute and memory, facilitating unified resource pooling and efficient workload orchestration.” (Source: Serving Large Language Models on Huawei CloudMatrix384)
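Why a sub-microsecond latency penalty barely matters for bandwidth-bound traffic can be seen with a simple latency-plus-size-over-bandwidth model. The intra-node baseline numbers below are hypothetical placeholders, not figures from Huawei’s paper; only the 3% and 1 µs bounds come from the quote, and they are applied as a worst case.

```python
def transfer_time_us(size_bytes: int, bandwidth_gb_s: float, latency_us: float) -> float:
    """Simple model: fixed latency plus size divided by bandwidth, in microseconds."""
    return latency_us + size_bytes / (bandwidth_gb_s * 1e9) * 1e6


# Hypothetical intra-node baseline (assumed values for illustration only).
INTRA_BW_GB_S = 100.0
INTRA_LATENCY_US = 1.0

# Worst case allowed by the quote: 3% less bandwidth, 1 microsecond more latency.
INTER_BW_GB_S = INTRA_BW_GB_S * 0.97
INTER_LATENCY_US = INTRA_LATENCY_US + 1.0


def inter_node_slowdown(size_bytes: int) -> float:
    """Ratio of inter-node to intra-node transfer time for one message."""
    intra = transfer_time_us(size_bytes, INTRA_BW_GB_S, INTRA_LATENCY_US)
    inter = transfer_time_us(size_bytes, INTER_BW_GB_S, INTER_LATENCY_US)
    return inter / intra


for size in (4 * 1024, 64 * 1024 * 1024):
    print(f"{size:>10} bytes: {inter_node_slowdown(size):.3f}x")
```

With these assumed baselines, a 64 MiB transfer slows down by roughly 3%, while a 4 KiB message takes nearly twice as long. That matches the quote's argument: the penalty is negligible as long as traffic is dominated by large, bandwidth-bound transfers.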

So while Nvidia creates a giant server out of the 72 GPUs of an NVL72, Huawei manages to do it with 384 Ascend chips with the help of the UB protocol.

Source: Huawei

The Superpod looks rather large, as it has multiple racks! What are the downsides? There is no free lunch in this world. A Superpod requires much more power than an NVL72, and its scale-up networking is much more expensive. But the good thing is that China has a lot of power, and the supply chain for the Superpod is domestic from China’s perspective.

Source: Huawei

Huawei AI Factory

Huawei doesn’t just provide Superpods; it provides complete data center scale designs with reference architectures and networking solutions. Multiple Superpods can be connected into a gigantic network of up to 165K NPUs! Recall that the LongCat team used just 50K NPUs.

Source: Huawei

Copper or Optics for CloudMatrix 384?

A quick visual comparison with NVL72: it is a single slim rack with copper backplane connectivity for scale-up (NVLink, connected with NVSwitches). To be sure, there is a lot of copper (worth USD 100K plus, to be precise). Look at the image. In Nvidia’s AI factory concept, copper is used for ‘scale-up’ and optics is used for ‘scale-out’. Huawei uses optics for both!

Source: Nvidia

How, then, are 384 NPUs connected with one another for scale-up? NPUs within a given node can talk to each other via UB switches and copper-based connectivity, but **to talk to other NPUs within a Superpod, they have to use optical connectivity, as the distances are longer than 2m, which copper cannot handle.**

SemiAnalysis believes they are using short-range (SR8) optical transceivers for such scale-up networking. The other option is LPO (Linear Pluggable Optics). SemiAnalysis provides a great cost comparison of networking options (subscribing is highly recommended).

Source: SemiAnalysis

In comparison to NVL72, CloudMatrix 384 takes much more space as well (obviously as it is many racks next to one another) and requires more expense on networking and also on power.

Chip Level and system level Comparison between Nvidia and Huawei

This table from SemiAnalysis provides a great comparison between NVL72 based on GB200 and CloudMatrix 384 based on Ascend 910C.

Quote from SemiAnalysis: “A full CloudMatrix system can now deliver 300 PFLOPs of dense BF16 compute, almost double that of the GB200 NVL72. With more than 3.6x aggregate memory capacity and 2.1x more memory bandwidth, Huawei and China now have AI system capabilities that can beat Nvidia’s.

What’s more, is the CM384 is uniquely suited to China’s strengths, which is domestic networking production, infrastructure software to prevent network failures, and with further yield improvements, an ability to scale up to even larger domains.

The drawback here is that it takes 4.1x the power of a GB200 NVL72, with 2.5x worse power per FLOP, 1.9x worse power per TB/s memory bandwidth, and 1.2x worse power per TB HBM memory capacity.”

Source: SemiAnalysis
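The quoted "power per unit" figures follow from dividing the power ratio by each capability ratio. The sketch below redoes that division using only the rounded numbers in the quote, so the results land close to, but not exactly on, the quoted values. The implied compute ratio is derived here, not stated by SemiAnalysis in the quote.

```python
POWER_RATIO = 4.1  # CloudMatrix 384 vs GB200 NVL72, from the quote

# Capability ratios stated in the quote (CloudMatrix 384 / GB200 NVL72).
CAPABILITY_RATIOS = {"HBM bandwidth": 2.1, "HBM capacity": 3.6}

# "Worse power per unit" figures stated in the quote, for comparison.
QUOTED_POWER_PER_UNIT = {"HBM bandwidth": 1.9, "HBM capacity": 1.2}
QUOTED_POWER_PER_FLOP = 2.5


def power_per_unit(power_ratio: float, capability_ratio: float) -> float:
    """How much more power is spent per unit of capability."""
    return power_ratio / capability_ratio


derived = {name: power_per_unit(POWER_RATIO, ratio)
           for name, ratio in CAPABILITY_RATIOS.items()}

# Working backwards: the compute ratio implied by 4.1x power and 2.5x power per FLOP.
implied_compute_ratio = POWER_RATIO / QUOTED_POWER_PER_FLOP

for name, value in derived.items():
    print(f"{name}: derived {value:.2f}x vs quoted {QUOTED_POWER_PER_UNIT[name]}x")
print(f"Implied dense BF16 compute ratio: {implied_compute_ratio:.2f}x")
```

The small gaps between derived and quoted values are consistent with the inputs having been rounded before publication.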

UB-Mesh: Huawei’s answer to NVLink

I bet we are going to hear a lot about UB-Mesh in coming days. UB-Mesh (built on Unified Bus, or UB) originated as Huawei’s proprietary technology, but Huawei has since open-sourced the specification. It is currently in a transition phase: while it is open-source and free to license, it is not yet an industry standard recognized by global bodies like the IEEE or IETF.

NVLink simplified networking a lot for Nvidia GPUs (they later developed NVLink-C2C, which works between Nvidia CPUs and GPUs). With UB-Mesh, Huawei is going a step further by unifying the network between GPUs and CPUs as well.

UB-Mesh is explicitly designed to replace the PCIe/NVLink/IB “hybrid” stack with one unified, memory-semantic (load/store/atomic) fabric, and its ambition is specifically to extend NVLink-like peer memory access beyond a single scale-up node to a much larger mesh domain.

The UB’s peer-to-peer communication capabilities enable efficient pooling of hardware resources, including DDR DRAM, CPUs, NPUs and NICs. For instance, CPUs and NPUs are pooled via UB interconnects to enhance resource utilisation.

Source: Huawei

Different types of physical links proposed by UB-Mesh based on Distance:

Source: Huawei

Every KV Cacher’s Dream: Memory Pooling

**Memory pooling in CloudMatrix384 works by aggregating the DRAM attached to all 192 Kunpeng CPUs across the supernode into a single, shared, high-performance memory pool, rather than leaving each CPU’s memory siloed to its own node.**

This pool is made accessible to every one of the 384 Ascend 910C NPUs, whether they’re doing prefill or decode work, over the ultra-high-bandwidth Unified Bus (UB) network, with uniform bandwidth and latency regardless of where the data physically sits. This is a deliberate departure from conventional “KVCache-centric” designs, where a request has to be routed to the specific node holding its cached KV data because remote access is too slow; here, any NPU can pull from the shared pool directly, which decouples request scheduling from data locality entirely.

The practical payoff is threefold: it eliminates the under-utilised, siloed DRAM problem that plagues legacy architectures, it simplifies scheduling since requests no longer need cache-aware routing, and it raises overall cache hit rates and resilience under bursty or uneven workloads.
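The scheduling benefit can be illustrated with a toy simulation. It is a sketch under strong assumptions: requests are spread round-robin with no cache-aware routing, caches have unlimited capacity, and a "hit" just means the prefix was seen before. None of this models Huawei's actual implementation.

```python
from itertools import cycle


def hit_rate_node_local(requests: list, num_nodes: int) -> float:
    """Each node keeps its own cache; a request only hits if its node saw the prefix."""
    caches = [set() for _ in range(num_nodes)]
    hits = 0
    for prefix, node in zip(requests, cycle(range(num_nodes))):
        if prefix in caches[node]:
            hits += 1
        caches[node].add(prefix)
    return hits / len(requests)


def hit_rate_pooled(requests: list) -> float:
    """All nodes share one pool, so it does not matter which node serves a request."""
    pool = set()
    hits = 0
    for prefix in requests:
        if prefix in pool:
            hits += 1
        pool.add(prefix)
    return hits / len(requests)


# Two popular prompt prefixes, scheduled round-robin over 4 nodes.
requests = ["sys-prompt-A", "sys-prompt-B"] * 4

print(f"node-local caches: {hit_rate_node_local(requests, num_nodes=4):.2f}")
print(f"shared pool:       {hit_rate_pooled(requests):.2f}")
```

In this toy setup the siloed caches each have to warm up separately, while the shared pool pays the miss once per prefix. Conventional systems close that gap with cache-aware routing, which is exactly the scheduling constraint a shared pool removes.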

This pooled memory substrate is what powers services like **Huawei’s elastic memory service (EMS)**, which specifically accelerates KV cache reuse, model weight loading, and checkpointing by giving them memory-class bandwidth and latency instead of falling back to slower, disk-bound I/O paths.

Such pooling has been a dream in the Nvidia ecosystem, where it is implemented in a software layer: for example, SGLang HiCache using Mooncake as its L3 storage backend provides **distributed memory pooling**. In Huawei CloudMatrix, the necessary support is available at the hardware layer to pool CPU memory across the entire CloudMatrix 384, which has 192 Kunpeng CPUs!

Source: Huawei

Dealing with component failures in CloudMatrix 384

UB-Mesh employs an N+1 high availability design: each rack includes an additional backup NPU. In the event of unexpected failures in the NPUs within the system, the backup NPU is activated to restore functionality and ensure the uninterrupted continuation of LLM training jobs.

Additionally, the routing system facilitates rapid failure recovery in the case of link failures through a novel direct-notification technique.
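The N+1 idea can be sketched as a rack that keeps one idle spare and swaps it into the slot of a failed NPU. The class, the NPU ids and the rack size below are hypothetical; the real mechanism described in the UB-Mesh paper involves routing and link-level details not modeled here.

```python
class Rack:
    """Toy N+1 rack: N active NPUs plus one idle backup."""

    def __init__(self, active_ids, backup_id):
        self.active = list(active_ids)
        self.backup = backup_id

    def fail(self, npu_id):
        """Replace a failed active NPU with the backup and return the replacement id."""
        if npu_id not in self.active:
            raise ValueError(f"NPU {npu_id} is not active in this rack")
        if self.backup is None:
            raise RuntimeError("no backup NPU left; the job must be rescheduled")
        slot = self.active.index(npu_id)
        self.active[slot] = self.backup  # the spare takes over the failed slot
        self.backup = None
        return self.active[slot]


rack = Rack(active_ids=range(8), backup_id=8)  # hypothetical 8+1 rack
replacement = rack.fail(3)
print(f"NPU 3 failed; slot now served by NPU {replacement}")
print(f"active NPUs: {rack.active}")
```

The job keeps the same number of active NPUs after one failure. A second failure in the same rack has no spare to draw on, which is why the design is called N+1 and not N+k.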

Check out UB-Mesh: a Hierarchically Localized nD-FullMesh Datacenter Network Architecture, which describes in detail how ‘auto-healing’ is implemented. This allows Huawei to use a large number of lower-performance components and still deliver reliably.

(To be clear: CloudMatrix384’s UB design is a precursor to the UB-Mesh proposed in the above-mentioned paper, so it may not have all the features. But the upcoming Ascend 950 based system should have all these ‘auto-healing’ features.)

What does the future hold for Huawei?

Logic dies are not an issue for Huawei. SMIC is scaling production of 7nm dies for Huawei, and they are designed by HiSilicon. The same goes for CPUs, networking and even optics. While Huawei does not have DSP IP comparable to Broadcom’s and Marvell’s, HiSilicon designs ‘good enough’ DSPs for them. They can also make do with LPO (Linear Pluggable Optics) for transceivers, since LPO eliminates the need for DSPs. The ‘average DSP’/LPO approach works because Huawei’s focus is on providing “complete AI factories” rather than individual components. They can ‘tune’ / ‘co-design’ different components together.

Huawei’s major bottleneck is HBM. The Ascend 910C has 128GB of older-generation HBM with 3.2TB/s bandwidth, compared to 192GB on B200 at an enormous 8TB/s. B300 has 288GB at the same 8TB/s bandwidth. Ascend 910C looks pale in comparison. They have stockpiles for only 1.6M Ascend 910Cs (source: SemiAnalysis).

That said, as we saw, the entire Chinese labs ecosystem is working on developing techniques that reduce the need for HBM and HBM bandwidth and use more of DRAM/LPDDR and SSD. DeepSeek is the clear leader and most other labs adopt or adapt DeepSeek’s innovations.

For example,

DeepSeek V4’s CSA and HCA reduce the HBM needed for KV cache by 98% compared to a GQA (grouped-query attention) baseline.
DeepSeek’s DSpark reduces HBM bandwidth by 66% for the decode stage, which is memory-bandwidth hungry.
DeepSeek’s Engram/LongCat’s n-gram trades CPU memory for FLOPs and HBM.
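To see what the percentages above mean in practice, a reduction fraction can be converted into a "how much further does the same hardware go" multiplier. This is plain arithmetic on the numbers quoted in the list and assumes nothing else in the system changes, which is a simplification.

```python
def stretch_multiplier(reduction: float) -> float:
    """If a resource need drops by `reduction`, the same budget stretches 1 / (1 - reduction) times."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


kv_cache_multiplier = stretch_multiplier(0.98)  # 98% less HBM per cached token
decode_bw_multiplier = stretch_multiplier(0.66)  # 66% less HBM traffic in decode

print(f"KV cache: same HBM holds about {kv_cache_multiplier:.0f}x more cached tokens")
print(f"Decode: same HBM bandwidth serves about {decode_bw_multiplier:.1f}x more traffic")
```

On that reading, a 98% cut is worth roughly a 50x stretch of KV cache capacity and a 66% cut roughly a 3x stretch of decode bandwidth, which is why such techniques can offset a large per-chip HBM deficit.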

So HBM is not going to hold Chinese ecosystem back for long. CXMT is also working on HBM3. While yield may be low for now, given their track record, one can fully trust CXMT to overcome those challenges.

China now has home grown giants for DRAM/LPDDR and SSDs - those are CXMT and YMTC respectively. Even if they are not at the frontier of their respective technologies, they will be able to support China’s AI ecosystem as they have already reached ‘good enough’ capabilities and scale, and they are not stopping soon. SemiAnalysis has done excellent analysis on emergence of CXMT as a memory giant. It is clear, CXMT and YMTC may not help China build data centers at the same scale as the United States, but they can produce globally significant amounts of memory.

DRAM global market and CXMT’s share:

**NAND global market and YMTC’s share:**

Source: Counterpoint

Further, **what Huawei can always count on is the ingenuity of Chinese labs in making do with what they have and working with less HBM and HBM bandwidth.** Chinese labs like ZAI have further demonstrated that their models, like GLM 5.2, are RSI (recursive self-improvement) capable. DeepSeek, Moonshot, LongCat etc. will arrive there in a couple of months. So the entire AI ecosystem of China has reached what is often called ‘escape velocity’.

What does the future hold for Longcat Lab?

The behaviour of Chinese companies baffles westerners. If Uber, DoorDash or Instacart are not building AI (LLM) models, why do their Chinese equivalents need to? If Ford and GM don’t build AI (LLM) models, why is Xiaomi building them? (I am saving the Xiaomi AI story for another blog.)

Here is the reason: China has multiple super apps, **WeChat (Tencent), AliPay (Ant), Douyin (ByteDance), TaoBao (Alibaba)**, and the West hasn’t seen anything like them yet. When westerners visit China they are always impressed by how cool Chinese super apps are and how convenient daily life can be with their help. With the help of AI, these super apps are going to be even more powerful. They will help their users run their daily lives on auto-pilot. Whoever does it best wins over the most users and makes the most revenue, which could be hundreds of billions annually. Hence, Meituan treats AI & LLM tech as a core competency. It is a matter of survival, and they will march ahead with even more investment.

Meituan has many challenges, including cut-throat food delivery competition with Alibaba and JD.com that is bleeding billions; but AI is an area where they cannot afford to compromise. How could they? All their competitors have major AI labs, **WeChat (Tencent), AliPay (Ant), Douyin (ByteDance), TaoBao (Alibaba)**, and are using AI to win even more business!

Meituan: AGI shall be built funded by millions of deliveries a day!

Source

**Story of Meituan by @kevinsxu:** https://x.com/kevinsxu/status/2071981791007658280
**Chinese food delivery giant Meituan’s co-founder, Wang Huiwen, quits corporate roles owing to ‘health reasons’ after starting an AI venture:** https://www.scmp.com/tech/big-tech/article/3225426/chinese-food-delivery-giant-meituans-co-founder-wang-huiwen-quits-corporate-roles-owing-health
**China’s Overlooked AI Model Makers: Xiaomi, Meituan, and StepFun by Tech Buzz China:** https://techbuzzchina.substack.com/p/chinas-overlooked-ai-model-makers
**DeepSeek’s 10T USD Grand strategy:** https://x.com/bookwormengr/status/2057909493250539891?s=20
**DeepSeek Engram - Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models:** https://arxiv.org/pdf/2601.07372
**LongCat Flash (n-gram) - Scaling Embeddings Outperforms Scaling Experts in Language Models:** https://arxiv.org/pdf/2601.21204
**SemiAnalysis - CloudMatrix 384:** https://newsletter.semianalysis.com/p/huawei-ai-cloudmatrix-384-chinas-answer-to-nvidia-gb200-nvl72
**Huawei: UB-Mesh: a Hierarchically Localized nD-FullMesh Datacenter Network Architecture:** https://arxiv.org/pdf/2503.20377
**Huawei: Serving Large Language Models on Huawei CloudMatrix384:** https://arxiv.org/pdf/2506.12708
**Counterpoint research - CXMT market share:** https://counterpointresearch.com/en/insights/global-dram-revenue-surges-to-near-dollar-100-billion-mark-in-q1-2026
**Counterpoint research - YMTC market share:** https://counterpointresearch.com/en/insights/nand-revenues-record-high-q1-2026-from-ai-demand
**China’s CXMT Is Set to Challenge DRAM Incumbents:** https://newsletter.semianalysis.com/p/chinas-cxmt-is-set-to-challenge-dram
