LLM History Series — Presentation 09

Chinese Frontier Labs — The R1 Moment

A serious technical and human portrait of the Chinese frontier-lab ecosystem — DeepSeek, Qwen, Kimi, Zhipu, MiniMax, 01.AI, Baidu Ernie. How efficient training and open weights came naturally under export-control pressure, and why the January 2025 R1 release reordered what people think the frontier looks like.

DeepSeek · Qwen · Moonshot/Kimi · Zhipu/GLM · MiniMax · 01.AI · Ernie · Jan 2025 R1
Tsinghua / Peking pipeline → 2022 wave of labs → Export controls (Oct 2022) → Efficient training era → DeepSeek V3 (Dec 2024) → R1 (Jan 2025)
00

What This Deck Covers

The Chinese frontier-adjacent labs as serious research institutions, not as geopolitical talking points. The technical contributions (efficient training, MoE architectures, open-weight releases), the people, the pipelines, and the constraints — particularly export controls on advanced GPUs — that have shaped how the work has been done.

01

The Setting — Chinese Tech, Compute, Export Controls

The Chinese AI ecosystem in 2026 looks unlike its US counterpart in three structural ways. Each shaped what the labs ended up building.

1. Big-tech context

Alibaba, Tencent, Baidu, ByteDance, Huawei each run sizeable AI research arms with substantial commercial revenue. Frontier-LLM work happens in both these incumbents (Qwen at Alibaba, Ernie at Baidu, Hunyuan at Tencent, Doubao at ByteDance) and in independent start-ups (DeepSeek, Moonshot, Zhipu, MiniMax, 01.AI).

2. Export-control pressure

In October 2022 the US Department of Commerce restricted A100/H100 sales to China; in October 2023 the rules were tightened to cover the workaround H800/A800 chips. The most capable training-class GPUs legally available in China since then are the H20 and similar deliberately cut-down parts. This pressure has been the dominant external constraint on Chinese frontier training for three years.

3. Domestic regulation

The 2023 Generative AI rules require pre-deployment safety reviews for consumer-facing models, content alignment with national-security guidelines, and disclosure obligations. The regulatory regime is much more prescriptive on output content than the US or EU equivalents, but largely permissive on training and architecture.

A useful frame

The export-control pressure is the most-discussed factor and probably overrated as a long-run constraint. The Chinese labs have responded by becoming genuinely better at efficiency — smaller training runs, more aggressive MoE, more careful curriculum, more systems-level optimisation. Several technical innovations widely credited to DeepSeek and others (Multi-head Latent Attention, optimised pipeline schedules, FP8 production training) were partly motivated by the chip constraints. The constraint produced techniques that travel.

02

DeepSeek — Liang Wenfeng & High-Flyer

DeepSeek is the most distinctive of the Chinese frontier labs because it is essentially the AI research arm of a quantitative hedge fund. High-Flyer (also written as Huanfang, 幻方) is a quantitative trading firm based in Hangzhou, founded in 2015 by Liang Wenfeng and a small group of Zhejiang University alumni.

LW

Liang Wenfeng — founder, High-Flyer & DeepSeek

Zhejiang University (CS); High-Flyer founded 2015; DeepSeek announced 2023

Chinese. Worked on quantitative trading using ML methods through the late 2010s; High-Flyer became one of the larger Chinese quant funds. The fund accumulated a substantial GPU cluster (reportedly tens of thousands of A100/H800 class GPUs) before the export controls landed, and Liang reallocated a meaningful share of that compute to AI research starting around 2021. This is the single most important fact about DeepSeek: it was an unusually well-resourced AI research start-up funded by a profitable trading business, with an existing in-house infrastructure team and unusual freedom from short-term commercial pressure. Liang himself maintains a low public profile but has given a small number of widely-circulated interviews to Chinese tech press, framing the lab's mission as pushing open-weight frontier capability.

The DeepSeek model line

Date | Model | What it added
2023 | DeepSeek LLM, DeepSeek Coder | First open-weight releases; competitive but not headline-grabbing.
May 2024 | DeepSeek V2 (236B MoE, 21B active) | First release to attract wide attention. Multi-head Latent Attention (MLA) dramatically shrinks the KV cache. Aggressive open-weight licensing.
Dec 2024 | DeepSeek V3 (671B MoE, 37B active) | Frontier-quality base model. Reportedly trained for ~$5.5 M of cluster time (a number the company published; widely debated).
Jan 2025 | DeepSeek R1 | Reasoning model trained primarily with RL on chains of thought. Roughly matches OpenAI o1 on most reasoning benchmarks. Open weights, MIT-style licence.
2025 | R1-distill family | Smaller open-weight models distilling R1's reasoning behaviour. Among the most-downloaded models on Hugging Face for months.
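The MLA idea from V2 is worth a small illustration. The sketch below shows the latent KV-cache compression at its core in PyTorch-style code with invented dimensions; the decoupled rotary-embedding path and everything else in the V2/V3 reports is omitted, so treat it as a reading aid rather than DeepSeek's implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the KV-cache compression idea behind Multi-head
# Latent Attention (MLA). Dimensions are invented; the decoupled
# rotary-embedding path used in the published design is omitted.

class LatentKVCache(nn.Module):
    def __init__(self, d_model=4096, n_heads=32, d_head=128, d_latent=512):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.down_kv = nn.Linear(d_model, d_latent, bias=False)        # compress
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to K
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to V

    def forward(self, hidden, cached_latent=None):
        # hidden: (batch, new_tokens, d_model)
        latent = self.down_kv(hidden)                 # (batch, new_tokens, d_latent)
        if cached_latent is not None:
            latent = torch.cat([cached_latent, latent], dim=1)
        # Only `latent` is kept between decoding steps: 512 numbers per token
        # here, versus 2 * 32 * 128 = 8192 for a conventional per-head K/V
        # cache -- a 16x reduction at these sizes.
        b, t, _ = latent.shape
        k = self.up_k(latent).view(b, t, self.n_heads, self.d_head)
        v = self.up_v(latent).view(b, t, self.n_heads, self.d_head)
        return k, v, latent   # hand k, v to standard attention; re-cache latent
```

In the published design the expansion matrices can be absorbed into the query and output projections, so the decompression adds little compute; the saving is almost entirely in cache memory, which is what matters most for long-context serving.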
The hedge-fund-as-AI-lab structure

The unusual structural feature of DeepSeek is that it does not need outside fundraising in the way most start-ups do. High-Flyer's trading revenue funds the research; the lab's open-weight releases double as recruitment tools and as strategic public goods. Liang has framed this in interviews as a long-horizon bet on Chinese AI capability rather than as a near-term commercial play. The structure gives DeepSeek an unusual freedom to publish weights and recipes, and may be why it has done so more aggressively than many of the other Chinese labs.

03

The R1 Moment — January 2025

DeepSeek R1 was released on 20 January 2025 alongside a detailed technical report. It is one of the most important single releases in LLM history.

What was technically novel

  • R1-Zero — trained with pure reinforcement learning on the base model with rule-based reward (correctness on math/code), no supervised fine-tuning on chain-of-thought data. The RL produced long, structured reasoning behaviour spontaneously.
  • R1 — adds a small amount of supervised fine-tuning before RL for human-readable formatting.
  • Group Relative Policy Optimization (GRPO) — their RL algorithm, simpler and cheaper than PPO because it drops the learned value function and normalises rewards within a group of sampled completions (see the sketch after this list).
  • Open weights, open paper, open code for inference.
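As a reading aid, here is a minimal sketch of the group-relative advantage computation described in the R1 and DeepSeekMath reports, assuming a rule-based 0/1 correctness reward. Names and shapes are invented for the example, and the surrounding policy-gradient and KL-penalty machinery is left out.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: for each prompt, sample a group of G
    completions, score each with a rule-based reward (e.g. 1.0 if the final
    answer is correct, else 0.0), then normalise within the group.

    rewards: (num_prompts, G) tensor of per-completion rewards.
    Returns a tensor of the same shape. No learned value function is needed,
    which is the main simplification relative to PPO."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Illustrative use: 2 prompts, 4 sampled completions each.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
adv = grpo_advantages(rewards)
# Each completion's advantage then weights the token-level log-prob ratio in
# a clipped policy-gradient objective, as in PPO but without a critic.
```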

What was strategically novel

  • Frontier-tier reasoning capability available to anyone with hardware.
  • Detailed recipe enabling other labs to replicate within weeks.
  • Training cost an order of magnitude (or more) below US frontier estimates.
  • The Chinese frontier was suddenly visibly close to the western one.

The market reaction

NVIDIA's stock fell 17% on 27 January 2025 (the largest single-day market-cap loss in stock-market history at that point). The proximate cause was a re-evaluation of how much frontier-AI training compute was actually needed if Chinese-style efficiency was reproducible. The longer-run effect on the AI-infrastructure investment narrative is still being processed.

We did not set out to make a political statement. We saw a way to do reasoning training more efficiently, we ran the experiments, the results were good, and we published. — Liang Wenfeng, paraphrased from interviews following the R1 release. The lab's external posture has been deliberately low-key relative to the market reaction.
The honest accounting on costs

The widely-cited "$5.5 M training cost" for DeepSeek V3 is the marginal cost of the final training run, not the total capital invested. The actual lab cost (people, prior runs, infrastructure built up over years) is much higher. But on any fair accounting, DeepSeek V3 / R1 was trained at substantially lower compute cost than its US-frontier-lab counterparts, and the techniques are reproducible. That's the part that mattered.
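A back-of-envelope reconstruction of where the headline number comes from, using the GPU-hour count and the per-hour rental rate the V3 report itself assumed (the figures below are the company's own accounting, not an independent estimate):

```python
# Reconstructing the published "$5.5 M" figure for DeepSeek V3 from the
# report's own inputs. These are the company's numbers, not an audit.
h800_gpu_hours = 2_788_000       # reported GPU-hours for the final training run
rate_usd_per_gpu_hour = 2.00     # the report's assumed H800 rental price
final_run_cost = h800_gpu_hours * rate_usd_per_gpu_hour
print(f"${final_run_cost / 1e6:.2f} M")   # -> $5.58 M

# Not included: salaries, failed and exploratory runs, data work, and the
# capital cost of the cluster itself -- which is why reading this as the
# lab's total spend misleads.
```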

04

Qwen / Alibaba DAMO

Alibaba's DAMO Academy ("Discovery, Adventure, Momentum and Outlook"; the name is famously a Jack Ma flourish) is the company's research arm. Its largest LLM programme is the Qwen family.

The Qwen model line

  • Qwen 1.0 (Aug 2023) — first open-weight release.
  • Qwen 1.5, 2, 2.5 family (2024) — sizes from 0.5B to 110B; multimodal variants (Qwen-VL, Qwen-Audio).
  • Qwen 3 (2025) — expanded MoE line; competitive at the frontier on some benchmarks.
  • Qwen Chat / Qwen Coder / Qwen Math — specialist sub-lines.
  • QwQ and related reasoning variants (late 2024 onward).

Why Qwen matters

  • Most-downloaded open-weight model family on Hugging Face for much of 2024.
  • Strong multilingual coverage (especially Chinese and Asian languages where Llama is weak).
  • The dominant open-weight option for many non-English applications, including a large share of non-Chinese deployments.
  • Backed by Alibaba Cloud's serving infrastructure for commercial customers.

Senior leadership of the Qwen programme sits at associate-distinguished-engineer level inside Alibaba; DAMO overall is large (low thousands of researchers) but the Qwen team specifically numbers a few hundred. The most public face is Junyang Lin, who has been the team's most visible external voice on Twitter / X.

A pattern: the big-tech labs vs the start-ups

Qwen, Hunyuan (Tencent), Doubao (ByteDance) and Ernie (Baidu) are big-tech research arms. They have effectively unlimited capital and can afford long timelines. The independent labs — DeepSeek, Moonshot, Zhipu, MiniMax, 01.AI — tend to be smaller, more research-aggressive, and more open-weight by default. The two groups serve genuinely different roles in the Chinese ecosystem, much as Microsoft Research and OpenAI do in the US.

05

Moonshot AI / Kimi

Moonshot AI was founded in March 2023 in Beijing by Yang Zhilin, a young researcher with a Tsinghua and CMU background. Its flagship product is Kimi, a consumer-chat assistant that for much of 2024 was one of the most-used Chinese chatbots.

YZ

Yang Zhilin (杨植麟) — founder, Moonshot AI

Tsinghua (BS) → CMU (PhD, William Cohen) → Recurrent.AI → Moonshot

One of the more academically credentialed of the Chinese AI-lab founders. Co-authored Transformer-XL and XLNet during his PhD at CMU. Returned to China around 2019, founded Recurrent.AI, then Moonshot in 2023. His public posture is more research-focused than most peers'; he gives technical talks and Q&As frequently.

Kimi specialises in long-context reading and reasoning — 200k Chinese-character context launched February 2024, 2 M-character context December 2024, ahead of the western frontier on context length for a window of months. The k1.5 reasoning model in early 2025 is competitive with R1 on math benchmarks at much smaller scale.

A point of differentiation

Among the Chinese labs, Moonshot has been the most aggressive on context-length-as-product-feature. Kimi pushed long-context models when the rest of the field was still at 8–32k. The product use cases — uploading entire textbooks or codebases for chat — landed strongly with Chinese knowledge workers, and gave Moonshot a distinctive consumer brand position.

06

Zhipu AI / GLM

Zhipu AI (智谱AI) was founded in 2019 as a spinoff from Tsinghua University's Knowledge Engineering Group, led by Tang Jie (a senior Tsinghua professor) and a team of Tsinghua PhDs.

The lab's flagship line is GLM (General Language Model), a family built on an autoregressive blank-infilling objective that unifies autoregressive and autoencoding pretraining; it has been actively published since 2021. GLM-130B (released 2022) was one of the largest open-weight models from any lab anywhere at the time. ChatGLM-6B was one of the first widely deployed open Chinese-language chat models. By 2025 the GLM line includes specialist variants for coding (CodeGeeX), vision-language (CogVLM), agents and reasoning.

TJ

Tang Jie (唐杰) — co-founder, Chief Scientist

Tsinghua University professor; co-founder Zhipu AI

Senior academic at Tsinghua's Department of Computer Science; ran the Knowledge Engineering Group that built much of China's academic-NLP toolkit before LLMs (AMiner, the academic-paper-and-citation graph, is also his work). Zhipu is in many ways an industrial spinout of an established academic group, with the cultural posture that implies — more publications, more PhD-style mentoring of junior researchers, slower commercial cadence.

A useful pattern

Zhipu sits in the same structural position in Chinese AI as Allen AI does in US AI: a research-credentialed lab with an academic cultural lineage that publishes more than its industrial peers and trains more junior researchers. Its impact is felt in the quality of the next generation of Chinese ML researchers as much as in its product-shipping cadence.

07

MiniMax, 01.AI, Baidu Ernie

The remaining major Chinese frontier-or-adjacent labs.

MiniMax

Founded 2021 by Yan Junjie, ex-SenseTime senior vision researcher. Based in Shanghai. Built consumer chatbot Talkie / Glow plus the MiniMax-01 series of MoE models. The first to ship a 4M-token context window product (early 2025). Strong on multimodal (audio, video) where many Chinese labs are weaker. Cap-table is a mix of Tencent, Alibaba and tier-1 venture investors.

01.AI

Founded 2023 by Kai-Fu Lee. Lee is one of the most recognisable Chinese tech figures — ex-Apple, ex-Microsoft Research Asia (which he founded in 1998), ex-Google China president, founder of Sinovation Ventures. 01.AI's Yi model line is an open-weight family that briefly led on some Chinese-language benchmarks in 2024. Its cultural posture is the most international among the Chinese labs — Lee speaks publicly in English to western audiences and has substantial cross-Pacific networks.

Baidu Ernie

The longest-running Chinese LLM programme — ERNIE 1.0 was released in 2019, before GPT-3. The ERNIE Bot consumer chatbot launched in March 2023 (a few months after ChatGPT) and has been Baidu's flagship AI product since. Architecture-wise an early advocate of knowledge-graph integration into pretraining. Currently runs on Baidu's Kunlun in-house chips for some inference workloads. Senior figures: Wang Haifeng (CTO of Baidu) is the long-serving research leader.

Tencent Hunyuan, ByteDance Doubao

Tencent's Hunyuan model line and ByteDance's Doubao are big-tech research arms with substantial in-house adoption (Hunyuan in WeChat, Doubao in Douyin/TikTok). Less internationally visible than Qwen because the products are mostly consumed by their parent companies' Chinese consumer apps, but technically serious. Doubao in particular has a large Chinese-language consumer footprint — for some periods of 2024 it was the largest-by-DAU Chinese AI chat product.

08

The Talent Pipeline — Tsinghua, Peking, USTC

The Chinese frontier-lab cohort is drawn from a strikingly concentrated pipeline. Three universities — Tsinghua, Peking and USTC (University of Science and Technology of China) — produce most of the senior research talent.

Tsinghua University, Beijing

The most concentrated computer-science programme in China. Yao Class (founded by Andrew Yao after his return from Princeton) is a famously selective undergraduate stream that has produced disproportionate numbers of senior CS researchers, including several Chinese-AI-lab founders. Tang Jie (Zhipu), Yang Zhilin (Moonshot), Wang Xiaochuan (Sogou/Baichuan), Tang Wei and many DeepSeek senior staff have Tsinghua links.

Peking University, Beijing

Strong NLP and ML faculty, slightly more theory-leaning than Tsinghua. Many DeepSeek and Qwen researchers come from PKU.

USTC, Hefei

Smaller but technically prestigious. Strong in computer-vision and quantum-computing more than NLP per se; nonetheless feeds the Chinese frontier through its alumni in Microsoft Research Asia, ByteDance and DeepMind-style trajectories.

The MSRA factor

An unusual structural fact: Microsoft Research Asia (MSRA), founded in 1998 by Kai-Fu Lee and run for years from a building in Beijing's Zhongguancun district, was for two decades the most important industrial research lab in China. Most senior Chinese AI researchers under 50 did internships, postdocs or full-time stints at MSRA before moving to Chinese big tech or founding their own labs. The Chinese AI ecosystem is, in talent-pipeline terms, an MSRA diaspora as much as a Tsinghua/PKU one.

The senior advisors

Andrew Yao (Tsinghua, Yao Class), Kai-Fu Lee (01.AI / Sinovation), Harry Shum (ex-Microsoft EVP, Tsinghua chair), Wang Haifeng (Baidu CTO), Tang Jie (Zhipu), and a handful of senior Tsinghua/PKU faculty form a small but tightly networked senior cohort that maps roughly onto the role Hinton/Bengio/LeCun and their immediate students play in the western field. Most research lineages run through them.

09

Constraints — Chips, Data, Regulation

Chips

The H800 ban (Oct 2023) and subsequent tightening pushed Chinese labs to: (1) make better use of older A100/H100 stockpiles purchased before controls; (2) work with H20 (the deliberately downgraded current-gen chip available legally); (3) use Huawei's domestic Ascend 910B (improving but still well behind H100 on most workloads); (4) get more out of every chip via efficient training.

Data

Chinese labs have unusually rich Chinese-language training data, including Baidu Baike, the academic-paper graph (AMiner), and specialised corpora on Chinese law, medicine and finance. They have thinner coverage of long-tail English-language data than US labs, and pretraining mixes typically reflect this, with a larger Chinese share and a smaller English share than US-lab models.

Regulation

The 2023 Generative AI rules require pre-deployment safety reviews and content alignment with national-security guidelines. In practice this primarily affects post-training and consumer-facing deployment; pretraining and base models are largely unconstrained. The regulatory environment is restrictive on outputs, permissive on architecture — the opposite of where US/EU regulation is heading.

Why the chip constraint matters less than people think

The bottleneck on frontier capability is not chip count alone; it is the product of (chips) × (data) × (algorithmic efficiency). The export controls hit the first factor hard. The Chinese labs responded by getting much better at the third — algorithmic and systems efficiency — and that has compounded across releases. The pattern is consistent with how research constraints have historically driven innovation in computer architecture, microelectronics and database systems: the most efficient designs come from teams without the option of throwing money at the problem.
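A toy illustration of that multiplicative frame, with numbers invented purely for the example (they are not estimates of any real lab):

```python
# Toy illustration of the (chips) x (data) x (algorithmic efficiency) frame.
# All numbers are invented for the example, not estimates of any real lab.
def effective_compute(chips, data, efficiency):
    return chips * data * efficiency

baseline = effective_compute(chips=1.0, data=1.0, efficiency=1.0)
constrained = effective_compute(chips=0.25, data=1.0, efficiency=3.0)

print(constrained / baseline)   # 0.75 -- a 4x chip deficit mostly offset
# Compounding gains in the efficiency factor can largely cancel a hard hit to
# the chip factor, which is the pattern the deck argues the export controls
# produced.
```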

10

Why Open Weights Came Naturally

Almost all of the Chinese frontier labs default to open weights. This is the opposite of the US-frontier pattern. The reasons cluster:

Strategic

  • If you are not the global API-revenue leader, the marginal economics of open weights versus closed are different. Distribution via Hugging Face or ModelScope is cheap; you are not foregoing $1 B/year of API revenue.
  • Open weights are a recruitment tool. They are also a signal of seriousness to the global research community.
  • They are a public-good contribution that is read in policy circles as China-positive.

Practical

  • Open weights build a community that helps debug, fine-tune and find use cases the lab itself does not cover.
  • They make inference under export-control constraints less dependent on any single hardware or cloud vendor.
  • They allow non-Chinese deployers to use the models without having to depend on Chinese cloud infrastructure (a politically meaningful detail).
  • They are how Chinese researchers participate in the global field even if their labs are not internationally hosted.
Chinese labs ship weights for the same reason Linux distributions ship source: it's cheaper than convincing the world to trust you, and the people who would benefit most from trusting you can't. — A common framing in 2025 commentary on the open-weight pattern. The exact attribution varies; the analysis has been made in multiple substacks and policy papers.
The reciprocal effect

Once DeepSeek, Qwen and a handful of others had shipped frontier-quality open weights, the western open-weight ecosystem (Meta Llama, Mistral) had cover to keep going. The pressure to "match the Chinese on openness" as a competitive feature is a meaningful additional reason Llama 4 and Mistral's recent releases have stayed open-weight. The dynamic is reciprocal in a way that pure US-policy framing tends to miss.

11

The Geopolitical Frame

The Chinese frontier-lab story sits inside a broader geopolitical context that matters for any honest read of where the field is going.

What is reasonably well-attested

  • The Chinese government considers AI capability strategically important, both economically and for national security.
  • The Chinese labs operate within Chinese law, including content-alignment requirements for consumer-facing products.
  • The export-control regime has measurably constrained the rate at which Chinese labs can scale; it has not prevented serious research.
  • The R1 release was a genuine technical achievement and not, as some early reactions suggested, a propaganda exercise. The subsequent replication and extension by labs around the world has confirmed the core technical claims.

What is genuinely contested

  • How close the Chinese frontier is to the US frontier in absolute capability. Estimates range from roughly 6 to 18 months behind, depending on benchmark, source and observer.
  • Whether Chinese-frontier-trained models are systematically constrained on outputs in ways that affect benchmark utility versus real-world utility. The published benchmarks suggest small effects; subjective use suggests sometimes larger ones.
  • How the export-control regime evolves. Several rounds of tightening have happened; further rounds are politically possible.
  • Whether the open-weight pattern continues as Chinese labs reach commercial maturity and face the same incentives that pushed western labs closed.
The honest read

Treating the Chinese frontier-lab ecosystem as a serious set of research institutions is the right starting point for any technical observer. The output speaks for itself; the people are well-credentialed; the techniques travel. Geopolitical concerns are legitimate and are properly handled at the policy layer (export controls, deployment restrictions). They do not change what the research is.

12

Cheat Sheet

The independent labs

  • DeepSeek — High-Flyer, Hangzhou.
  • Moonshot AI / Kimi — Yang Zhilin, Beijing.
  • Zhipu AI / GLM — Tang Jie, Tsinghua.
  • MiniMax — Yan Junjie, Shanghai.
  • 01.AI / Yi — Kai-Fu Lee, Beijing.

The big-tech labs

  • Qwen — Alibaba DAMO.
  • Ernie — Baidu, Wang Haifeng.
  • Hunyuan — Tencent.
  • Doubao — ByteDance.
  • Pangu — Huawei.

Three things to remember

  • Export-control pressure forced efficiency innovations that travel.
  • Open weights are the default, for strategic and practical reasons.
  • The Tsinghua/PKU/USTC + MSRA pipeline is the talent infrastructure.

What's next in the series

  • 10 — Future Directions. The only forecast deck in the series. Will date faster than the rest.