DeepSeek’s V3.1 AI Model Features UE8M0 FP8 Scale


In a post on Thursday, August 21st, DeepSeek stated that the UE8M0 FP8 scale of its V3.1 AI model is "specifically designed for upcoming domestic chips," though it did not disclose specific suppliers.

Market speculation suggests the new model may support multiple Chinese AI chip brands, rather than being limited to a single one.


Technical Details and Advantages of DeepSeek’s UE8M0 FP8

FP8 (an 8-bit floating-point format) accelerates AI training and inference by trading precision for lower memory and bandwidth usage. UE8M0 is the 8-bit, exponent-only encoding used for the accompanying scaling factors; together they can cut memory consumption by up to 75% relative to 32-bit floats, improving training efficiency and lowering hardware requirements.
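
As a rough illustration of where these figures come from (using the 680B-parameter model cited later in this article as the running example), the following Python sketch works out the weight-memory footprint at each precision; the byte counts are standard format widths, not DeepSeek-specific numbers:

```python
# Back-of-the-envelope memory math for the precision claims above.
# The 75% figure corresponds to dropping from 4-byte FP32 to 1-byte FP8.

PARAMS = 680e9  # 680B parameters (illustrative, per the example in this article)

bytes_per_param = {"FP32": 4, "BF16/FP16": 2, "FP8": 1}

for fmt, nbytes in bytes_per_param.items():
    tb = PARAMS * nbytes / 1e12
    saving = 100 * (1 - nbytes / 4)
    print(f"{fmt:>9}: {tb:5.2f} TB ({saving:.0f}% smaller than FP32)")
```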

This architecture is custom-designed for the hardware logic of Chinese chips, enabling the model to run smoothly on domestic hardware. Currently, Chinese-designed chips that support FP8 include products from HiSilicon (Huawei), Cambricon, Muxi, and Moore Threads.

Technical Details

Meaning of UE8M0

  • U: Unsigned, suitable for scenarios where activation values are typically non-negative.

  • E8M0: All 8 bits encode the exponent, with 0 bits for the mantissa. Flexibility is achieved through implicit normalization or dynamic mantissa adjustment.

  • Dynamic Mantissa Strategy: In practice, effective mantissa bits may be allocated dynamically (e.g., adjusted based on the exponent range), or the mantissa may be fixed at 1, giving representable values from 2^−128 to 2^127.
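
To make the bit layout concrete, here is a minimal Python sketch of encoding and decoding a UE8M0 scale byte. The bias of 128 is an assumption chosen to match the 2^−128 to 2^127 range quoted above, not a published DeepSeek parameter:

```python
import math

def ue8m0_decode(byte: int) -> float:
    """Interpret an unsigned 8-bit, exponent-only value as a power-of-2 scale."""
    assert 0 <= byte <= 0xFF
    return 2.0 ** (byte - 128)  # no sign bit, no mantissa: the value is 2^(e - bias)

def ue8m0_encode(scale: float) -> int:
    """Round a positive scale to the nearest representable power of 2."""
    e = round(math.log2(scale)) + 128
    return max(0, min(0xFF, e))  # clamp into the 8-bit exponent range

print(ue8m0_decode(128))    # 1.0
print(ue8m0_decode(0))      # 2**-128, the smallest representable scale
print(ue8m0_encode(0.023))  # 123, which decodes to 2**-5 = 0.03125
```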

FP8 Scale

Refers to the scaling factor used to normalize values during quantization, ensuring they fit within the representable range of FP8.

  • Block-Level Scaling: Tensors are divided into fixed-size blocks (e.g., 128×128 tiles), with each block sharing a single scaling factor. Compared to tensor-level scaling, this block-level approach expands the available dynamic range by dozens of times while retaining the 8-bit width.
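
A minimal sketch of this block-level scheme, assuming 128×128 tiles and the FP8 E4M3 maximum magnitude of 448 (common FP8 conventions, not DeepSeek's confirmed recipe): each tile's scale is rounded up to a power of 2 so it remains UE8M0-encodable.

```python
import numpy as np

TILE = 128
FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3

def block_scales(x: np.ndarray) -> np.ndarray:
    """Compute one power-of-2 scale per TILE x TILE block of a 2-D tensor."""
    rows, cols = x.shape
    scales = np.ones((rows // TILE, cols // TILE))
    for i in range(0, rows, TILE):
        for j in range(0, cols, TILE):
            amax = np.abs(x[i:i + TILE, j:j + TILE]).max()
            # Round the required scale up to a power of 2 so it stays UE8M0-encodable.
            scales[i // TILE, j // TILE] = 2.0 ** np.ceil(np.log2(amax / FP8_E4M3_MAX))
    return scales

x = (np.random.randn(256, 256) * 100).astype(np.float32)
s = block_scales(x)
print(s)  # one scale per 128x128 tile
print((np.abs(x[:TILE, :TILE]) / s[0, 0]).max() <= FP8_E4M3_MAX)  # True
```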

Advantages and Applications


Enhanced Hardware Efficiency

  • Memory Savings: Weight memory usage is roughly halved relative to 16-bit formats. For example, the weight file of a 680B model shrinks from 1.3–1.5TB to around 680GB.
  • Computation Acceleration: Because UE8M0 has no mantissa or sign bits, restoring data with the scaling factor requires only a multiplication by the corresponding power of 2 (an exponent-shift operation). This eliminates floating-point multiplication, normalization, and rounding logic, shortening the critical timing path (see the sketch below).
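
The following sketch shows why a power-of-2 scale is cheap to apply: multiplying a binary float by 2^k amounts to an integer add on its exponent field. This is a simplified float64 illustration (no overflow or subnormal handling), not actual accelerator logic:

```python
import struct

def mul_pow2_via_exponent(x: float, k: int) -> float:
    """Multiply x by 2**k by adding k to the IEEE-754 exponent field (float64)."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    exp = (bits >> 52) & 0x7FF                        # 11-bit biased exponent
    bits = (bits & ~(0x7FF << 52)) | (((exp + k) & 0x7FF) << 52)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

print(mul_pow2_via_exponent(3.5, 4))    # 56.0 == 3.5 * 2**4, no FP multiply used
print(mul_pow2_via_exponent(56.0, -4))  # 3.5
```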

Compatibility with Chinese Domestic Chips

  • Cambricon: The Siyuan 590 (思元590) chip supports FP8 precision, with a 40% increase in compute density over the previous generation.
  • Moore Threads: The first domestic GPU manufacturer to support native FP8, based on its MUSA Compute Capability 3.1 architecture.
  • Hygon: Its DCU (Deep Computing Unit) accelerators reduce memory usage by 30% and improve computing efficiency by 20% through FP8 optimization.

Industry Impact

  • Technological Breakthrough: DeepSeek-V3.1 is the first Chinese model to complete large language model (LLM) training in FP8, demonstrating the format's feasibility for ultra-large-scale training.
  • Closed-Loop Ecosystem: UE8M0 FP8 helps form a complete "domestic AI chips – domestic open-source models – downstream applications" ecosystem, accelerating Chinese AI chips' catch-up with the international state of the art.

Continued Breakthroughs in China’s Domestic AI Chip Industry

Zhitan AI, a Chinese think tank, noted on Friday (August 22nd) that both Huawei's 910D and Cambricon's Siyuan 690 chips could serve as the foundation for DeepSeek's new model.

In the past, DeepSeek's team primarily used NVIDIA chips for model development, so the shift to Chinese AI chips may pose challenges in stability, interconnect speed, and the software ecosystem.

Meanwhile, Huawei is actively building a complete AI hardware ecosystem to challenge NVIDIA domestically. Earlier this year, Huawei launched the CloudMatrix 384 computing system, integrating 384 Ascend 910C neural processing units (NPUs) and 192 Kunpeng server CPUs. Connected via a unified bus, it delivers ultra-high bandwidth and low latency.

Speculation about China’s next-generation AI chips has driven up the stock prices of related listed companies. On Friday, Cambricon and Hygon’s Shanghai-listed shares both rose by 20%. SMIC, mainland China’s largest wafer foundry (which also produces Huawei’s Ascend and Kirin chips), saw its Hong Kong-listed shares rise by 10.1%, closing at HK$56.90.
