Memory Supercycle | AI Tokenomics and the Next Memory Repricing Opportunity

As AI inference, long-context workloads and AI agents continue to scale, memory is moving from a traditional cyclical industry into one of the most important investment themes in the next phase of AI hardware.

  • Published on 05 Jun 2026


  • AI inference, long-context processing and AI agents are making memory bandwidth and capacity the next major hardware bottlenecks.
  • Token pricing shows that output tokens are far more expensive than input tokens, reflecting the higher memory-reading and capacity requirements during the decode stage.
  • Memory demand is expanding beyond HBM into LPDDR and NAND, creating re-rating opportunities across Nvidia, Micron, Samsung, SK Hynix, and China’s memory supply chain.

The AI investment narrative is entering its next chapter. Over the last two years, the market has focused on GPU scarcity. But as inference demand surges, context windows lengthen, and AI agents reach mass adoption, the bottleneck is shifting from raw compute to memory bandwidth and capacity.

At the same time, the consensus that frontier model labs are structural cash incinerators is starting to look out of date. Anthropic's annualised revenue run rate has surged from US$9 billion at the end of 2025 to over US$30 billion by April 2026, and the company is on track for its first quarterly operating profit in Q2 2026.

Once labs earn profits, they have both the ability and the incentive to pay more for the hardware bottleneck supporting that growth. In other words, the repricing of the AI value chain is extending from the model layer to the hardware bottleneck. And the most underrated bottleneck in today's AI hardware stack is not just the GPU — it is memory.

Memory Supercycle | Investigate AI Bottleneck from the Price Tag

To see why memory has become the most important hardware bottleneck of the inference era, the cleanest starting point is the model labs' price tag. Token prices are not just a revenue line — they also reveal the underlying cost asymmetry between different inference phases.

Across the major frontier models in both the US and China, the price of an output token is dramatically higher than the price of an input token — typically by a factor of several times, or even an order of magnitude. Anthropic, for example, charges US$5 per million input tokens for Claude Opus but US$25 per million output tokens — a 5x gap. OpenAI, DeepSeek, Qwen, and Kimi all show the same asymmetric pattern on their published price cards.

Table 1: Token Pricing Asymmetry Across Major Frontier Labs

| Lab | Model | Input (US$ / M tokens) | Output (US$ / M tokens) | Output / Input Ratio |
|---|---|---|---|---|
| Anthropic | Claude Opus 4.7 | 5.00 | 25.00 | 5x |
| OpenAI | GPT-5.5 | 5.00 | 30.00 | 6x |
| DeepSeek | V4 Pro | 0.435 | 0.87 | 2x |
| Moonshot | Kimi K2.5 | 0.60 | 3.00 | 5x |

Source: Anthropic, OpenAI, DeepSeek, Moonshot, and iFAST Compilation.

Data as of 26 May 2026.
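As a quick check on Table 1, the sketch below recomputes the output/input ratios from the listed prices and shows how the asymmetry feeds into the cost of a single request. The prices come from the table; the request size (10,000 input tokens, 2,000 output tokens) is a hypothetical example, not a figure from any lab.

```python
# Prices in US$ per million tokens, copied from Table 1.
PRICES = {
    "Anthropic Claude Opus 4.7": (5.00, 25.00),
    "OpenAI GPT-5.5": (5.00, 30.00),
    "DeepSeek V4 Pro": (0.435, 0.87),
    "Moonshot Kimi K2.5": (0.60, 3.00),
}


def output_input_ratio(input_price: float, output_price: float) -> float:
    """How many times more an output token costs than an input token."""
    return output_price / input_price


def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in US$ of one request, given per-million-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


RATIOS = {name: round(output_input_ratio(*p), 2) for name, p in PRICES.items()}

for name, ratio in RATIOS.items():
    print(f"{name}: {ratio}x")

# Hypothetical request: 10,000 input tokens and 2,000 output tokens.
example = request_cost(10_000, 2_000, *PRICES["Anthropic Claude Opus 4.7"])
print(f"Example request cost: US${example:.2f}")
```

In this hypothetical request, the 2,000 output tokens cost as much as the 10,000 input tokens, which is the 5x gap at work.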

To understand this price asymmetry, we must understand how a model lab makes money on inference. The lab's primary cost is GPU rental — every second a GPU runs, it burns cash. The core objective of any lab, therefore, is to serve as many user requests as possible within those rented GPU-hours.

To drive down per-unit cost, the lab pools multiple user requests together and processes them through the GPU simultaneously, a practice known as batching. The more users sharing one GPU pass, the lower the cost per user.

From a cost perspective, GPU inference breaks down into two distinct components:

  • Data-fetching cost: time spent loading model weights and any stored context from memory into the GPU's compute cores.
  • Compute cost: time spent on the actual mathematical operations — primarily matrix multiplications to predict the next token.
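The two cost components above can be put into a minimal sketch of why batching lowers per-user cost. It assumes, as a simplification, that the data-fetching time is paid once per GPU pass and shared by the whole batch, while compute time grows with the number of requests. The GPU rental rate and both timings are hypothetical placeholders, not figures from this article.

```python
def cost_per_request(gpu_usd_per_hour: float,
                     fetch_seconds: float,
                     compute_seconds_per_request: float,
                     batch_size: int) -> float:
    """US$ cost attributed to each request in one batched GPU pass.

    Assumption: weights are fetched once per pass (shared cost), while the
    compute time scales linearly with the number of requests in the batch.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    pass_seconds = fetch_seconds + compute_seconds_per_request * batch_size
    usd_per_second = gpu_usd_per_hour / 3600.0
    return usd_per_second * pass_seconds / batch_size


# Hypothetical inputs: US$3.60 per GPU-hour, 10 ms fetch, 1 ms compute per request.
for batch in (1, 10, 100):
    cost = cost_per_request(3.60, 0.010, 0.001, batch)
    print(f"batch={batch:>3}: US${cost:.8f} per request")
```

Under these assumptions the shared fetch cost shrinks towards zero per request as the batch grows, leaving the per-request compute cost as the floor.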

Prefill: Cheap, Because Almost Everything Is Shared

Prefill is the first phase of GPU inference, and it determines the cost of input tokens. When a user submits a prompt (say, a long document), the model processes the entire text in parallel, computing relationships between every pair of tokens in a single pass. Because the input is fixed up front, all of this work can be executed concurrently: the GPU only needs to read the model weights once for the whole prompt, so the data-fetching cost is spread across every input token and the cost per token falls sharply.

Decode: Why Output Tokens Are So Expensive

Decode is the second phase of inference, during which the model produces output tokens one at a time. Unlike prefill, decode cannot be parallelised in a single pass — and that is why output tokens are structurally more expensive. Two factors drive this.

1. There is no way to share costs within a single user.

Each generated token depends on the one before it: the second token requires the first to exist, the third requires the second, and so on. Output cannot be processed in parallel, the way prefill can. As a result, the cost of a single weight read is amortised across only one output token.

2. Batching cannot scale indefinitely.

Users expect near-interactive response speeds, so an overly large batch increases the wait time each user experiences and drives up latency. For AI agents, the constraint is sharper still: an agent often executes tens or hundreds of sequential steps, and any per-decode latency penalty compounds rapidly.

In short, even with batching, decode cannot share cost as efficiently as prefill. The matrix math itself is fast; what truly caps throughput is the repeated movement of large volumes of weights and KV cache from memory into the compute cores for every output token. As we argued in Memory supercycle: A structural shift rewriting supply and demand paradigm, the AI bottleneck is not just computing — it is memory bandwidth and capacity.
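To make the bandwidth argument concrete, here is a rough sketch of a decode step in which memory traffic is the only limit. It assumes that each step reads the weights once for the whole batch plus one KV cache per user, and that each user receives one token per step. All three numbers in the example (100 GB of weights, 2 GB of KV cache per user, 4,000 GB/s of bandwidth) are hypothetical, chosen only to show the shape of the trade-off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeSetup:
    weight_gb: float       # weights read from memory on every decode step
    kv_gb_per_user: float  # KV cache read per user on every decode step
    bandwidth_gb_s: float  # memory bandwidth available to the GPU


def step_seconds(setup: DecodeSetup, batch_size: int) -> float:
    """Time for one decode step if memory traffic is the only bottleneck."""
    gb_moved = setup.weight_gb + setup.kv_gb_per_user * batch_size
    return gb_moved / setup.bandwidth_gb_s


def tokens_per_second(setup: DecodeSetup, batch_size: int) -> float:
    """Aggregate throughput: every user in the batch gets one token per step."""
    return batch_size / step_seconds(setup, batch_size)


# Hypothetical system, for illustration only.
setup = DecodeSetup(weight_gb=100.0, kv_gb_per_user=2.0, bandwidth_gb_s=4000.0)
for batch in (1, 50, 500):
    latency_ms = step_seconds(setup, batch) * 1000
    print(f"batch={batch:>3}: {tokens_per_second(setup, batch):7.1f} tokens/s, "
          f"{latency_ms:6.1f} ms per token per user")
```

In this toy model, throughput improves with batch size but can never exceed bandwidth divided by the per-user KV cache, while the wait per token keeps growing. That is the tension between cost sharing and latency described above.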

Memory Supercycle | From HBM to Three-Tier Memory Hierarchy

As AI use cases continue to scale, labs are running into three forces converging at once, which together push memory demand even harder.

  • Context length explosion.

Frontier context windows have stretched from 200,000 tokens in 2024 to over 1 million tokens today. Since the KV cache* scales roughly in proportion to context length, the maximum memory footprint per user during decode has expanded roughly fivefold over the same period.

  • MoE (Mixture-of-Experts) architectures become the standard.

MoE splits a model into many specialist sub-networks, of which only a subset is activated per token. DeepSeek V3, for example, has 671 billion total parameters but activates just 37 billion per token. The remaining 634 billion must still live somewhere reachable — but housing all of them in HBM is not economically reasonable.

  • Concurrent user batching amplifies the memory footprint.

As discussed earlier, decode requires pooling a large number of concurrent users into a single batch to amortise the cost of repeated weight reads. The larger the batch, the more KV caches must remain live simultaneously, which scales the system-wide memory footprint dramatically.

*KV cache: the model's short-term memory held during generation, which avoids recomputing the entire prior context for every new token.
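The link between context length and memory can be sketched with the usual sizing rule for a conventional attention layer: one key vector and one value vector are stored per token, per layer. The model dimensions below (60 layers, 8 KV heads, head dimension 128, 8-bit values) are hypothetical and do not describe any specific model named in this article; models that use compressed attention will come out far smaller.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 1) -> int:
    """KV cache size for one user under standard (uncompressed) attention."""
    # Factor of 2: a key vector and a value vector per token, per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model shape, for illustration only.
HYPOTHETICAL_MODEL = dict(layers=60, kv_heads=8, head_dim=128, bytes_per_value=1)

for context in (200_000, 1_000_000):
    gb = kv_cache_bytes(context, **HYPOTHETICAL_MODEL) / 1e9
    print(f"{context:>9,} tokens -> {gb:.1f} GB of KV cache per user")
```

For this hypothetical shape the cache grows from roughly 25 GB at 200,000 tokens to roughly 123 GB at 1 million tokens, in direct proportion to context length.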

The combined effect is a near-vertical rise in memory demand. Even with NVIDIA's Rubin packing 288 GB of HBM per GPU, the most aggressive HBM capacity ever shipped at scale, the system cannot hold all of the data in a single tier. HBM is simultaneously hitting its capacity ceiling and its cost ceiling.

The industry's response is not simply to stack more HBM, but to bifurcate memory into a tiered hierarchy:

Table 2: The Emerging Three-Tier Memory Hierarchy for AI Inference

| Tier | Memory Type | Role |
|---|---|---|
| Hot | HBM | Current decode step, active weights, hottest KV cache |
| Warm | LPDDR (SOCAMM) | Overflow KV cache, inactive MoE experts, prefill-stage context |
| Cold | NAND / SSD | Long-term context storage, infrequently accessed data |

Source: SemiAnalysis and iFAST Compilation.

Data as of 26 May 2026.

The warm tier is the critical new addition. LPDDR5X (Low Power DDR — a low-power memory originally widespread in smartphones, now increasingly deployed in data centres for data storage) offers roughly 10x lower bandwidth than HBM but roughly 10x higher capacity per dollar. For data that needs reasonable access speed but does not require HBM's premium throughput, this is an economically attractive zone.
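One way to picture the hierarchy in Table 2 is as a greedy placement problem: put the hottest data in the fastest tier that still has room, and let everything else spill downwards. The sketch below is an illustration only, not a description of any vendor's scheduler. The 288 GB HBM figure is the per-GPU Rubin capacity cited above, and the weight sizes assume DeepSeek V3's 37 billion active and 634 billion inactive parameters stored at one byte each. The LPDDR capacity, the KV cache sizes, the archive size and the hotness scores are all hypothetical.

```python
def place(items, tiers):
    """Assign each item to the fastest tier with enough free capacity.

    items: list of (name, size_gb, hotness), where higher hotness is placed first.
    tiers: list of (tier_name, capacity_gb), ordered fastest first.
    Returns a dict mapping item name to tier name.
    """
    remaining = {tier: capacity for tier, capacity in tiers}
    placement = {}
    for name, size_gb, _hotness in sorted(items, key=lambda it: it[2], reverse=True):
        for tier, _capacity in tiers:
            if remaining[tier] >= size_gb:
                remaining[tier] -= size_gb
                placement[name] = tier
                break
        else:
            raise ValueError(f"no tier can hold {name}")
    return placement


TIERS = [("HBM", 288.0), ("LPDDR", 1024.0), ("NAND", float("inf"))]  # LPDDR size is hypothetical
ITEMS = [
    ("active weights", 37.0, 5),         # 37B parameters at an assumed 1 byte each
    ("hot KV cache", 150.0, 4),          # hypothetical
    ("overflow KV cache", 300.0, 3),     # hypothetical
    ("inactive MoE experts", 634.0, 2),  # 634B parameters at an assumed 1 byte each
    ("archived context", 5000.0, 1),     # hypothetical
]

PLACEMENT = place(ITEMS, TIERS)
for name, tier in PLACEMENT.items():
    print(f"{name:<22} -> {tier}")
```

With these inputs the active weights and hot KV cache land in HBM, the overflow KV cache and inactive experts land in LPDDR, and the archive lands on NAND, mirroring the roles listed in Table 2.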

Memory Supercycle | SOCAMM — NVIDIA's Bet on the Warm Tier

SOCAMM (System-On-Chip Attached Memory Module) is a modular, socketed LPDDR5X memory module designed specifically by NVIDIA for the Rubin platform. In previous generations, including GB300, DRAM was typically soldered onto the board and its cost bundled invisibly into overall system pricing.

SOCAMM breaks that pattern: memory is unbundled into a separately quoted module, installed in a socket. This is not just a technical choice — it is a deliberate commercial design that accomplishes three things at once:

  • Lets NVIDIA unbundle memory from GPU pricing and reprice it independently, raising system gross margin without directly raising the GPU sticker price.
  • Converts memory from a one-time bundled component into an ongoing revenue stream that tracks the LPDDR market.
  • Positions NVIDIA as the gatekeeper between model labs and memory suppliers — an advantage that compounds under supply tightness.

Table 3: NVIDIA May Use SOCAMM2 Flexible Pricing to Build a New Margin Stream

| Metric | Value | Notes |
|---|---|---|
| Q1 2026 procurement cost to NVIDIA (per GB) | ~US$8 | Already a sharp step-up from the prior quarter, driven by LPDDR5X supply tightness across consumer and server markets |
| Exit-2026 forecast price (per GB) | Over US$13 | Approximately 60% cumulative increase through 2026 |
| NVIDIA resale gross margin | ~60% | A new margin stream for NVIDIA |

Source: SemiAnalysis and iFAST Compilation

Data as of 26 May 2026
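The arithmetic behind Table 3 can be sketched in a few lines. The per-GB costs and the ~60% margin come from the table. The resale prices printed here are not quoted anywhere in the source: they are back-solved under the assumption that gross margin is defined as (price - cost) / price and applies uniformly per GB.

```python
def pct_increase(old: float, new: float) -> float:
    """Fractional increase from old to new."""
    return (new - old) / old


def resale_price(cost_per_gb: float, gross_margin: float) -> float:
    """Price per GB that yields the given gross margin on the resale price.

    Assumes gross_margin = (price - cost) / price, so price = cost / (1 - margin).
    """
    if not 0 <= gross_margin < 1:
        raise ValueError("gross_margin must be in [0, 1)")
    return cost_per_gb / (1 - gross_margin)


Q1_COST = 8.0     # ~US$8 per GB, from Table 3
EXIT_COST = 13.0  # "over US$13" per GB, from Table 3; treated here as exactly 13
MARGIN = 0.60     # ~60% resale gross margin, from Table 3

print(f"Cost increase through 2026: {pct_increase(Q1_COST, EXIT_COST):.1%}")
print(f"Implied resale price at Q1 cost:   US${resale_price(Q1_COST, MARGIN):.2f} per GB")
print(f"Implied resale price at exit cost: US${resale_price(EXIT_COST, MARGIN):.2f} per GB")
```

A rise from US$8 to US$13 works out to 62.5%, consistent with the table's "approximately 60%". At a constant 60% margin, every dollar of procurement cost translates into US$2.50 of resale price under this definition.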

NVIDIA controls the majority of the available LPDDR5X supply, and it has explicitly adopted a multi-sourcing strategy. Samsung Electronics, SK Hynix, and Micron Technology are all in volume production of SOCAMM2 for the Vera Rubin platform.

Table 4: SOCAMM2 Supplier Landscape for the Vera Rubin Platform

| Supplier | Module Density | Estimated Share of NVIDIA Supply | Differentiation |
|---|---|---|---|
| Samsung Electronics | 192 GB | ~50% (share leader) | First to solve module warpage; earliest at-scale qualification |
| SK Hynix | 192 GB | #2 | Most advanced process node; mass production began April 2026 |
| Micron Technology | 256 GB | #3 | Only supplier shipping the complete Vera Rubin memory triplet (HBM4 + SOCAMM2 + PCIe 5.0 SSD) at volume |

Source: TrendForce, Samsung, SK Hynix, Micron, SemiAnalysis, and iFAST Compilation.

Data as of 26 May 2026.

Related Article: South Korea’s Memory Duopoly: How Samsung and SK Hynix are reshaping the Korean stock market?

Memory Supercycle | DeepSeek's Techniques Reshape the Memory Market

If SOCAMM is the hardware-side response to tiered memory, DeepSeek's architectural innovations are the model-side response. Through a series of compression mechanisms, DeepSeek has substantially reduced the memory footprint in long-context inference.

But compression does not reduce aggregate demand. It accelerates the migration of memory demand from expensive HBM into the warm-tier LPDDR and cold-tier NAND shown in Table 2, expanding the universe of memory beneficiaries from HBM leaders to LPDDR and NAND suppliers as well.

DeepSeek's compression toolkit consists of five distinct mechanisms, each attacking a different inefficiency in standard attention:

Table 5: DeepSeek's Memory Compression Innovations

| Technique | Function |
|---|---|
| Multi-Head Latent Attention | Compresses the dispersed per-head KV cache into a single shared, condensed representation, sharply reducing redundant storage |
| DeepSeek Sparse Attention | Retains only the genuinely changed or important content, skipping large volumes of redundant or low-value information |
| Compressed Sparse Attention | Compresses the data further on top of the sparse-attention mechanism, lowering the memory footprint again |
| Heavily Compressed Attention | Organises long conversations or documents into tiered summaries, avoiding the need to pack every detail into expensive memory |
| Memory-Compute Trade-off Framework (Engram) | Uses more compute to reduce memory need: in short, “compute a bit more, store a bit less” (data can sit in cheaper LPDDR) |

Source: DeepSeek and iFAST Compilation.

Data as of 26 May 2026.

Stacked together, these mechanisms allow DeepSeek's frontier model to operate at long context lengths with a far smaller HBM footprint.

Table 6: KV Cache Footprint Comparison at 1M-Token Context (8-bit precision)

| Model | Parameters | KV Cache (1M context) |
|---|---|---|
| DeepSeek V4 | 1.6T (MoE) | 5.48 GB |
| GLM5 | ~700B | 60 GB |
| Qwen3-235B-A22B | 235B (MoE) | 89 GB |

Source: kvcache.ai and iFAST Compilation.

Data as of 26 May 2026.
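To put Table 6 in perspective, the sketch below compares each footprint with DeepSeek V4's and counts how many 1M-token sessions would fit in the 288 GB of HBM cited earlier for a single Rubin GPU. This is a deliberately crude upper bound: it assumes the whole of HBM is available for KV cache and ignores model weights, activations and any other buffers, so real capacity would be lower.

```python
HBM_PER_GPU_GB = 288.0  # per-GPU HBM capacity cited in this article for Rubin

# KV cache at 1M-token context and 8-bit precision, from Table 6.
KV_CACHE_GB = {
    "DeepSeek V4": 5.48,
    "GLM5": 60.0,
    "Qwen3-235B-A22B": 89.0,
}


def sessions_that_fit(kv_gb: float, capacity_gb: float = HBM_PER_GPU_GB) -> int:
    """Whole 1M-token sessions that fit if HBM held nothing but KV cache."""
    return int(capacity_gb // kv_gb)


def footprint_vs_baseline(kv_gb: float,
                          baseline_gb: float = KV_CACHE_GB["DeepSeek V4"]) -> float:
    """How many times larger a footprint is than the DeepSeek V4 baseline."""
    return kv_gb / baseline_gb


for model, kv_gb in KV_CACHE_GB.items():
    print(f"{model:<16} {footprint_vs_baseline(kv_gb):5.1f}x baseline, "
          f"{sessions_that_fit(kv_gb):>2} sessions per 288 GB")
```

On these figures, the same HBM that holds three or four long-context sessions for the other two models could hold around fifty for DeepSeek V4, which is why freed capacity tends to be refilled with more users or longer contexts.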

With memory aggressively compressed, aggregate memory demand does not necessarily fall — and may continue to rise.

  • Memory is reallocated to higher-value work.

The freed memory does not stay idle: labs immediately repurpose it for longer documents, multi-step reasoning, or invoking more tools within a single session. Memory demand does not disappear; it is reallocated to higher-value use cases.

  • Compression unlocks workloads that were previously economically infeasible.

Tasks that were prohibitive due to memory footprint — full codebase analysis, multi-document long-context review — become viable at scale once per-user memory drops. As these new workloads come online, aggregate memory demand expands rather than contracts.

  • Compression changes where memory lives, not how much is needed.

Compression and the tiered architecture jointly reshape which layer memory demand lands on. A smaller KV cache lets more active context stay in HBM, while inactive, low-frequency data — overflow KV cache, inactive MoE experts, and long-term context — migrate down to LPDDR and NAND. The total capacity the inference system needs does not disappear; it is redistributed along the memory hierarchy.

Memory Supercycle | The Four Beneficiaries

Along this memory-migration path, beyond the two Korean memory giants we frequently talk about, four companies stand out: NVIDIA, Micron Technology, CXMT, and YMTC.

Table 7: The 2H 2026 Memory Repricing Beneficiary Map

| Name | Repricing Mechanism | 2H 2026 Catalyst | Primary Upside Driver |
|---|---|---|---|
| NVIDIA | SOCAMM2 gatekeeper; ~60% gross margin on memory resell | Rubin VR NVL72 production ramp through 2H 2026 | SOCAMM2 ASP step-up of ~60%; system-level value capture |
| Micron | Only supplier shipping the full Vera Rubin triplet (HBM4 + 256GB SOCAMM2 + PCIe 5.0 SSD) | FY2H quarterly disclosures show LPDDR mix shift | Direct LPDDR ASP exposure; product-breadth differentiation vs Samsung (share leader) and SK Hynix (process parity) |
| CXMT | Sole Chinese-domestic LPDDR5 supplier at scale; canonical partner for MoE LPDDR offload | STAR Market IPO clearance and pricing; Shanghai HBM3 back-end packaging start; DeepSeek V4 domestic deployment | Survival-path positioning under US export controls |
| YMTC | Cold-tier NAND beneficiary of DeepSeek-style KV cache offload to SSD | Formal STAR Market IPO filing in June 2026; Wuhan Phase III fab commissioning by year-end 2026 | Sustained NAND demand driven by AI-agent workloads |

Source: SemiAnalysis, TrendForce, Micron, and iFAST Compilation.

Data as of 26 May 2026.

NVIDIA and Micron — The Western Players

Although NVIDIA does not manufacture memory itself, it captures a premium through its control of the supply chain. Since SOCAMM2 is a separately purchased item, NVIDIA can capture around 60% gross margin on resale. As both Rubin VR NVL72 shipments and LPDDR5X prices are expected to rise in tandem through 2H 2026, SOCAMM2 is positioned to become a meaningful contributor to NVIDIA's profit pool.

Micron benefits in parallel. Samsung leads on supply share, and SK Hynix matches Micron on process and density, but Micron's differentiation lies in product breadth: it is the only supplier shipping the complete Vera Rubin memory triplet (HBM4 + 256GB SOCAMM2 + PCIe 5.0 SSD) at volume.

CXMT and YMTC — The China Memory Frontline

  • CXMT

CXMT is currently the world's fourth-largest DRAM manufacturer and the sole Chinese supplier of LPDDR5 at scale. Under US export controls that block Chinese access to leading-edge HBM, the company is extending into HBM, with roughly 20% of total wafer capacity allocated to HBM3 production and HBM3E targeted for mass production in 2026.

  • YMTC

YMTC is the primary beneficiary of the migration of memory demand into the cold NAND tier. Its two Wuhan fabs already operate at scale, the Phase III fab is on track for commissioning by year-end 2026, and capacity continues to expand. The Xtacking 4.0 architecture is technologically competitive, supporting YMTC's place among the top five global NAND IDMs; Q1 2026 revenue roughly doubled year-on-year.

YMTC's hybrid bonding expertise also carries longer-dated relevance for HBM development. As HBM stack heights rise and bandwidth/thermal requirements intensify, the existing packaging pilot line at subsidiary XMC, together with a potential collaboration with CXMT, keeps open the option for YMTC to participate in HBM stack-related business over time.

Table 8: CXMT and YMTC Financial Snapshot and IPO Timeline

| Metric | CXMT | YMTC |
|---|---|---|
| Memory product | DRAM (LPDDR5; HBM3 in development) | 3D NAND (Xtacking 4.0); HBM packaging via XMC subsidiary |
| FY2025 revenue | RMB 61.8 bn (USD 9.1 bn), +155.6% YoY | Not disclosed at this stage |
| FY2025 net profit | RMB 1.87 bn (first profitable year) | Not disclosed at this stage |
| Q1 2026 revenue | RMB 50.8 bn (USD 7.5 bn), +719.13% YoY | RMB > 20 bn (USD ~2.9 bn), c.2x YoY |
| Q1 2026 net profit | RMB 24.76 bn (USD 3.6 bn), +1,688.30% YoY | Not disclosed at this stage |
| Capacity / share | Utilisation 95.73%; 7.67% global DRAM (4th place) | ~200,000 wpm across two fabs; ~13% global NAND (top-5) |
| IPO venue | Shanghai STAR Market | Shanghai STAR Market |
| IPO milestone | Cleared Listing Review Committee 27 May 2026; pending CSRC registration | IPO tutoring registration complete 19 May 2026; formal filing expected mid-June 2026 |

Source: CXMT, YMTC, Bloomberg, TechNode, BigGo Finance, and iFAST Compilation.

Data as of 26 May 2026.


Memory Supercycle | Investment Implications

For investors who prefer to access these opportunities through a diversified ETF basket, the following products are worth considering.

VanEck Semiconductor ETF (NASDAQ: SMH)

SMH offers the most direct exposure to the Western leg of the SOCAMM2 repricing trade. Through a single, highly liquid ETF, investors gain exposure to both NVIDIA's system-level value capture and Micron's component-level ASP upside.

The Index's forward P/E has risen more than one standard deviation above its historical average, suggesting that much of the good news is already priced in. Investors should avoid chasing aggressively. A better strategy is to wait for valuation pullbacks or broader market corrections, then add exposure gradually.

Graph 1: MVIS US Listed Semiconductor 25 Index

Table 9: MVIS US Listed Semiconductor 25 Index Forecast

| MVIS US Semiconductor 25 Index | 2025A | 2026E | 2027E | 2028E |
|---|---|---|---|---|
| EPS (USD) | 329.0 | 795.4 | 1,097.1 | 1,292.3 |
| Earnings Growth | | 141.8% | 37.9% | 17.8% |
| Implied Valuation (fair P/E 24x) | | | | 31,015 |
| Potential Upside | | | | 26% |

Source: Bloomberg and iFAST Compilation

Data as of 26 May 2026.


Global X China Semiconductor ETF (HKEX: 3191)

Neither CXMT nor YMTC is yet listed, but investors may consider 3191 to participate in the China semiconductor ecosystem. Relative to SMH and 3119, however, the investment logic for 3191 is more event-driven, and its valuation is itself comparatively high. As such, the present moment may not be an appropriate entry point.

Related Article: Global X China Semiconductor ETF: Capturing Domestic Substitution and Memory Re-Rating

Global X Asia Semiconductor ETF (HKEX: 3119)

As noted in Table 4, exposure to the SOCAMM2 trade cannot be captured through Micron alone: more than half of NVIDIA's supply sits with Samsung Electronics and SK Hynix. Both are meaningful holdings in 3119, and the ETF trades at a comparatively reasonable valuation, making it one of the more efficient vehicles for capturing the memory supercycle.

Graph 2: FactSet Asia Semiconductor Index

Table 11: FactSet Asia Semiconductor Index Forecast

| FactSet Asia Semiconductor Index | 2025A | 2026E | 2027E | 2028E |
|---|---|---|---|---|
| EPS (HKD) | 19.3 | 48.5 | 66.1 | 74.7 |
| Earnings Growth | | 151.7% | 36.4% | 12.9% |
| Implied Valuation (fair P/E 18x) | | | | 1,344 |
| Potential Upside | | | | 76% |

Source: Bloomberg and iFAST Compilation.

Data as of 26 May 2026.
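The implied valuations in the two index forecast tables can be reproduced with simple arithmetic: multiply the 2028E EPS by the assumed fair P/E. The sketch below does this for both indices and recomputes the year-on-year earnings growth from the EPS rows. The EPS figures and fair P/E multiples come from the tables. Reading the implied valuation as a 2028E figure is an inference from the arithmetic, not something the tables state explicitly.

```python
def growth_rates(eps_by_year):
    """Year-on-year growth for consecutive EPS figures, as fractions."""
    return [later / earlier - 1 for earlier, later in zip(eps_by_year, eps_by_year[1:])]


def implied_level(eps: float, fair_pe: float) -> float:
    """Index level implied by an EPS forecast and an assumed fair P/E."""
    return eps * fair_pe


# EPS for 2025A, 2026E, 2027E, 2028E and fair P/E, copied from the tables.
INDICES = {
    "MVIS US Listed Semiconductor 25": ([329.0, 795.4, 1097.1, 1292.3], 24),
    "FactSet Asia Semiconductor": ([19.3, 48.5, 66.1, 74.7], 18),
}

for name, (eps, fair_pe) in INDICES.items():
    growth = ", ".join(f"{g:.1%}" for g in growth_rates(eps))
    print(f"{name}: growth {growth}; implied level {implied_level(eps[-1], fair_pe):,.1f}")
```

The MVIS growth figures match the table exactly. The FactSet ones differ by a few tenths of a percentage point, presumably because the published EPS values are rounded. Both implied levels match the tables once decimals are dropped.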


Declaration:

For specific disclosure, at the time of publication of this report, IFPL (via its connected and associated entities) and the analyst who produced this report hold a NIL position in the abovementioned securities.

This research report was prepared with the assistance of artificial intelligence (AI) tools. iFAST Financial Pte Ltd does not rely exclusively on AI for content generation; the content of this report – including all investment theses, ratings, price targets and conclusions – has been independently reviewed and verified by the research analyst(s) to ensure accuracy and professional integrity.

All materials and contents found in this site are strictly for general circulation and informational purposes only and should not be considered as an offer, or solicitation, to deal in any of the funds or products found/identified in this site. While iFAST Financial Pte Ltd ("IFPL") has tried to provide accurate and timely information, there may be inadvertent delays, omissions, technical or factual inaccuracies and typographical errors. Any opinion or estimate contained in this report is made on a general basis and neither IFPL nor any of its servants or agents have given any consideration to nor have they or any of them made any investigation of the investment objective, financial situation or particular need of any user or reader, any specific person or group of persons. You should consider carefully if the products you are going to purchase are suitable for your investment objective, investment experience, risk tolerance and other personal circumstances. If you are uncertain about the suitability of the investment product, please seek advice from a financial adviser, before making a decision to purchase the investment product. Past performance is not indicative of future performance. The value of the investment products and the income from them may fall as well as rise. Opinions expressed herein are subject to change without notice. In respect of any matters arising from, or in connection with the said research analyses or research reports, recipients of the report are to contact IFPL at 10 Collyer Quay, #26-01 Ocean Financial Centre Building, Singapore 049315, or by telephone at +65 6557 2853. 
Where the report contains research analyses or research reports from a foreign research house and if the recipient of such research analyses or research reports is not an accredited investor, expert investor, institutional investor or an ex-accredited investor, IFPL accepts legal responsibility for the contents of such analyses or reports to such persons only to the extent as required by law. Please note that only certain security(ies) herein are available to all investors, while the rest are only available for certain persons to invest in, such as Accredited Investors (as defined in the Securities and Futures Act) or one who invests at least S$200,000 (or its equivalent currency) per transaction. To qualify as an Accredited Investor, one needs to submit a declaration form and certain relevant supporting documents, according to iFAST’s prevailing policies and procedures.

Please read our full disclaimers on the website at ( https://secure.fundsupermart.com/fsmone/policies/328125/investment-account-terms-&-conditions).

iFAST Financial Pte Ltd (IFPL) (registered address: 10 Collyer Quay #26-01 Ocean Financial Centre Singapore 049315, Telephone: 6557 2000) holds the Financial Advisers Licence issued by the Monetary Authority of Singapore ('MAS') to conduct regulated activities of advising on securities, marketing of collective investment schemes and arranging of any contract of insurance in respect of life policies, other than a contract of reinsurance and the Capital Markets Services Licence issued by the MAS to conduct regulated activities of dealing in securities and providing custodial services for securities. While IFPL has made every effort to ensure the independence of the report's contents, IFPL's nature of business is such that IFPL and its connected and associated entities together with their respective directors, officers and staff may be involved in providing dealing or investment-related services in the abovementioned securities, and have taken or may take positions in the securities mentioned in this report, and may also act as the principal for any buy or sell trades.
