- AI inference, long-context processing and AI agents are making memory bandwidth and capacity the next major hardware bottlenecks.
- Token pricing shows that output tokens are far more expensive than input tokens, reflecting the higher memory-reading and capacity requirements during the decode stage.
- Memory demand is expanding beyond HBM into LPDDR and NAND, creating re-rating opportunities across Nvidia, Micron, Samsung, SK Hynix, and China’s memory supply chain.
The AI investment narrative is entering its next chapter. Over the last two years, the market has focused on GPU scarcity, but as inference demand surges, context windows lengthen, and AI agents reach mass adoption, the bottleneck is shifting from raw compute to memory bandwidth and capacity.
At the same time, the consensus that frontier model labs are structural cash incinerators is starting to look out of date. Anthropic's annualised revenue run rate has surged from US$9 billion at the end of 2025 to over US$30 billion by April 2026, and the company is on track for its first quarterly operating profit in Q2 2026.
Once labs earn profits, they have both the ability and the incentive to pay more for the hardware bottleneck supporting that growth. In other words, the repricing of the AI value chain is extending from the model layer to the hardware bottleneck. And the most underrated bottleneck in today's AI hardware stack is not just the GPU — it is memory.
Memory Supercycle | Investigate AI Bottleneck from the Price Tag
To see why memory has become the most important hardware bottleneck of the inference era, the cleanest starting point is the model labs' price tag. Token prices are not just a revenue line — they also reveal the underlying cost asymmetry between different inference phases.
Across the major frontier models in both the US and China, the price of an output token is dramatically higher than the price of an input token — typically by a factor of several times, or even an order of magnitude. Anthropic, for example, charges US$5 per million input tokens for Claude Opus but US$25 per million output tokens — a 5x gap. OpenAI, DeepSeek, Qwen, and Kimi all show the same asymmetric pattern on their published price cards.
Table 1: Token Pricing Asymmetry Across Major Frontier Labs
| Lab / Model | Input (US$ / M tokens) | Output (US$ / M tokens) | Output / Input Ratio |
|---|---|---|---|
| Anthropic / Claude Opus 4.7 | 5.00 | 25.00 | 5x |
| OpenAI / GPT-5.5 | 5.00 | 30.00 | 6x |
| DeepSeek / V4 Pro | 0.435 | 0.87 | 2x |
| Moonshot / Kimi K2.5 | 0.60 | 3.00 | 5x |

Source: Anthropic, OpenAI, DeepSeek, Moonshot, and iFAST Compilation. Data as of 26 May 2026.
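The ratios in Table 1, and what they mean for a single request, can be reproduced with a few lines of Python. This is a minimal sketch that uses only the prices listed in the table; the dictionary layout, the function names, and the example request size are illustrative assumptions rather than anything published by the labs.

```python
# Prices from Table 1, in US$ per million tokens: (input, output).
PRICES = {
    "Claude Opus 4.7": (5.00, 25.00),
    "GPT-5.5": (5.00, 30.00),
    "DeepSeek V4 Pro": (0.435, 0.87),
    "Kimi K2.5": (0.60, 3.00),
}


def output_input_ratio(input_price: float, output_price: float) -> float:
    """How many times more an output token costs than an input token."""
    return output_price / input_price


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in US$ of one request at the listed per-million-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


ratios = {
    model: round(output_input_ratio(inp, out), 2)
    for model, (inp, out) in PRICES.items()
}

for model, ratio in ratios.items():
    print(f"{model}: output/input = {ratio}x")

# Hypothetical request: 10,000 input tokens and 1,000 output tokens.
print(f"Example cost: US${request_cost('Claude Opus 4.7', 10_000, 1_000):.3f}")
```

In the hypothetical request, output tokens are about 9% of the token count but a third of the bill (US$0.025 out of US$0.075), which is the asymmetry the rest of this section sets out to explain.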
To understand this price asymmetry, we must understand how a model lab makes money on inference. The lab's primary cost is GPU rental — every second a GPU runs, it burns cash. The core objective of any lab, therefore, is to serve as many user requests as possible within those rented GPU-hours.
To drive down per-unit cost, the lab pools multiple user requests together and processes them through the GPU simultaneously, a practice known as batching. The more users sharing one GPU pass, the lower the cost per user.
From a cost perspective, GPU inference breaks down into two distinct components:
- Data-fetching cost: time spent loading model weights and any stored context from memory into the GPU's compute cores.
- Compute cost: time spent on the actual mathematical operations — primarily matrix multiplications to predict the next token.
Prefill: Cheap, Because Almost Everything Is Shared
Prefill is the first phase of GPU inference, and it determines the cost of input tokens. When a user submits a prompt — say, a long document — the model processes the entire text in parallel, computing relationships between every pair of tokens in a single pass. Because the input is fixed up front, all of this work can be executed concurrently: the GPU only needs to read model weights and input data once, and both compute and data-fetching costs are sharply reduced.
Decode: Why Output Tokens Are So Expensive
Decode is the second phase of inference, during which the model produces output tokens one at a time. Unlike prefill, decode cannot be parallelised in a single pass — and that is why output tokens are structurally more expensive. Two factors drive this.
1. There is no way to share costs within a single user.
Each generated token depends on the one before it: the second token requires the first to exist, the third requires the second, and so on. Output cannot be processed in parallel, the way prefill can. As a result, the cost of a single weight read is amortised across only one output token.
2. Batching cannot scale indefinitely.
Users expect near-interactive response speeds, so an overly large batch increases the wait time each user experiences and drives up latency. For AI agents, the constraint is sharper still: an agent often executes tens or hundreds of sequential steps, and any per-decode latency penalty compounds rapidly.
In short, even with batching, decode cannot share cost as efficiently as prefill. The matrix math itself is fast; what truly caps throughput is the repeated movement of large volumes of weights and KV cache from memory into the compute cores for every output token. As we argued in “Memory supercycle: A structural shift rewriting supply and demand paradigm”, the AI bottleneck is not just computing — it is memory bandwidth and capacity.
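The prefill and decode argument can be made concrete with a toy model that counts only bytes read from memory and ignores compute time entirely. All inputs below (100 GB of weights, 2 GB of KV cache per user, 4 TB/s of bandwidth, a 10,000-token prompt) are hypothetical values chosen for illustration, and the function names are our own; this is a sketch of the reasoning, not a performance model of any real system.

```python
def prefill_bytes_per_token(weight_bytes: float, prompt_tokens: int) -> float:
    """In prefill, one weight read is shared by every token in the prompt."""
    return weight_bytes / prompt_tokens


def decode_bytes_per_token(weight_bytes: float, kv_bytes_per_user: float,
                           batch_size: int) -> float:
    """Each decode step reads the weights once for the whole batch, plus
    every user's own KV cache, and yields one token per user."""
    step_bytes = weight_bytes + batch_size * kv_bytes_per_user
    return step_bytes / batch_size


def decode_step_seconds(weight_bytes: float, kv_bytes_per_user: float,
                        batch_size: int, bandwidth: float) -> float:
    """Time for one decode step if memory reads are the only limit."""
    return (weight_bytes + batch_size * kv_bytes_per_user) / bandwidth


# Hypothetical inputs, for illustration only.
WEIGHTS = 100e9      # bytes of model weights read per pass
KV_PER_USER = 2e9    # bytes of KV cache per user
BANDWIDTH = 4e12     # bytes per second of memory bandwidth

for batch in (1, 8, 64):
    per_token = decode_bytes_per_token(WEIGHTS, KV_PER_USER, batch)
    latency = decode_step_seconds(WEIGHTS, KV_PER_USER, batch, BANDWIDTH)
    print(f"batch={batch:>2}  GB read per output token={per_token / 1e9:7.2f}  "
          f"step time={latency * 1e3:5.1f} ms")

prefill = prefill_bytes_per_token(WEIGHTS, 10_000)
print(f"prefill, 10,000-token prompt: {prefill / 1e6:.0f} MB read per input token")
```

Under these assumptions, a larger batch lowers the bytes read per output token, but never below the per-user KV cache, while the time per step rises with batch size. That is the trade-off between cost sharing and latency described above.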
Memory Supercycle | From HBM to Three-Tier Memory Hierarchy
As AI use cases continue to scale, labs are running into three forces converging at once, which together push memory demand even harder.
- Context length explosion.
Frontier context windows have stretched from 200,000 tokens in 2024 to over 1 million tokens today. Since the KV cache* scales roughly in proportion to context length, the per-user memory footprint during decode at the full window is now roughly five times what it was in 2024.
- MoE (Mixture-of-Experts) architectures become the standard.
MoE splits a model into many specialist sub-networks, of which only a subset is activated per token. DeepSeek V3, for example, has 671 billion total parameters but activates just 37 billion per token. The remaining 634 billion must still live somewhere reachable — but housing all of them in HBM is not economically reasonable.
- Concurrent user batching amplifies the memory footprint.
As discussed earlier, decode requires pooling a large number of concurrent users into a single batch to amortise the cost of repeated weight reads. The larger the batch, the more KV caches must remain live simultaneously, which scales the system-wide memory footprint dramatically.
*KV cache: the model's short-term memory held during generation, which avoids recomputing the entire prior context for every new token.
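The linear link between context length and KV cache size follows from the textbook sizing formula for standard (uncompressed) attention: two tensors, keys and values, per layer, per KV head, per token. The sketch below applies that formula to a hypothetical model configuration (60 layers, 8 KV heads, head dimension 128, 8-bit values); these numbers are assumptions for illustration and do not describe any model named in this report.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int) -> int:
    """KV cache size for one user under standard attention.

    The factor 2 covers the separate key and value tensors.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical configuration, for illustration only.
CONFIG = dict(layers=60, kv_heads=8, head_dim=128, bytes_per_value=1)

small = kv_cache_bytes(200_000, **CONFIG)
large = kv_cache_bytes(1_000_000, **CONFIG)

print(f"200K-token context: {small / 1e9:.1f} GB per user")
print(f"1M-token context:   {large / 1e9:.1f} GB per user")
print(f"Growth factor:      {large / small:.1f}x")

# MoE arithmetic from the text: parameters that are resident but inactive.
total_params_bn, active_params_bn = 671, 37
print(f"Inactive parameters per token: {total_params_bn - active_params_bn} bn")
```

Because every term other than the token count is fixed by the architecture, moving from a 200,000-token window to a 1-million-token window multiplies the per-user footprint by five, and batching multiplies it again by the number of concurrent users.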
The combined effect is a near-vertical rise in memory demand. Even with NVIDIA's Rubin packing 288 GB of HBM per GPU, the most aggressive HBM capacity ever shipped at scale, the system cannot hold all of the data in a single tier. HBM is simultaneously hitting its capacity ceiling and its cost ceiling.
The industry's response is not simply to stack more HBM, but to split memory into a tiered hierarchy:
Table 2: The Emerging Three-Tier Memory Hierarchy for AI Inference
| Tier | Memory Type | Role |
|---|---|---|
| Hot | HBM | Current decode step, active weights, hottest KV cache |
| Warm | LPDDR (SOCAMM) | Overflow KV cache, inactive MoE experts, prefill-stage context |
| Cold | NAND / SSD | Long-term context storage, infrequently accessed data |

Source: SemiAnalysis and iFAST Compilation. Data as of 26 May 2026.
The warm tier is the critical new addition. LPDDR5X (Low Power DDR — a low-power memory originally widespread in smartphones, now increasingly deployed in data centres for data storage) offers roughly 10x lower bandwidth than HBM but roughly 10x higher capacity per dollar. For data that needs reasonable access speed but does not require HBM's premium throughput, this is an economically attractive zone.
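One way to picture how a tiered hierarchy absorbs data is a greedy placement rule: the most frequently read data takes the fastest tier that still has room, and everything else spills downward. The sketch below is a simplified illustration of that idea, not a description of how NVIDIA or any lab schedules memory. Only the 288 GB HBM figure comes from this report; the LPDDR and NAND capacities, the item sizes, and the read rates are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Item:
    name: str
    size_gb: float
    reads_per_second: float  # hypothetical measure of how hot the data is


def place(items: list[Item], capacities_gb: dict[str, float]) -> dict[str, str]:
    """Greedy placement: hottest data goes to the fastest tier with room left.

    `capacities_gb` must be ordered from fastest to slowest tier.
    """
    free = dict(capacities_gb)
    placement = {}
    for item in sorted(items, key=lambda it: it.reads_per_second, reverse=True):
        for tier in free:
            if item.size_gb <= free[tier]:
                free[tier] -= item.size_gb
                placement[item.name] = tier
                break
        else:
            raise ValueError(f"no tier can hold {item.name}")
    return placement


# HBM capacity is from the text; the other capacities are assumptions.
TIERS = {"HBM": 288, "LPDDR": 1536, "NAND": 30000}

ITEMS = [
    Item("active weights", 150, 1000),
    Item("hottest KV cache", 100, 500),
    Item("overflow KV cache", 400, 50),
    Item("inactive MoE experts", 600, 10),
    Item("long-term context", 5000, 0.1),
]

placement = place(ITEMS, TIERS)
for name, tier in placement.items():
    print(f"{name:<22} -> {tier}")
```

With these inputs, the active weights and hottest KV cache fill HBM, the overflow KV cache and inactive experts land in LPDDR, and long-term context falls to NAND, which mirrors the roles listed in Table 2.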
Memory Supercycle | SOCAMM — NVIDIA's Bet on the Warm Tier
SOCAMM (Small Outline Compression Attached Memory Module) is a modular, socketed LPDDR5X memory module designed specifically by NVIDIA for the Rubin platform. In previous generations, including GB300, DRAM was typically soldered onto the board and its cost bundled invisibly into overall system pricing.
SOCAMM breaks that pattern: memory is unbundled into a separately quoted module, installed in a socket. This is not just a technical choice — it is a deliberate commercial design that accomplishes three things at once:
- Lets NVIDIA unbundle memory from GPU pricing and reprice it independently, raising system gross margin without directly raising the GPU sticker price.
- Converts memory from a one-time bundled component into an ongoing revenue stream that tracks the LPDDR market.
- Positions NVIDIA as the gatekeeper between model labs and memory suppliers — an advantage that compounds under supply tightness.
Table 3: NVIDIA May Use SOCAMM2 Flexible Pricing to Build a New Margin Stream
| Metric | Value | Notes |
|---|---|---|
| Q1 2026 procurement cost to NVIDIA (per GB) | ~US$8 | Already a sharp step-up from the prior quarter, driven by LPDDR5X supply tightness across consumer and server markets |
| Exit-2026 forecast price (per GB) | Over US$13 | Approximately 60% cumulative increase through 2026 |
| NVIDIA resale gross margin | ~60% | A new margin stream for NVIDIA |

Source: SemiAnalysis and iFAST Compilation. Data as of 26 May 2026.
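The figures in Table 3 can be connected with two simple formulas: the percentage increase between two prices, and the resale price implied by a gross margin (gross margin = (price - cost) / price). The sketch below applies them to the table's rounded numbers; treating "over US$13" as exactly US$13 and "~60%" as exactly 60% are simplifying assumptions, and the implied resale price is our own arithmetic rather than a disclosed figure.

```python
def pct_increase(start: float, end: float) -> float:
    """Percentage change from start to end."""
    return (end - start) / start * 100


def resale_price(cost: float, gross_margin: float) -> float:
    """Selling price that yields the given gross margin on the given cost."""
    return cost / (1 - gross_margin)


Q1_COST_PER_GB = 8.0      # ~US$8 per GB, from Table 3
EXIT_PRICE_PER_GB = 13.0  # "over US$13" per GB, treated here as a floor
GROSS_MARGIN = 0.60       # ~60%, from Table 3

increase = pct_increase(Q1_COST_PER_GB, EXIT_PRICE_PER_GB)
implied_price = resale_price(Q1_COST_PER_GB, GROSS_MARGIN)

print(f"Cumulative increase through 2026: at least {increase:.1f}%")
print(f"Implied resale price at Q1 cost:  about US${implied_price:.0f} per GB")
```

A 60% gross margin means the resale price is 2.5 times the procurement cost, so each dollar of LPDDR5X price inflation passes through to roughly US$2.50 of module revenue if the margin holds.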
NVIDIA controls the majority of the available LPDDR5X supply, and it has explicitly adopted a multi-sourcing strategy. Samsung Electronics, SK Hynix, and Micron Technology are all in volume production of SOCAMM2 for the Vera Rubin platform.
Table 4: SOCAMM2 Supplier Landscape for the Vera Rubin Platform
| Supplier | Module Density | Estimated Share of NVIDIA Supply | Differentiation |
|---|---|---|---|
| Samsung Electronics | 192 GB | ~50% (share leader) | First to solve module warpage; earliest at-scale qualification |
| SK Hynix | 192 GB | #2 | Most advanced process node; mass production began April 2026 |
| Micron Technology | 256 GB | #3 | Only supplier shipping the complete Vera Rubin memory triplet (HBM4 + SOCAMM2 + PCIe 5.0 SSD) at volume |

Source: TrendForce, Samsung, SK Hynix, Micron, SemiAnalysis, and iFAST Compilation. Data as of 26 May 2026.
Memory Supercycle | DeepSeek's Techniques Reshape Memory Market
If SOCAMM is the hardware-side response to tiered memory, DeepSeek's architectural innovations are the model-side response. Through a series of compression mechanisms, DeepSeek has substantially reduced the memory footprint in long-context inference.
But compression does not reduce aggregate demand. It accelerates the migration of memory demand from expensive HBM into the warm-tier LPDDR and cold-tier NAND shown in Table 2, expanding the universe of memory beneficiaries from HBM leaders to LPDDR and NAND suppliers as well.
DeepSeek's compression toolkit consists of five distinct mechanisms, each attacking a different inefficiency in standard attention:
Table 5: DeepSeek's Memory Compression Innovations
| Technique | Function |
|---|---|
| Multi-Head Latent Attention | Compresses the dispersed per-head KV cache into a single shared, condensed representation, sharply reducing redundant storage |
| DeepSeek Sparse Attention | Retains only the genuinely changed or important content, skipping large volumes of redundant or low-value information |
| Compressed Sparse Attention | Compresses the data further on top of the sparse-attention mechanism, lowering the memory footprint again |
| Heavily Compressed Attention | Organises long conversations or documents into tiered summaries, avoiding the need to pack every detail into expensive memory |
| Memory-Compute Trade-off Framework (Engram) | Uses more compute to reduce memory need; in short, “compute a bit more, store a bit less” (data can sit in cheaper LPDDR) |

Source: DeepSeek and iFAST Compilation. Data as of 26 May 2026.
Stacked together, these mechanisms allow DeepSeek's frontier model to operate at long context lengths with a far smaller HBM footprint.
Table 6: KV Cache Footprint Comparison at 1M-Token Context (8-bit precision)
| Model | Parameters | KV Cache (1M context) |
|---|---|---|
| DeepSeek V4 | 1.6T (MoE) | 5.48 GB |
| GLM5 | ~700B | 60 GB |
| Qwen3-235B-A22B | 235B (MoE) | 89 GB |

Source: kvcache.ai and iFAST Compilation. Data as of 26 May 2026.
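To put Table 6 in practical terms, the sketch below asks how many 1M-token sessions would fit in the 288 GB of HBM cited earlier for Rubin, if that HBM held nothing but KV cache. That is a deliberately generous assumption, since model weights also occupy HBM, so the results are upper bounds rather than estimates of real deployments. The function name is our own.

```python
# KV cache per 1M-token session at 8-bit precision, from Table 6 (GB).
KV_CACHE_GB = {
    "DeepSeek V4": 5.48,
    "GLM5": 60.0,
    "Qwen3-235B-A22B": 89.0,
}

HBM_GB = 288  # HBM per Rubin GPU, as cited in the text


def sessions_that_fit(kv_gb: float, budget_gb: float = HBM_GB) -> int:
    """Upper bound on concurrent 1M-token sessions, ignoring model weights."""
    return int(budget_gb // kv_gb)


baseline = KV_CACHE_GB["DeepSeek V4"]
for model, kv_gb in KV_CACHE_GB.items():
    print(f"{model:<16} sessions per GPU <= {sessions_that_fit(kv_gb):>2}   "
          f"KV cache vs DeepSeek V4: {kv_gb / baseline:.1f}x")
```

On these numbers the compressed cache supports dozens of long-context sessions per GPU where the uncompressed ones support only three or four, which is why compression tends to be spent on more users and longer contexts rather than banked as savings.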
With memory aggressively compressed, aggregate memory demand does not necessarily fall — and may continue to rise.
- Memory is reallocated to higher-value work.
The freed memory does not stay idle: labs immediately repurpose it for longer documents, multi-step reasoning, or invoking more tools within a single session. Memory demand does not disappear; it is reallocated to higher-value use cases.
- Compression unlocks workloads that were previously economically infeasible.
Tasks that were prohibitive due to memory footprint — full codebase analysis, multi-document long-context review — become viable at scale once per-user memory drops. As these new workloads come online, aggregate memory demand expands rather than contracts.
- Compression changes where memory lives, not how much is needed.
Compression and the tiered architecture jointly reshape which layer memory demand lands on. A smaller KV cache lets more active context stay in HBM, while inactive, low-frequency data (overflow KV cache, inactive MoE experts, and long-term context) migrates down to LPDDR and NAND. The total capacity the inference system needs does not disappear; it is redistributed along the memory hierarchy.
Along this memory-migration path, beyond the two Korean memory giants we frequently talk about, four companies stand out: NVIDIA, Micron Technology, CXMT, and YMTC.
Table 7: The 2H 2026 Memory Repricing Beneficiary Map
| Name | Repricing Mechanism | 2H 2026 Catalyst | Primary Upside Driver |
|---|---|---|---|
| NVIDIA | SOCAMM2 gatekeeper; ~60% gross margin on memory resale | Rubin VR NVL72 production ramp through 2H 2026 | SOCAMM2 ASP step-up of ~60%; system-level value capture |
| Micron | Only supplier shipping the full Vera Rubin triplet (HBM4 + 256GB SOCAMM2 + PCIe 5.0 SSD) | FY2H quarterly disclosures show LPDDR mix shift | Direct LPDDR ASP exposure; product-breadth differentiation vs Samsung (share leader) and SK Hynix (process parity) |
| CXMT | Sole Chinese-domestic LPDDR5 supplier at scale; canonical partner for MoE LPDDR offload | STAR Market IPO clearance and pricing; Shanghai HBM3 back-end packaging start; DeepSeek V4 domestic deployment | Survival-path positioning under US export controls |
| YMTC | Cold-tier NAND beneficiary of DeepSeek-style KV cache offload to SSD | Formal STAR Market IPO filing in June 2026; Wuhan Phase III fab commissioning by year-end 2026 | Sustained NAND demand driven by AI-agent workloads |

Source: SemiAnalysis, TrendForce, Micron, and iFAST Compilation. Data as of 26 May 2026.
NVIDIA and Micron — The Western Players
Although NVIDIA does not manufacture memory itself, it captures a premium through its control of the supply chain. Since SOCAMM2 is a separately purchased item, NVIDIA can capture around 60% gross margin on resale. As both Rubin VR NVL72 shipments and LPDDR5X prices are expected to rise in tandem through 2H 2026, SOCAMM2 is positioned to become a meaningful contributor to NVIDIA's profit pool.
Micron benefits in parallel. Samsung leads on supply share, and SK Hynix matches Micron on process and density, but Micron's differentiation lies in product breadth: it is the only supplier shipping the complete Vera Rubin memory triplet (HBM4 + 256GB SOCAMM2 + PCIe 5.0 SSD) at volume.
- CXMT
CXMT is currently the world's fourth-largest DRAM manufacturer and the sole Chinese supplier of LPDDR5 at scale. Under US export controls that block Chinese access to leading-edge HBM, the company is extending into HBM, with roughly 20% of total wafer capacity allocated to HBM3 production and HBM3E targeted for mass production in 2026.
- YMTC
YMTC is the primary beneficiary of the migration of memory demand into the cold NAND tier. Its two Wuhan fabs already operate at scale, the Phase III fab is on track for commissioning by year-end 2026, and capacity continues to expand. The Xtacking 4.0 architecture is technologically competitive, supporting YMTC's place among the top five global NAND IDMs; Q1 2026 revenue roughly doubled year-on-year.
YMTC's hybrid bonding expertise also carries longer-dated relevance for HBM development. As HBM stack heights rise and bandwidth/thermal requirements intensify, the existing packaging pilot line at subsidiary XMC, together with a potential collaboration with CXMT, keeps open the option for YMTC to participate in HBM stack-related business over time.
Table 8: CXMT and YMTC Financial Snapshot and IPO Timeline
| Metric | CXMT | YMTC |
|---|---|---|
| Memory product | DRAM (LPDDR5; HBM3 in development) | 3D NAND (Xtacking 4.0); HBM packaging via XMC subsidiary |
| FY2025 revenue | RMB 61.8 bn (USD 9.1 bn), +155.6% YoY | Not disclosed at this stage |
| FY2025 net profit | RMB 1.87 bn (first profitable year) | Not disclosed at this stage |
| Q1 2026 revenue | RMB 50.8 bn (USD 7.5 bn), +719.13% YoY | RMB > 20 bn (USD ~2.9 bn), c.2x YoY |
| Q1 2026 net profit | RMB 24.76 bn (USD 3.6 bn), +1,688.30% YoY | Not disclosed at this stage |
| Capacity / share | Utilisation 95.73%; 7.67% global DRAM (4th place) | ~200,000 wpm across two fabs; ~13% global NAND (top-5) |
| IPO venue | Shanghai STAR Market | Shanghai STAR Market |
| IPO milestone | Cleared Listing Review Committee 27 May 2026; pending CSRC registration | IPO tutoring registration complete 19 May 2026; formal filing expected mid-June 2026 |

Source: CXMT, YMTC, Bloomberg, TechNode, BigGo Finance, and iFAST Compilation. Data as of 26 May 2026.
Memory Supercycle | Investment Implications
For investors who prefer to access these opportunities through a diversified ETF basket, the following products are worth considering.
VanEck Semiconductor ETF (NASDAQ: SMH)
SMH offers the most direct exposure to the Western leg of the SOCAMM2 repricing trade. Through a single, highly liquid ETF, investors gain both NVIDIA's system-level value capture and Micron's component-level ASP upside exposure.
The forward P/E of the Index has risen to more than one standard deviation above its historical average, suggesting that much of the good news has already been priced in. Investors should avoid chasing aggressively. A better strategy is to wait for valuation pullbacks or broader market corrections, then add exposure gradually.
Graph 1: MVIS US Listed Semiconductor 25 Index

Table 9: MVIS US Listed Semiconductor 25 Index Forecast
| MVIS US Semiconductor 25 Index | 2025A | 2026E | 2027E | 2028E |
|---|---|---|---|---|
| EPS (USD) | 329.0 | 795.4 | 1,097.1 | 1,292.3 |
| Earnings Growth | | 141.8% | 37.9% | 17.8% |

Implied Valuation (fair P/E 24x): 31,015
Potential Upside: 26%
Source: Bloomberg and iFAST Compilation. Data as of 26 May 2026.
Global X China Semiconductor ETF (HKEX: 3191)
Neither CXMT nor YMTC is yet listed, but investors may consider 3191 as a way to participate in the China semiconductor ecosystem. Relative to SMH and the Global X Asia Semiconductor ETF (3119), however, the investment logic for 3191 is more event-driven and its valuation is comparatively high. As such, now may not be an appropriate entry point.
Global X Asia Semiconductor ETF (HKEX: 3119)
As Table 4 shows, exposure to the SOCAMM2 trade cannot be captured through Micron alone: Samsung Electronics accounts for roughly half of NVIDIA's supply, with SK Hynix second. Both are meaningful holdings in 3119, and the ETF trades at comparatively reasonable valuations, making it one of the more efficient vehicles for capturing the memory supercycle.
Graph 2: FactSet Asia Semiconductor Index

Table 10: FactSet Asia Semiconductor Index Forecast
| FactSet Asia Semiconductor Index | 2025A | 2026E | 2027E | 2028E |
|---|---|---|---|---|
| EPS (HKD) | 19.3 | 48.5 | 66.1 | 74.7 |
| Earnings Growth | | 151.7% | 36.4% | 12.9% |

Implied Valuation (fair P/E 18x): 1,344
Potential Upside: 76%
Source: Bloomberg and iFAST Compilation. Data as of 26 May 2026.
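The two index forecast tables above can be cross-checked with simple arithmetic: year-on-year EPS growth, and an implied valuation equal to the fair P/E multiplied by forward EPS. The sketch below uses only the EPS figures and fair P/E multiples shown in the tables. The tables do not say which year's EPS the implied valuation is struck on; applying the multiple to the 2028E figure is our assumption, adopted because it reproduces the published values to within rounding.

```python
def growth_pct(previous_eps: float, current_eps: float) -> float:
    """Year-on-year EPS growth in percent."""
    return (current_eps / previous_eps - 1) * 100


def implied_valuation(eps: float, fair_pe: float) -> float:
    """Index level implied by a fair P/E multiple on the given EPS."""
    return eps * fair_pe


# EPS for 2025A, 2026E, 2027E, 2028E and the fair P/E, from the tables.
INDEX_FORECASTS = {
    "MVIS US Listed Semiconductor 25": {
        "eps": [329.0, 795.4, 1097.1, 1292.3],
        "fair_pe": 24,
    },
    "FactSet Asia Semiconductor": {
        "eps": [19.3, 48.5, 66.1, 74.7],
        "fair_pe": 18,
    },
}

results = {}
for name, data in INDEX_FORECASTS.items():
    eps = data["eps"]
    growth = [growth_pct(prev, curr) for prev, curr in zip(eps, eps[1:])]
    # Assumption: the fair P/E is applied to the final (2028E) EPS.
    target = implied_valuation(eps[-1], data["fair_pe"])
    results[name] = {"growth": growth, "target": target}
    formatted = ", ".join(f"{g:.1f}%" for g in growth)
    print(f"{name}: growth {formatted}; implied valuation {target:,.1f}")
```

Growth rates recomputed from the rounded EPS figures can differ from the published ones by a few tenths of a percentage point, most visibly for the Asia index, which is consistent with the published rates having been calculated on unrounded data.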
Declaration:
For specific disclosure, at the time of publication of this report, IFPL (via its connected and associated entities) and the analyst who produced this report hold a NIL position in the abovementioned securities.
This research report was prepared with the assistance of artificial intelligence (AI) tools. iFAST Financial Pte Ltd does not rely exclusively on AI for content generation; the content of this report – including all investment theses, ratings, price targets and conclusions – has been independently reviewed and verified by the research analyst(s) to ensure accuracy and professional integrity.
All materials and contents found in this site are strictly for general circulation and informational purposes only and should not be considered as an offer, or solicitation, to deal in any of the funds or products found/identified in this site. While iFAST Financial Pte Ltd ("IFPL") has tried to provide accurate and timely information, there may be inadvertent delays, omissions, technical or factual inaccuracies and typographical errors. Any opinion or estimate contained in this report is made on a general basis and neither IFPL nor any of its servants or agents have given any consideration to nor have they or any of them made any investigation of the investment objective, financial situation or particular need of any user or reader, any specific person or group of persons. You should consider carefully if the products you are going to purchase are suitable for your investment objective, investment experience, risk tolerance and other personal circumstances. If you are uncertain about the suitability of the investment product, please seek advice from a financial adviser, before making a decision to purchase the investment product. Past performance is not indicative of future performance. The value of the investment products and the income from them may fall as well as rise. Opinions expressed herein are subject to change without notice. In respect of any matters arising from, or in connection with the said research analyses or research reports, recipients of the report are to contact IFPL at 10 Collyer Quay, #26-01 Ocean Financial Centre Building, Singapore 049315, or by telephone at +65 6557 2853. 
Where the report contains research analyses or research reports from a foreign research house and if the recipient of such research analyses or research reports is not an accredited investor, expert investor, institutional investor or an ex-accredited investor, IFPL accepts legal responsibility for the contents of such analyses or reports to such persons only to the extent as required by law. Please note that only certain security(ies) herein are available to all investors, while the rest are only available for certain persons to invest in, such as Accredited Investors (as defined in the Securities and Futures Act) or one who invests at least S$200,000 (or its equivalent currency) per transaction. To qualify as an Accredited Investor, one needs to submit a declaration form and certain relevant supporting documents, according to iFAST’s prevailing policies and procedures.
Please read our full disclaimers on the website at ( https://secure.fundsupermart.com/fsmone/policies/328125/investment-account-terms-&-conditions).
iFAST Financial Pte Ltd (IFPL) (registered address: 10 Collyer Quay #26-01 Ocean Financial Centre Singapore 049315, Telephone: 6557 2000) holds the Financial Advisers Licence issued by the Monetary Authority of Singapore ('MAS') to conduct regulated activities of advising on securities, marketing of collective investment schemes and arranging of any contract of insurance in respect of life policies, other than a contract of reinsurance and the Capital Markets Services Licence issued by the MAS to conduct regulated activities of dealing in securities and providing custodial services for securities. While IFPL has made every effort to ensure the independence of the report's contents, IFPL's nature of business is such that IFPL and its connected and associated entities together with their respective directors, officers and staff may be involved in providing dealing or investment-related services in the abovementioned securities, and have taken or may take positions in the securities mentioned in this report, and may also act as the principal for any buy or sell trades.
