## **Overview of Fugaku and its Future Perspectives**

Satoshi Matsuoka, Director, Riken R-CCS

ASCAC Presentation, 29 July 2021






## What is an 'Exascale' Supercomputer?

## 1. FP64 Performance > 1 Exaflop (EF)

1.1. Achieve Rpeak (FP64) > 1 EF

1.2. Achieve Top500 – Linpack Rmax > 1 EF

- Fugaku Rmax = 0.442 EF, Rpeak = 0.537 EF -> NG
- However, Linpack correlates very little with real apps; the criterion is largely symbolic

## 2. Any floating point precision performance > 1 Exaflop

2.1. Peak FP performance > 1 EF

2.2. Measured performance in credible app or benchmark

- Fugaku FP32, FP16 Peak, HPL-AI (2EF) > 1 Exaflop -> OK!
- However, ORNL Summit: FP16 Peak ~= 3 EF, GB2018 App ~= 2EF

## 3. Real apps ~= 50~100x the 10~20 PF supercomputers (SCs) of 2011~12

- Fugaku ~70x c.f. K (11PF Rmax) on 9 target apps
- "Applications First" -> The most important metric
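
The three definitions above can be checked mechanically. The following is a minimal sketch using only the figures quoted on this slide; the dictionary keys, function names, and the choice of 50x as the lower bound for definition 3 are illustrative assumptions, not part of the original presentation.

```python
# Figures quoted on the slide, in exaflops (EF).
FUGAKU = {"rpeak_fp64": 0.537, "rmax_fp64": 0.442, "hpl_ai": 2.0}


def exascale_checks(machine, threshold_ef=1.0):
    """Evaluate definitions 1 and 2 of 'exascale' for one machine."""
    return {
        "1.1 FP64 Rpeak": machine["rpeak_fp64"] > threshold_ef,
        "1.2 FP64 Linpack Rmax": machine["rmax_fp64"] > threshold_ef,
        "2.2 measured reduced-precision benchmark": machine["hpl_ai"] > threshold_ef,
    }


def meets_app_speedup(speedup, lower_bound=50.0):
    """Definition 3: real-app speedup over a 2011~12 machine (50x assumed as the bar)."""
    return speedup >= lower_bound


if __name__ == "__main__":
    for name, ok in exascale_checks(FUGAKU).items():
        print(f"{name}: {'OK' if ok else 'NG'}")
    # ~70x over K on the 9 target apps, as quoted above
    print("3. real apps:", "OK" if meets_app_speedup(70) else "NG")
```

Under these inputs Fugaku fails both FP64 criteria but passes the reduced-precision and application criteria, which is the point the slide makes.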









## Fugaku: Largest & Fastest Supercomputer Ever



'Applications First' R&D Challenge--- High Risk "Moonshot" R&D

A new high performance & low power Arm <u>A64FX CPU</u> co-developed by Riken R-CCS & Fujitsu along with nationwide HPC researchers as a <u>National Flagship 2020</u> project

"Moonshot"

R&D Target




- 3x perf c.f. top CPU in HPC apps
- 3x power efficiency c.f. top CPU
- General purpose Arm CPU, runs the same programs as smartphones
- Acceleration features for AI

#### Fugaku x 2~3 = Entire annual IT in Japan

|              | Smartphones                                  |   | Servers (incl.<br>IDC)                                      |   | Fugaku                   | K<br>Computer                                          |
|--------------|----------------------------------------------|---|-------------------------------------------------------------|---|--------------------------|--------------------------------------------------------|
| Units        | 20 million<br>(~annual shipment<br>in Japan) | = | <b>300,000</b><br>(~annual<br>shipment in Japan)            | = | <b>1</b><br>(160K nodes) | Max 120                                                |
| Power<br>(W) | 10 W × 20 million units =<br>200 MW          | = | 600-700 W × 300,000 units =<br>200 MW<br>(incl. cooling)    | > | 30 MW<br>(very low)      | 15 MW<br>(less than 1/10<br>efficiency c.f.<br>Fugaku) |

#### Developed via extensive co-design

"Science of Computing" By Riken & Fujitsu & HPCI Centers, etc., Arm Ecosystem, Reflecting numerous research results





#### "Science by Computing"

"9 Priority Areas" to develop target applications to tackle important societal problems



- Advanced applications **co-design program** run in parallel with Fugaku R&D
- Select one representative app from each of the 9 priority areas, including:
  - Health & Medicine
  - Environment & Disaster (**SDGs**)
  - Energy
  - Materials & Manufacturing
  - Basic Sciences
- Up to 100x speedup c.f. K-Computer => achieved!

## A64FX CPU for supercomputers



All-in-one 7nm SoC w/ low power consumption

- Armv8.2-A, 512-bit SVE (Scalable Vector Extension)
- Four HBM2, 32 GiB per package
- Tofu Interconnect D integrated
- HW inter-core barrier & sector cache
- 48 compute cores & 4 assistant cores (for OS daemons & MPI offload)

| CPU core frequency    | 1.8 | 2.0  | 2.2 | GHz    |
|-----------------------|-----|------|-----|--------|
| Peak DP perf (FP64)   | 2.7 | 3.0  | 3.3 | TFLOPS |
| Peak SP perf (FP32)   | 5.5 | 6.1  | 6.7 | TFLOPS |
| Peak HP perf (FP16)   | 11  | 12   | 13  | TFLOPS |
| Memory peak bandwidth |     | 1024 |     | GB/s   |



A64FX w/o LID
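
The peak figures in the table above can be reproduced from the core count and vector width. This is a rough sketch, assuming each of the 48 compute cores has two 512-bit SVE pipelines that each retire one fused multiply-add (2 flops) per lane per cycle; the two-pipeline figure appears elsewhere in this deck, while the FMA-per-cycle assumption and all names below are ours.

```python
CORES = 48            # compute cores per A64FX
PIPES = 2             # 512-bit SVE pipelines per core
VECTOR_BITS = 512
FLOPS_PER_FMA = 2     # assumption: one fused multiply-add per lane per cycle


def peak_tflops(freq_ghz, element_bits):
    """Theoretical peak of one A64FX in TFLOPS for a given element width."""
    lanes = VECTOR_BITS // element_bits
    flops_per_cycle = CORES * PIPES * lanes * FLOPS_PER_FMA
    return flops_per_cycle * freq_ghz / 1000.0


if __name__ == "__main__":
    for freq in (1.8, 2.0, 2.2):
        row = ", ".join(f"FP{b}: {peak_tflops(freq, b):.2f}" for b in (64, 32, 16))
        print(f"{freq} GHz -> {row} TFLOPS")
    # Memory bytes per FP64 flop at 2.0 GHz, using the 1024 GB/s from the table
    print("bytes/flop:", 1024 / (peak_tflops(2.0, 64) * 1000))
```

The computed values are slightly higher than the table entries (for example about 3.07 versus 3.0 TFLOPS at 2.0 GHz), which suggests the table truncates rather than rounds.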

## **Fugaku: World Leading Advanced IT (not just SC)**

- CPU: Highest performing general purpose CPU for high-end computing
  - First server CPU w/7nm process
  - 3x faster c.f. latest CPUs from US competitors w/SVE & HBM2, etc.
  - 3x power efficient -> GPU-class power efficiency
  - Arm v8.2 ISA compliant (own µ-architecture) => e.g. RHEL works out of the box
- Network/Interconnect: highest bandwidth & lowest latency (Tofu-D)
  - 400Gbps-class network/node, 0.5µs latency (c.f. IDC 10~100Gbps, 10~100µs latency)
  - First server CPU w/ on-die NIC & switch => 160K nodes interconnected w/o external switch, 1.6 million switch ports, > 100K AoC cables
  - ~6 PetaByte/s injection bandwidth => 10x aggregate GAFAM IDCs traffic

#### System Architecture => World's first ultra-scale disaggregated architecture

• CPU cores (esp. L2 cache), memory (HBM2) and NIC are all connected via an on-chip network with multiple DMACs => any memory region in the system of 160K nodes is accessible by any CPU via RDMA and can be injected into the on-die L2 cache w/ sub-µs latency

## Fugaku's Fujitsu A64FX Processor is…

• a many-core Arm CPU…

- 48 compute cores + 2 or 4 assistant (OS) cores
- Brand new core design
- Near Xeon-Class Integer performance core
- ARM V8 --- 64bit ARM ecosystem
- Tofu-D + PCIe 3 external connection
- …but also an accelerated GPU-like processor
  - SVE 512 bit x 2 vector extensions (ARM & Fujitsu)
    - Integer (1, 2, 4, 8 bytes) + Float (16, 32, 64 bits)
  - Cache + memory localization (sector cache)
  - HBM2 on-package memory: massive memory BW (Bytes/DP flop ~0.4)
    - Streaming memory access, strided access, scatter/gather etc.
  - Intra-chip barrier synch. and other memory enhancing features
- GPU-like performance in real-world HPC especially CFD-- Weather & Climate (even w/traditional Fortran code) + AI/Big Data





## A64FX CPU power efficiency for real apps


- Performance /Energy consumption on an A64FX @ 2.2GHz
- Up to 3.7x more efficient over the latest x86 processor (24core, 2.9GHz) x2
- High efficiency is achieved by energy-conscious design and implementation







- Tofu-D logic Embedded into CPU die
- 25mm<sup>2</sup> die area (~6% of entire die)
- Power: 8~9W (incl. SerDes&AOC, very low power c.f. 100GbE, EDR/HDR IB @ 25-30W)
  - Constant irrespective of state
  - ~ 4~5 % of entire node
- Directly connected to on-chip torus network
  - No I/O bus in between, e.g. PCI-E
  - Direct DMAC access to L2 cache
- 6-D torus router switch + DMAC
  - ~160,000 low-dimension switches on Fugaku
  - ~1.6 million ports total
- CPU, Memory, and Tofu-D directly connected to on-chip Xbar & NW => disaggregated architecture



## Fugaku Tofu-D Performance


#### 8B Put transfer between nodes on the same board

|          | Communication settings    | Latency |
|----------|---------------------------|---------|
| Tofu1(K) | Descriptor on main memory | 1.15 µs |
|          | Direct Descriptor         | 0.91 µs |
| TofuD    | To/From far CMGs          | 0.54 μs |
|          | To/From near CMGs         | 0.49 μs |

C.f. 100GbE in IDC Latency 10~100µs

#### Total Injection Bandwidth

|              | Injection rate | Efficiency  |
|--------------|----------------|-------------|
| Tofu1 (K)    | 15.0 GB/s      | 77 %        |
| Tofu1 (FX10) | 17.6 GB/s      | 88 %        |
| TofuD        | 38.1 GB/s      | <b>93 %</b> |

C.f. 100GbE in IDC: bandwidth ~10 GB/s
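
The injection figures above connect to the "~6 PetaByte/s" quoted earlier in the deck. The sketch below assumes that the system-wide figure is simply the measured per-node TofuD rate multiplied by the 158,976 nodes given in the system configuration, and that efficiency means measured rate divided by peak rate; both readings are our assumptions.

```python
NODES = 158_976  # total Fugaku nodes, as quoted in the system configuration

# (measured injection rate in GB/s, efficiency) per interconnect generation
INJECTION = {
    "Tofu1 (K)": (15.0, 0.77),
    "Tofu1 (FX10)": (17.6, 0.88),
    "TofuD": (38.1, 0.93),
}


def implied_peak_gbs(name):
    """Peak per-node injection rate implied by measured rate / efficiency."""
    rate, efficiency = INJECTION[name]
    return rate / efficiency


def aggregate_injection_pbs(name, nodes=NODES):
    """System-wide injection bandwidth in PB/s (1 PB = 1e6 GB)."""
    rate, _ = INJECTION[name]
    return rate * nodes / 1e6


if __name__ == "__main__":
    for name in INJECTION:
        print(f"{name}: implied peak {implied_peak_gbs(name):.1f} GB/s per node")
    print(f"TofuD aggregate: {aggregate_injection_pbs('TofuD'):.2f} PB/s")
```

With these inputs the TofuD aggregate comes out at roughly 6 PB/s, consistent with the headline number.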



## **Disaggregated Architecture of A64FX**



- Any CPU can access any memory in system via RDMA (TNI) to its L2
  - Entire 160K Fugaku Nodes
  - Sub microsecond latency
  - NOC + Tofu-D NW Switch on every node (on-die)

CMG Configuration (13 cores + L2 + MC=>HBM2)





HBM2=>NoC=>TNI=>SW...AoC...SW=>TNI=>NoC=>L2&HBM2



## Fugaku Total System Config & Performance



- Total # Nodes: 158,976 nodes
  - 384 nodes/rack x 396 (full) racks = 152,064 nodes
  - 192 nodes/rack x 36 (half) racks = 6,912 nodes
  - c.f. K Computer: 88,128 nodes
- Theoretical Peak Compute Performances
  - Normal Mode (CPU Frequency 2GHz)
    - 64 bit Double Precision FP: 488 Petaflops
    - 32 bit Single Precision FP: 977 Petaflops
    - 16 bit Half Precision FP (AI training): 1.95 Exaflops
    - 8 bit Integer (AI Inference): 3.90 Exaops
  - Boost Mode (CPU Frequency 2.2GHz)
    - 64 bit Double Precision FP: 537 Petaflops
    - 32 bit Single Precision FP: 1.07 Exaflops
    - 16 bit Half Precision FP (AI training): 2.15 Exaflops
    - 8 bit Integer (AI Inference): 4.30 Exaops
- Theoretical Peak Memory Bandwidth: 163 Petabytes/s
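
The system totals above follow from the rack layout and the per-node figures. This is a minimal sketch, assuming 48 cores x 2 SVE pipelines x 8 FP64 lanes x 2 flops (FMA) per cycle per node, with throughput doubling each time the element width halves (including 8-bit integer); those per-node assumptions and all names are ours.

```python
FULL_RACKS, NODES_PER_FULL_RACK = 396, 384
HALF_RACKS, NODES_PER_HALF_RACK = 36, 192
NODES = FULL_RACKS * NODES_PER_FULL_RACK + HALF_RACKS * NODES_PER_HALF_RACK

# Assumed per-node FP64 operations per cycle: cores x pipes x lanes x FMA
NODE_FP64_OPS_PER_CYCLE = 48 * 2 * 8 * 2
NODE_MEM_BW_GBS = 1024  # per-node HBM2 peak bandwidth in GB/s


def system_peak_peta_ops(freq_ghz, element_bits=64):
    """System peak in peta-operations/s for a given clock and element width."""
    per_node_giga_ops = NODE_FP64_OPS_PER_CYCLE * (64 // element_bits) * freq_ghz
    return NODES * per_node_giga_ops / 1e6


def system_mem_bw_pbs():
    """System peak memory bandwidth in PB/s."""
    return NODES * NODE_MEM_BW_GBS / 1e6


if __name__ == "__main__":
    print("nodes:", NODES)
    for mode, freq in (("normal", 2.0), ("boost", 2.2)):
        for bits in (64, 32, 16, 8):
            print(f"{mode} {bits}-bit: {system_peak_peta_ops(freq, bits):.0f} peta-ops/s")
    print(f"memory bandwidth: {system_mem_bw_pbs():.0f} PB/s")
```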



#### <u>C.f. K Computer performance comparison (Boost)</u>

- 64 bit Double Precision FP: 48x
- 32 bit Single Precision: 95x
- 16 bit Half Precision (AI training): 190x
  - K Computer Theoretical Peak: 11.28 PF for all precisions
- 8 bit Integer (AI Inference): > 1,500x
  - K Computer Theoretical Peak: 2.82 Petaops (64 bits)
- Theoretical Peak Memory Bandwidth: 29x
  - K Computer Theoretical Peak: 5.64 Petabytes/s
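
The ratios in this comparison can be recomputed from the peak figures quoted on the slide. The sketch below uses the rounded boost-mode values, so results may differ from the slide by about one unit in the last digit; the dictionary layout is an illustrative assumption.

```python
# Fugaku boost-mode peaks as quoted above (PF, peta-ops/s, PB/s)
FUGAKU_BOOST = {"fp64": 537.0, "fp32": 1070.0, "fp16": 2150.0, "int8": 4300.0, "mem_bw": 163.0}

# K Computer peaks: 11.28 PF for all FP precisions, 2.82 Petaops integer, 5.64 PB/s
K_COMPUTER = {"fp64": 11.28, "fp32": 11.28, "fp16": 11.28, "int8": 2.82, "mem_bw": 5.64}


def speedup_over_k(metric):
    """Ratio of Fugaku (boost) to K Computer for one metric."""
    return FUGAKU_BOOST[metric] / K_COMPUTER[metric]


if __name__ == "__main__":
    for metric in FUGAKU_BOOST:
        print(f"{metric}: {speedup_over_k(metric):.1f}x")
```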



#### Fugaku HPC+Big Data+AI+Cloud 'Converged' Software Stack





## Standard Software Ecosystem & OSS Contributions



Fully compliant with Arm v8.2 + SVE and other server standards

- Standard Linux distributions work out of the box, as do most Cloud, HPC, and BD OSSs
- Standardized configurations via frameworks (e.g., oneAPI, Spack), VMs, Containers
- High Performance AI being developed w/ oneDNN & others





Most Software on x86 HPC Clusters & Clouds Simply Works on Fugaku

### **Cloud Service Providers Partnership**

https://www.r-ccs.riken.jp/library/topics/200213.html (in Japanese)



#### Action Items

- Cool Project name and logo!
- Trial methods to provide computing resources of Fugaku to end-users via service providers
- Evaluate the effectiveness of the methods as quantitatively as possible and organize the issues
- The knowledge gained will be fed back into the government's scheme design for Fugaku

## **Fugaku HPL-AI Results Comparisons (update Jun 2021)**

- Compute units utilized (FP16)
  - A64FX: 32-element vector FP16 & FP64 mixed precision
  - GPUs: FP16 Matrix Engine (Tensor Core) & FP64 mixed precision
- FP16 vast difference in efficiency, while FP64 efficiency similar
- See our latest paper "Matrix Engines for High Performance Computing: A Paragon of Performance or Grasping at Straws?" [IEEE IPDPS 2021] https://arxiv.org/abs/2010.14373
- We will also release our code as OSS soon, aiming for it to become a standard like HPL

|           | Main<br>Processor | HPL-AI<br>Measured<br>Performance | FP16 Peak<br>Performance<br>(full machine) | Efficiency | HPL-AI<br>Performance<br>/Chip | Top500 /Linpack<br>FP64 Measured<br>Performance | FP64 Peak<br>Performance | Efficiency |
|-----------|-------------------|-----------------------------------|--------------------------------------------|------------|--------------------------------|-------------------------------------------------|--------------------------|------------|
| 1. Fugaku | Fujitsu<br>A64FX  | 2.00 EF                           | 2.14 EF                                    | 93.2%      | 12.6TF                         | 442.01 PF                                       | 537.21 PF                | 82.3%      |
| 2. Summit | NVIDIA<br>V100    | 1.15 EF                           | 3.46 EF                                    | 33.2%      | 42.6TF                         | 148.60PF                                        | 200.79 PF                | 74.0%      |
| 3. Selene | NVIDIA<br>A100    | 0.63 EF                           | 1.40 EF                                    | 45.0%      | 140.6TF                        | 63.46 PF                                        | 79.22 PF                 | 80.1%      |

Note: Selene node count based on prerelease info
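
The efficiency columns in the table above are measured performance divided by peak performance. The following sketch recomputes them from the table's own rounded entries; the tuple layout and function names are illustrative assumptions. Because the HPL-AI inputs are rounded to two decimals, the Fugaku HPL-AI efficiency recomputed this way lands slightly above the 93.2% shown.

```python
# name: (HPL-AI measured EF, FP16 peak EF, HPL measured PF, FP64 peak PF)
RESULTS = {
    "Fugaku": (2.00, 2.14, 442.01, 537.21),
    "Summit": (1.15, 3.46, 148.60, 200.79),
    "Selene": (0.63, 1.40, 63.46, 79.22),
}


def hpl_ai_efficiency(name):
    """HPL-AI measured / FP16 peak, in percent."""
    measured, peak, _, _ = RESULTS[name]
    return 100.0 * measured / peak


def hpl_efficiency(name):
    """Top500 Linpack measured / FP64 peak, in percent."""
    _, _, measured, peak = RESULTS[name]
    return 100.0 * measured / peak


if __name__ == "__main__":
    for name in RESULTS:
        print(f"{name}: HPL-AI {hpl_ai_efficiency(name):.1f}%, HPL {hpl_efficiency(name):.1f}%")
```

The output makes the slide's point visible: FP64 efficiency is similar across the three systems, while FP16 efficiency differs by a factor of two to three.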

## Development of DL software stack for Arm SVE



#### Fujitsu

Framework & oneDNN porting & tuning

Naoki Shinjo, Akira Asato, Atsushi
Ike, Koutarou Okazaki, Yoshihiko
Oguchi, Masahiro Doteguchi, Jin
Takahashi, Kazutoshi Akao, Masaya
Kato, Takashi Sawada,
Naoto Fukumoto, Kentaro Kawakami,
Naoki Sueyasu, Kouji Kurihara,
Masafumi Yamazaki, Takumi Honda



**Fugaku AI project** Signed on Nov. 25, 2019



#### **Tuning for Fugaku**

- Satoshi Matsuoka, High Performance Artificial Intelligence Systems Research Team Leader
- Kento Sato, High Performance Big Data Research Team Leader
- Kazuo Minami, Application Tuning Development Unit Leader
- Akiyoshi Kuroda, Application Tuning Development Unit



#### Cybozu Labs

Technical support

Shigeo Mitsunari

## A64FX preliminary results for Deep Learning



### Setup

- Using the same number of CPU cores
  - FX1000 single node (A64FX 2.2 GHz) vs.
     Xeon Platinum 8268 (24 core, 2.9GHz) x2
- ResNet50 (image classification)
- OpenNMT (natural lang. processing)

### Results

- Performance:
  - Almost the same performance as Xeon
- Energy efficiency:
  - Up to 2.8x more efficient over Xeon









**Press release: Fujitsu, AIST, and RIKEN Achieve Unparalleled Speed on the MLPerf HPC Machine Learning Processing Benchmark Leveraging Leading Japanese Supercomputer Systems**

Fujitsu Limited, National Institute of Advanced Industrial Science and Technology, RIKEN

Tokyo, November 19, 2020

Fujitsu, the National Institute of Advanced Industrial Science and Technology (AIST), and RIKEN today announced a performance milestone in supercomputing, achieving the highest performance and claiming the ranking positions on the MLPerf HPC benchmark. The MLPerf HPC benchmark measures large-scale machine learning processing on a level requiring supercomputers, and the parties achieved these outcomes leveraging approximately half of the "AI-Bridging Cloud Infrastructure" ("ABCI") supercomputer system, operated by AIST, and about 1/10 of the resources of the supercomputer Fugaku, which is currently under joint development by RIKEN and Fujitsu.

Utilizing about half the computing resources of its system, ABCI achieved processing speeds 20 times faster than other GPU-type systems. That is the highest performance among supercomputers based on GPUs, computing devices specialized in deep learning. Similarly, about 1/10 of Fugaku was utilized to set a record for CPU-type supercomputers consisting of general-purpose computing devices only, achieving a processing speed 14 times faster than that of other CPU-type systems.

The results were presented as MLPerf HPC v0.7 on November 18th (November 19th Japan Time) at the 2020 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC20) event, which is currently being held online.

## **FY2020-2022 Fugaku Research Promotion Program (1)**

- Area I Challenges for Solving Universal Problems of Mankind or Pioneering the Future
  - Toward a unified view of the universe: from large scale structures to planets
  - Exploration of burning plasma confinement physics
  - Simulation for basic science: from fundamental laws of particles to creation of nuclei
  - Basic science for emergence and functionality of quantum matter
  - Biomolecular dynamics and function in a living cell using atomistic and coarse-grained simulations
  - Unravelling origin of cancer and diversity by large-scale data analysis and artificial intelligence technology
  - Human-scale whole brain simulation with connectome analysis and structure-function estimation

- Area II Reinforcement of Efforts on Protecting People's Life and Property
  - Overcoming heart failure pandemic with innovative integration of multi-scale heart simulator and large-scale clinical data
  - Large-scale numerical simulation of earthquake generation, wave propagation and soil amplification
  - Large Ensemble Atmospheric and Environmental Prediction for Disaster Prevention and Mitigation
  - Promotion of innovative drug discovery infrastructure for acceleration of precision medicine

Follow-on to the Priority Areas Program --- Real S&T Research Expected to Promote Fugaku

## FY2020-2022 Fugaku Research Promotion Program (2)

#### Area III Enhancement of Industrial Competitiveness

- **Environment-Compatible Chemical Substances**
- Multiscale simulations based on quantum theory toward the development of energy-saving next-generation semiconductor devices
- Digital Twins of Real World's Clean Energy Systems with Integrated Utilization of Supersimulation and AI
- Development of high-performance permanent magnets by large-scale simulation and data-driven approach
- Computational and Data Science Study for ET Revolution by Development of Next-Generation Battery and Fuel Cell
- R&D of innovative fluid-dynamics simulations for aerodynamical/hydrodynamical performance predictions by using Fugaku (R&D of a turbomachinery design simulation system)

- Area II Reinforcement of Efforts on Protecting **People's Life and Property** 
  - R&D of innovative fluid-dynamics simulations for aerodynamical/hydrodynamical performance predictions by using Fugaku (R&D of a vehicle design simulation system)
  - Leading research on innovative aircraft design technologies to replace flight test

#### Area IV Research Infrastructure

Development of personalized medical support technology based on simulation data science of whole brain blood circulation



### US DoE Labs Fugaku Usage via DoE-MEXT Partnership



- As a new phase of the DoE-MEXT collaboration, DoE Labs and ECP have been given opportunities to port their code to Fugaku for evaluation, with support from Riken and Fujitsu, since January 2021
- Despite only 3 months in a brand new (and huge) environment, some teams were very successful in obtaining excellent performance and scalability results.
- Some groups naturally ran into problems, primarily performance issues due to compilers, libraries, etc. Collaboration will continue until at least March 2022 to work on such problems and demonstrate Arm/SVE viability.
- Many thanks to Doug Kothe, Thuc Hoang, Mike Heroux, Lori Diachin, and all the members from the DoE labs and their collaborators that are taking part!

- ECP groups:
  - E3SM-MMF
  - CEED
  - ExaSky
  - ExaGraph
  - Kokkos
  - SLATE & heFFTe & xSDK & PaRSEC & PAPI
  - PETSc/TAO
- ASC and other DOE-MEXT collaborations:
  - StonyBrook (HPE/Cray Ookami)
  - LANL App Performance, FleCSALE
  - LLNL LBANN, MFEM, Spack, SW4
  - SNL App Performance, ATDM Kokkos, SuperContainers, Trilinos
  - ORNL Jeff Vetter's group



#### MEXT Fugaku Program: Fight Against COVID19

Fugaku resources made available a year ahead of general production (more research topics under international solicitation; also joined the US-led COVID-19 High Performance Computing Consortium)



### **Medical-Pharma**

Prediction of conformational dynamics of proteins on the surface of SARS-CoV-2

GENESIS MD to interpolate the unknown, experimentally undetectable dynamic behavior of spike proteins, whose static behavior has been identified via Cryo-EM

(Yuji Sugita, RIKEN)

## Fragment molecular orbital calculations for COVID-19 proteins



Large-scale, detailed interaction analysis of COVID-19 using Fragment Molecular Orbital (FMO) calculations using ABINIT-MP

(Yuji Mochizuki, Rikkyo University)



Large-scale MD to search & identify therapeutic drug candidates showing high affinity for COVID-19 target proteins from 2000 existing drugs

(Yasushi Okuno, RIKEN / Kyoto University)



## Host genetic analysis for severe COVID-19

Whole-genome sequencing of severe cases of COVID-19 and of mild or asymptomatic infections, to identify risk-associated genetic variants for severe disease

(Satoru Miyano, Tokyo Medical and Dental University)

## **Societal-Epidemiology**

*Prediction and Countermeasure for Virus Droplet Infection under the Indoor Environment* 

Massively parallel simulation of droplet scattering with airflow and heat transfer in indoor environments such as commuter trains, offices, classrooms, and hospital rooms



(Makoto Tsubokura, RIKEN / Kobe University)

#### Simulation analysis of pandemic phenomena

Combining simulations & analytics of disease propagation w/ contact tracing apps, economic effects of lockdown, and reflections in social media, for effective mitigation policies

(Nobuyasu Ito, RIKEN)



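Provided by: RIKEN, Toyohashi University of Technology, Kobe University. In cooperation with: Kyoto Institute of Technology, Osaka University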

## **Can Masks Defensively Prevent Infection?**

#### Can masks prevent virus droplets / aerosols from entering the body?

- 3-D Model the upper respiratory tract, simulate how many droplets enter the
- Deep breathing (6 seconds inhale/exhale), simultaneous nose / mouth breathing
- Uniform distribution of droplets / aerosols of varying sizes
- Colors show the size of droplets (red: 100 microns, blue: 1 micron)



- stop at nasal and oral cavities
- Most large droplets are captured by masks, but the number of aerosol droplets reaching the larynx and below is the same irrespective of masks

## Masks effective to a certain degree, but ventilation is as important to disperse aerosols



Logos shown: Kyoto Institute of Technology, Kobe, Osaka University, Kajima, Daikin

## Virus Droplet Simulation & Mitigation - Research, Government, Industry Collaboration





## **Timely Simulations and Media Dissemination**

- We have been staging multiple press conferences on the latest research results

- Extremely high interest from the media, with immediate national news coverage
- Most people in Japan have seen the Fugaku COVID19 news, esp. droplet simulation, with high trust in being scientifically grounded
- Visualization extremely effective in raising public understanding & awareness of COVID19 & its mitigation
- Prime Minister Suga held a press conference on 22 Nov., urging everyone to wear masks even during group dining, as "its effectiveness has been proven by a supercomputer (Fugaku)".



## Fugaku as the Centerpiece of Society 5.0


(background slides: https://www8.cao.go.jp/cstp/english/society5\_0/index.html)





# Where do we go from here? Five pillars of research at R-CCS for future system & Society5.0

1. Further S&T grand challenges R&D + future leadership IT
2. Driving "Simulations First" Society 5.0
3. The <u>science of convergence</u> of 'first-principle simulations', 'empirical AI methods', and 'big data instrumentation' on large scale HPC systems
   - 'Ad-hoc' integration => foundational 'science of computing'
4. **Broadening** of workload analysis and increased generality of HPC to broad IT ecosystem
   - HPC technology fundamental to future IT, from IoT to Clouds
   - Benchmarking & analyzing such workloads for acceleration
5. Platform to investigate <u>new computing paradigms</u>
   - Large scale simulation of quantum, neuromorphic, ...

#### **Riken R-CCS**: Science of Computing, by Computing, for Computing

International core research center in the science of high performance computing (HPC). One of Riken's 13 Research Centers.

- Science of computing: foundational research on computing technologies essential for HPC
  - Development of new computing technologies, architectures, and algorithms toward the "post-Moore" era
  - Research on programing methods, software, and operational technologies
  - Development of methodologies to handle big data and AI for the coming Society 5.0
- Science by computing: research utilizing HPC to address issues in basic science and of public concern
  - Research utilizing analysis and simulation with high resolution and high fidelity in life sciences, engineering, climate and environment, disaster prediction and prevention, material sciences, space and particle physics, and social sciences
  - Development of machine learning applications
- Science for computing: alliance with other scientific disciplines that contribute to the evolution of HPC
  - Development of new electronic devices (and new materials to make them a reality) to enable new concepts of computing, such as photonic, neuromorphic, quantum, and reconfigurable devices
  - New computer architectures and computational models for new devices
  - New algorithms and programing models
  - Acceleration of computation utilizing new computing technologies
  - Analysis and simulation to develop new computing technologies
- Synergies and integration across the three
- 21 research teams + 5 ops units (more to come)
- Alliances with domestic and overseas universities and research institutes, including other research centers in RIKEN
- Alliances with industry
- Fostering of human resources in computational science

For details https://www.r-ccs.riken.jp/en/

[Figure: R-CCS organization chart. Recoverable labels:]

- Science by Computing = Application Sciences
- Area Lead Mitsuhisa Sato (Deputy Director)
- Area Lead Kengo Nakajima (Deputy Director)
- Operational and Computer Technologies Division (& research)
- New/Merger Apr. 1, 2021: HPC- and AI-driven Drug Development Platform Division, Yasushi Okuno (Division Leader)
- Teams and units:
  - Performance Modeling & Instrumentation => New PI public call
  - High Performance AI Systems: S. Matsuoka => New PI public call
  - High Performance Big Data Systems: K. Sato
  - Programming Environment: M. Sato (Area Lead)
  - Advanced Processor Architectures: K. Sano
  - Next Gen High Performance Architecture: M. Kondo
  - Parallel Numerical Technology: T. Imamura
  - Molecular Science: T. Nakajima
  - HPC Engineering Applications: M. Tsubokura
  - Quantum Physics: S. Yunoki
  - Field Theory: Y. Aoki
  - Biophysics: Y. Sugita
  - Discrete Event Simulation: N. Ito
  - Climate Science: H. Tomita
  - Disaster Mitigation & Reduction: S. Oishi
  - Facility Operations and Development: T. Tsukamoto
  - System Operations & Development: A. Uno
  - Advanced Operations Technologies: K. Yamamoto
  - (new) Software Development Technology Unit: F. Shoji (Interim)
  - Biomedical Computational Intelligence: Mitsunori Ikeguchi
  - Molecular Design Computational Intelligence: Teruki Honda
  - AI-driven Drug Discovery Collaborative: Yasushi Okuno



HPC (&Cloud) Infrastructure

## Exploring and Merging Different Routes to O(100,000s) Nodes Deep Learning

[Figure: overview of research results. Panel themes: "Merging Theory and Practice" and "Engineering for Performance Foundation". Panel captions:]

- Graph-based non-intrusive partitioning strategy for large DNN models achieving superlinear scaling [1] (AIST, Koc U.)
- KARMA: Out-of-core distributed training (pure data-parallel) outperforming SoTA NLP models on 2K GPUs [2] (AIST, Matsuoka-lab, RIKEN)
- Model-parallelism enables 3D CNN training on **2K GPUs** with 64x larger spatial size and better convergence [3] (Matsuoka-lab, LLNL, LBL, RIKEN)
- A model-parallel 2nd-order method (K-FAC) trains ResNet-50 on 1K GPUs in 10 minutes [4] (TokyoTech, NVIDIA, RIKEN, AIST)
- Layer-wise distribution and inverse-free design further accelerate K-FAC [5] (UT Austin, UChicago, ANL)
- Layer-wise loop splitting accelerates CNNs [6] (Matsuoka-lab, ETH Zurich)
- MocCUDA: Porting CUDA-based Deep Neural Network Library to A64FX (and other CPU arch.) (RIKEN, Matsuoka-lab, AIST)
- Porting High Performance CPU-based Deep Neural Network Library (DNNL) to A64FX chip (Fujitsu, RIKEN, ARM)

[1] M. Fareed et al., "A Computational-Graph Partitioning Method for Training Memory-Constrained DNNs", Submitted to PPoPP21

[2] M. Wahib et al., "Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA", ACM/IEEE SC20 (Supercomputing 2020)

[3] Y. Oyama et al., "The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs with Hybrid Parallelism," arXiv e-prints, pp. 1–12, 2020.

[4] K. Osawa et al., "Large-scale distributed second-order optimization using kronecker-factored approximate curvature for deep convolutional neural networks," Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 2019-June, pp. 12351–12359, 2019.

[5] J. G. Pauloski, Z. Zhang, L. Huang, W. Xu, and I. T. Foster, "Convolutional Neural Network Training with Distributed K-FAC," arXiv e-prints, pp. 1–11, 2020.

[6] Y. Oyama et al., "Accelerating Deep Learning Frameworks with Micro-Batches," Proc. IEEE Int. Conf. Clust. Comput. ICCC, vol. 2018-September, pp. 402–412, 2018.

# HPC and AI Convergence for Society 5.0 Manufacturing [Tsubokura et al., R-CCS]



- Combining ML/Deep Learning, Data Assimilation, Multivariate Optimization with Simulation for new generation manufacturing
- Use output of high-resolution simulation data to train AI
  - Construct AI surrogate model training on simulation data, allowing real-time CFD to facilitate designer-engineer collaboration, multivariate design optimization, etc.
  - Use NN to derive reduced simulation model, allowing digital twin in cyberspace corresponding to entities in physical space for real-time interactions



## Example: Intelligent Genomics/Proteomics Drug Design based on Convergence [New HPC/AI pharma division @ R-CCS]





HPC (&Cloud) Infrastructure

#### Performance projection of many-core CPU systems based on IRDS roadmap

Predictions based on the IRDS Roadmap (2020 ed.): if a machine with broad applicability is to be built, extrapolating traditional many-core architectures that rely merely on advances in semiconductor technology will achieve only 1.8 EFLOPS peak (3.37x c.f. Fugaku)

From the **NGACI** white paper: https://sites.google.com/view/ngaci/home

#### Methodologies (CPU part): Assumptions from IRDS Roadmap Systems and Architectures

- Cores/socket = 70 cores
- SIMD width = 2048-bit x 2
- Clock frequency = 3.9 GHz
- Socket TDP = 351 W
- System assumptions
  - System power = 30, 40, 50 MW
  - PUE = 1.1
  - Share of system power consumed by CPUs = 60, 70, 80%

| System power                         | 30 MW   | 30 MW   | 30 MW   | 40 MW   | 40 MW   | 40 MW   | 50 MW   | 50 MW   | 50 MW   |
|--------------------------------------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| CPU power share                      | 60%     | 70%     | 80%     | 60%     | 70%     | 80%     | 60%     | 70%     | 80%     |
| Sockets                              | 46620   | 54390   | 62160   | 62160   | 72520   | 82880   | 77700   | 90650   | 103600  |
| Cores                                | 3.3×10⁶ | 3.8×10⁶ | 4.4×10⁶ | 4.4×10⁶ | 5.1×10⁶ | 5.8×10⁶ | 5.4×10⁶ | 6.3×10⁶ | 7.3×10⁶ |
| Peak performance (PFLOPS)            | 815     | 950     | 1086    | 1086    | 1267    | 1448    | 1358    | 1584    | 1810    |
| DDR total BW (PB/s)                  | 102     | 120     | 137     | 137     | 160     | 182     | 171     | 200     | 228     |
| HBM total BW (PB/s)                  | 307     | 358     | 410     | 410     | 478     | 547     | 512     | 598     | 683     |
| DDR total capacity (PB)              | 17      | 20      | 23      | 23      | 27      | 31      | 29      | 34      | 39      |
| HBM total capacity (PB)              | 4       | 5       | 5       | 5       | 6       | 7       | 7       | 8       | 9       |
| Injection BW (Tb/s)                  | 1.6     | 1.6     | 1.6     | 1.6     | 1.6     | 1.6     | 1.6     | 1.6     | 1.6     |
| Storage total I/O performance (TB/s) | 34      | 34      | 34      | 34      | 34      | 34      | 34      | 34      | 34      |
| Storage (EBytes)                     | 3.45    | 3.45    | 3.45    | 3.45    | 3.45    | 3.45    | 3.45    | 3.45    | 3.45    |

Even the most aggressive system configuration (50 MW power budget, 80% of power consumed by CPUs) is predicted to reach only about 1.8 EF.
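
The projection can be reproduced from the assumptions listed above. This is a rough sketch, assuming the socket count is the CPU share of IT power (system power divided by PUE) divided by socket TDP, and that each core delivers 64 FP64 flops per cycle; the 64 flops/cycle figure is our inference, chosen because it reproduces the roughly 1.8 EFLOPS headline, and is not stated in the source.

```python
CORES_PER_SOCKET = 70
CLOCK_GHZ = 3.9
SOCKET_TDP_W = 351.0
PUE = 1.1
FLOPS_PER_CYCLE_PER_CORE = 64  # assumption inferred from the 1.8 EFLOPS result


def socket_count(system_power_mw, cpu_power_share):
    """Number of sockets that fit in the CPU share of the IT power budget."""
    it_power_w = system_power_mw * 1e6 / PUE
    return int(it_power_w * cpu_power_share / SOCKET_TDP_W)


def peak_pflops(sockets):
    """Peak FP64 performance in PFLOPS for a given socket count."""
    gflops_per_socket = CORES_PER_SOCKET * CLOCK_GHZ * FLOPS_PER_CYCLE_PER_CORE
    return sockets * gflops_per_socket / 1e6


if __name__ == "__main__":
    for power_mw in (30, 40, 50):
        for share in (0.6, 0.7, 0.8):
            n = socket_count(power_mw, share)
            print(f"{power_mw} MW, {share:.0%} CPU: {n} sockets, {peak_pflops(n):.0f} PFLOPS")
    # Ratio of the most aggressive configuration to Fugaku's 537 PF boost peak
    print(f"vs Fugaku: {peak_pflops(socket_count(50, 0.8)) / 537:.2f}x")
```

The most aggressive case (50 MW, 80% CPU share) gives about 1.8 EFLOPS, in line with the slide's conclusion that semiconductor scaling alone falls well short of a 100x improvement.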

[Figure: transition from the Many Core Era to the Post Moore Cambrian Era, with an "M-P Extinction Event" at ~2025]

Many Core Era:

- Flops-Centric Monolithic Algorithms and Apps
- Flops-Centric Monolithic System Software
- Hardware/Software System APIs
- Flops-Centric Massively Parallel Architecture
- Loosely Coupled with Electronic Interconnect
- Transistor Lithography Scaling (CMOS Logic Circuits, DRAM/SRAM)

Post Moore Cambrian Era:

- Cambrian Heterogeneous Algorithms and Apps
- Cambrian Heterogeneous System Software
- Hardware/Software System APIs
- "Cambrian" Heterogeneous Architecture
- Novel Devices + CMOS (Dark Silicon) (Nanophotonics, Non-Volatile Devices etc.)

## **Post-Moore Algorithmic Development**

### Towards the 2030 Post-Moore era

- End of ALU compute (FLOPS) advance
- Disruptive reduction in data movement cost with new devices, packaging
- Algorithm advances to reduce the computational order (+ more reliance on data movement)
- Unification of BD/AI/Simulation towards data-centric view

#### Categorization of Algorithms and Their Domains

- "New problem domains require new computing accelerators"
- In practice challenging, due to algorithms & programming





## Our Project: Exploring versatile HPC architecture and system software technologies to achieve 100x performance by 2028

#### Problems to be solved and goals to be achieved

- General-purpose computer architectures that will accelerate a wide range of applications in the post-Moore era have not yet been established.
- What is a feasible approach for versatile HPC systems based on bandwidth improvement?
- **Goal:** to explore architectures that can achieve 100x performance in a wide range of applications around 2028

