OK, I admit I was proven wrong: Sunway TaihuLight, 93P

Source: Baidu Wenku | Editor: 超级军网 | Time: 2024/04/30 13:39:37
http://www.top500.org/news/china ... 3-petaflop-machine/

Lots of architectural details, though I'm not sure how accurate they are.

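Was there some kind of bet riding on this?
Can the TOP500 list be downloaded?

It asks me to register, and then I never receive the verification email (163 mailbox).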


TaihuLight info: http://www.netlib.org/utk/people ... way-report-2016.pdf

Each processor connects to four 128-bit DDR3-2133 memory controllers, with a memory bandwidth of 136.51 GB/s
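The 136.51 GB/s figure can be reproduced from the quoted configuration. This is only a back-of-the-envelope sketch: it assumes DDR3-2133 means 2133 million transfers per second on each 128-bit channel and that GB is decimal. The function name is made up for illustration.

```python
def peak_memory_bandwidth_gbs(controllers: int, bus_width_bits: int,
                              mega_transfers_per_s: float) -> float:
    """Theoretical peak memory bandwidth in decimal GB/s."""
    bytes_per_transfer = bus_width_bits / 8          # 128 bits -> 16 bytes
    bytes_per_s = controllers * bytes_per_transfer * mega_transfers_per_s * 1e6
    return bytes_per_s / 1e9

# Four 128-bit DDR3-2133 controllers, as quoted from the report
bw = peak_memory_bandwidth_gbs(controllers=4, bus_width_bits=128,
                               mega_transfers_per_s=2133)
print(f"{bw:.2f} GB/s per processor")
```

This is a theoretical ceiling; sustained bandwidth in real code would be lower.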



---------------------------------------------------------------------------------------------
Power Efficiency       

The peak power consumption under load (running the HPL benchmark) is at 15.371 MW or 6 Gflops/W. This is just for the processor, memory, and interconnect network.

The cooling system used is a closed-coupled chilled water cooling with a customized liquid water-cooling unit.

---------------------------------------------------------------------------------------------
The Interconnect
Sunway has built their own interconnect. There is a five-level integrated hierarchy, connecting the computing node, computing board, super-nodes, cabinet, to the complete system. Each card has two nodes, see figure 6.

Nodes are connected using PCI-E 3.0 connections in what’s called a Sunway Network

[TOP500 article]
The interconnect, simply known as the Sunway Network, is also a homegrown affair. It’s noteworthy that the older Sunway BlueLight machine employed QDR InfiniBand for the system network. The TaihuLight one, however, is based on PCIe 3.0 technology, and provides 16 GB/second of node-to-node peak bandwidth, with a latency of around 1 microsecond. Running MPI communications over it slows that down to about 12 GB/second. Such performance is pretty much on par with EDR InfiniBand or even 100G Ethernet, although the latency seems a tad high (it depends on exactly what’s being measured, of course). In any case, it looks like the design team opted for simplicity here, rather than breakneck speeds using exotic technology.


74.15% efficient (peak at 125 Pflop/s)

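Domestic chip + insane performance + good power efficiency. How am I supposed to trash it? Waiting online, it's kind of urgent.
260 cores, awesome.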


Weaknesses:

The HPCG performance at only 0.3% of peak performance shows the weakness of the Sunway TaihuLight architecture with slow memory and modest interconnect performance. The ratio of floating point operations per byte of data from memory on the SW26010 is 22.4 Flops(DP)/Byte transfer, which shows an imbalance or an overcapacity of floating point operations per data transfer from memory. By comparison, the Intel Knights Landing processor has 7.2 Flops(DP)/Byte transfer. So for many “real” applications the performance on the TaihuLight will be nowhere near the peak performance rate. Also the primary memory for this system is on the low side at 1.3 PB (Tianhe-2 has 1.4 PB and Titan has 0.71 PB).
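The 22.4 flops-per-byte figure can be roughly checked from numbers quoted elsewhere in this thread. The sketch below assumes the 125 Pflop/s system peak is spread evenly over 40,960 single-chip nodes and uses the 136.51 GB/s per-chip memory bandwidth; the variable names are only illustrative.

```python
PEAK_PFLOPS = 125.0     # system peak quoted in the thread
NODES = 40_960          # one SW26010 per node
MEM_BW_GBS = 136.51     # per-chip memory bandwidth quoted from the report

# 1 Pflop/s = 1e6 Gflop/s
node_peak_gflops = PEAK_PFLOPS * 1e6 / NODES

# Machine balance: peak double-precision flops per byte moved from memory
flops_per_byte = node_peak_gflops / MEM_BW_GBS

print(f"node peak: about {node_peak_gflops:.0f} Gflop/s")
print(f"balance:   about {flops_per_byte:.1f} flops per byte")
```

The higher this ratio, the more arithmetic a kernel must do per byte fetched before the floating-point units are kept busy, which is why memory-bound benchmarks such as HPCG land so far below peak.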

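robin4268 posted on 2016-6-20 16:34
Domestic chip + insane performance + good power efficiency. How am I supposed to trash it? Waiting online, it's kind of urgent.
A vanity project, a waste of money and manpower, low utilization, and all the software is imported.
Homegrown SW64 instruction set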

Shenwei-64 Instruction Set (this is NOT related to the DEC Alpha instruction set)
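Even the cooling is imported, how embarrassing. Huaqiangbei couldn't sort that out?
Let me pick out the technical details and translate them.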

China Tops Supercomputer Rankings with New 93-Petaflop Machine

Michael Feldman, June 20, 2016, 8:34 a.m.

A new Chinese supercomputer, the Sunway TaihuLight, captured the number one spot on the latest TOP500 list of supercomputers released on Monday morning at the International Supercomputing Conference (ISC) being held in Frankfurt, Germany.  With a Linpack mark of 93 petaflops, the system outperforms the former TOP500 champ, Tianhe-2, by a factor of three. The machine is powered by a new ShenWei processor and custom interconnect, both of which were developed locally, ending any remaining speculation that China would have to rely on Western technology to compete effectively in the upper echelons of supercomputing.

TaihuLight is currently up and running at the National Supercomputing Center in the city of Wuxi, a manufacturing and technology hub, a two-hour drive west of Shanghai. The system will be used for various research and engineering work, in areas such as climate, weather & earth systems modeling, life science research, advanced manufacturing, and data analytics. Center director Prof. Dr. Guangwen Yang, will formally introduce the system on Tuesday afternoon, in a session at ISC.

“As the first number one system of China that is completely based on homegrown processors, the Sunway TaihuLight system demonstrates the significant progress that China has made in the domain of designing and manufacturing large-scale computation systems,” Yang told TOP500 News.

The supercomputer was developed by the National Research Center of Parallel Computer Engineering & Technology (NRCPC), the same organization that designed TaihuLight’s predecessor, the Sunway BlueLight system, which is installed at the National Supercomputing Center in Jinan. BlueLight is a 796-teraflop supercomputer, which was deployed in 2011. It currently resides at number xx on the TOP500 list.

BlueLight is powered by an older version of the ShenWei processor, a third-generation 16-core chip, known as the SW1600, which tops out at about 140 gigaflops. In the five years since that system came online, NRCPC developed a much more powerful processor, the SW26010, a 260-core chip that can crank out just over 3 teraflops. TaihuLight has a single SW26010 in each of its 40,960 nodes, which adds up to 125 peak petaflops across the entire machine (more than 10 million cores). Linpack, of course, is going to leave some FLOPS on the table, but 93 petaflops represents a respectable 74 percent yield of peak performance.
The SW26010 is a 260-core chip rated at 3 Tflops. Each of the 40,960 nodes of THL (TaihuLight) has one SW26010, for a total of 125 Pflops. That is a purely theoretical figure, of course, but even at 74% efficiency the final Linpack result is still 93 Pflops.
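The peak and efficiency numbers can be rebuilt from per-core figures given elsewhere in this thread. This is a sketch under assumptions: a 1.45 GHz clock, 8 flops per cycle for each compute core, and two pipelines of 8 flops per cycle for each management core. The constant names are made up for illustration.

```python
CLOCK_GHZ = 1.45
CORE_GROUPS = 4
CPES_PER_GROUP = 64           # compute cores per core group
MPES_PER_GROUP = 1            # management cores per core group
CPE_FLOPS_PER_CYCLE = 8
MPE_FLOPS_PER_CYCLE = 2 * 8   # assumption: two pipelines, 8 flops per cycle each
NODES = 40_960
HPL_PFLOPS = 93.0

cores_per_chip = CORE_GROUPS * (CPES_PER_GROUP + MPES_PER_GROUP)
flops_per_cycle = CORE_GROUPS * (CPES_PER_GROUP * CPE_FLOPS_PER_CYCLE
                                 + MPES_PER_GROUP * MPE_FLOPS_PER_CYCLE)

chip_peak_tflops = flops_per_cycle * CLOCK_GHZ / 1e3   # Gflop/s -> Tflop/s
system_peak_pflops = chip_peak_tflops * NODES / 1e3    # Tflop/s -> Pflop/s
efficiency = HPL_PFLOPS / system_peak_pflops

print(f"cores per chip: {cores_per_chip}")
print(f"chip peak:      {chip_peak_tflops:.2f} Tflop/s")
print(f"system peak:    {system_peak_pflops:.1f} Pflop/s")
print(f"HPL efficiency: {efficiency:.1%}")
```

With the rounded 93.0 Pflop/s Linpack figure, the result lands very close to the 74.15% quoted above.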

At 3 teraflops, the new ShenWei silicon is on par with Intel’s “Knights Landing” Xeon Phi, another manycore design, but one with a much more public history. In a bit of related irony, it was the US embargo of high-end processors, such as the Xeon Phi, imposed on a number of Chinese supercomputing centers in April 2015, which precipitated a more concerted effort in that country to develop and manufacture such chips domestically. The embargo probably didn’t impact the TaihuLight timeline, since it was already set to get the new ShenWei parts. But it was widely thought that Tianhe-2 was in line to get an upgrade using Xeon Phi processors, which would have likely raised its performance into 100-petaflop territory well before the Wuxi system came online.
Compared with Intel's Xeon Phi: it holds its own head-to-head.

Like its earlier incarnations, this latest ShenWei is a 64-bit RISC processor, with SIMD instruction support and out-of-order execution. Its underlying architecture is somewhat of a mystery, although it’s been speculated that the design was derived from the DEC Alpha architecture. The instruction set is specified simply as ShenWei-64.
Little is known about the ShenWei processor; architectural details are scarce.

The processor is divided into four core groups, each with 64 computing processing elements (CPE) and a management processing element (MPE). Each core group also includes a memory controller delivering an aggregate memory bandwidth of 136.5 GB/second on each socket. As one might expect of a manycore design, it runs at a relatively modest 1.45 GHz and supports just a single execution thread per core. The chip was manufactured at the National High Performance Integrated Circuit Design Center, in Shanghai. The process technology node has not been revealed.
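The processor's cores are divided into four core groups, each with 64 compute cores and one management core. Each group has its own memory controller, for 136.5 GB/s of memory bandwidth across the whole chip. The clock is 1.45 GHz, with a single thread per core. (Skipping the part about who designed and manufactured it...)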

Memory-wise, each node contains 32 GB, adding up to a little over 1.3 PB for the whole machine. While that seems like a lot, it’s not much memory considering the number of cores it must feed. The much smaller 10-petaflop K supercomputer at RIKEN, for example, is outfitted with 1.4 PB of memory, and most of the other large systems on TOP500 list have much better bytes-to-FLOPS ratios than that of TaihuLight. It also relies on the older DDR3 technology, which is slower and more power-hungry than the newer DDR4 memory.
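Each node has 32 GB of memory, 1.3 PB for the whole machine. The number looks impressive, but it seems far below what the processors need. RIKEN's much smaller K supercomputer has only 10 Pflops of compute yet carries 1.4 PB of memory, and most HPC systems on the TOP500 list have far better memory-to-compute ratios than THL. THL also uses the older DDR3 memory rather than the newer DDR4.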
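A quick sketch of the memory-to-compute comparison, using only the figures in the article (32 GB per node, 40,960 nodes, 93 Pflops Linpack; 1.4 PB and roughly 10 Pflops for K). It assumes decimal units (1 PB = 1e6 GB), and the helper name is hypothetical.

```python
NODES = 40_960
GB_PER_NODE = 32

total_pb = NODES * GB_PER_NODE / 1e6   # decimal units: 1 PB = 1e6 GB

def bytes_per_flop(memory_pb: float, linpack_pflops: float) -> float:
    """Bytes of main memory per flop/s of Linpack performance (the peta prefixes cancel)."""
    return memory_pb / linpack_pflops

taihulight = bytes_per_flop(total_pb, 93.0)
k_computer = bytes_per_flop(1.4, 10.0)

print(f"TaihuLight total memory: {total_pb:.2f} PB")
print(f"TaihuLight: {taihulight:.3f} bytes per flop/s")
print(f"K computer: {k_computer:.3f} bytes per flop/s")
print(f"K has about {k_computer / taihulight:.0f}x more memory per flop/s")
```

On these numbers K carries roughly ten times as much memory per unit of Linpack performance.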

The system is also rather light on cache. In fact, it really doesn’t have any in the L1-L2-L3 sense. Each core is allocated 12 KB of instruction cache, along with 64 KB of local scratchpad. And that’s it. The scratchpad can be used like a level 1 cache to some degree, but without the L2 and L3 levels to buttress it, there’s not a whole lot of capability to speed up memory accesses.
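The (THL) system is also frugal with cache. In fact it has no L1/L2/L3 hierarchy as such: each core has a 12 KB instruction cache and a 64 KB data scratchpad, and that is all the cache there is. The scratchpad can work somewhat like an L1 cache, but without L2/L3 behind it, it won't do much to speed up memory access overall.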

From a power standpoint though, TaihuLight is quite good. It draws 15.3 megawatts (MW) running Linpack, which, somewhat surprisingly, is less power than its 33-petaflop cousin, Tianhe-2, which uses 17.8 MW. TaihuLight’s energy-efficiency of 6 gigaflops/watt is excellent, which will certainly earn it a place in the upper reaches of the Green500 list. Keep in mind though, if the system had a more reasonable amount of memory for its size, it would draw significantly more power and its energy efficiency would suffer accordingly.
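From a power standpoint THL is very good. Running Linpack it draws 15.3 MW, which is even less than its cousin, the 33 Pflops Tianhe-2, at 17.8 MW. THL's 6 Gflops/W is excellent, enough to put it on the Green500 list. Note, though, that if the system were fitted with a mainstream amount of memory, total power draw would rise substantially.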
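The efficiency comparison is simple unit arithmetic. Here is a minimal sketch using the article's rounded figures (93 Pflops at 15.3 MW for TaihuLight, 33 Pflops at 17.8 MW for Tianhe-2); the function name is made up for illustration.

```python
def gflops_per_watt(linpack_pflops: float, power_mw: float) -> float:
    """Convert a Linpack result (Pflop/s) and a power draw (MW) to Gflop/s per watt."""
    gflops = linpack_pflops * 1e6   # 1 Pflop/s = 1e6 Gflop/s
    watts = power_mw * 1e6          # 1 MW = 1e6 W
    return gflops / watts

taihulight = gflops_per_watt(93.0, 15.3)
tianhe2 = gflops_per_watt(33.0, 17.8)

print(f"TaihuLight: {taihulight:.2f} Gflops/W")
print(f"Tianhe-2:   {tianhe2:.2f} Gflops/W")
print(f"ratio:      {taihulight / tianhe2:.1f}x")
```

So on these figures TaihuLight delivers a bit over three times the Linpack performance per watt of Tianhe-2.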

The interconnect, simply known as the Sunway Network, is also a homegrown affair. It’s noteworthy that the older Sunway BlueLight machine employed QDR InfiniBand for the system network. The TaihuLight one, however, is based on PCIe 3.0 technology, and provides 16 GB/second of node-to-node peak bandwidth, with a latency of around 1 microsecond. Running MPI communications over it slows that down to about 12 GB/second. Such performance is pretty much on par with EDR InfiniBand or even 100G Ethernet, although the latency seems a tad high (it depends on exactly what’s being measured, of course). In any case, it looks like the design team opted for simplicity here, rather than breakneck speeds using exotic technology.
The node interconnect, named the Sunway Network (SWN), is also homegrown. Its predecessor, Sunway BlueLight, used standard InfiniBand, whereas THL is based on PCIe 3.0 and reaches 16 GB/s of node-to-node bandwidth with a latency of 1 microsecond; at the upper (MPI) layer the rate is about 12 GB/s. That puts it level with EDR InfiniBand or even 100G Ethernet, although the latency looks a little high (depending, of course, on how it is measured). It looks like the design team once again preferred a simpler design over bringing in outside technology.
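To put the "on par with EDR InfiniBand or 100G Ethernet" remark in the same units, here is a small sketch. It assumes the nominal 100 Gb/s line rate for both a 4x EDR InfiniBand link and 100G Ethernet and ignores protocol overhead, so it is a rough comparison rather than a measurement.

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert Gb/s to GB/s (8 bits per byte, decimal units)."""
    return gbit_per_s / 8

sunway_peak_gbs = 16.0   # node-to-node peak quoted in the article
sunway_mpi_gbs = 12.0    # rate over MPI quoted in the article
nominal_100g_gbs = gbit_to_gbyte(100.0)   # assumed line rate for EDR 4x and 100GbE

mpi_fraction = sunway_mpi_gbs / sunway_peak_gbs

print(f"100 Gb/s link:       {nominal_100g_gbs:.1f} GB/s")
print(f"Sunway over MPI:     {sunway_mpi_gbs:.1f} GB/s")
print(f"MPI fraction of peak: {mpi_fraction:.0%}")
```

The 12 GB/s MPI figure sits just under the 12.5 GB/s nominal rate of a 100 Gb/s link, which is consistent with the article's "on par" wording.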

Likewise, for the operating system. The Sunway Raise OS, as it’s called, uses standard Linux as the base, along with the necessary tweaks to make it work with the custom TaihuLight architecture. Other parts of the system software are also pretty standard – compilers for C/C++ and Fortran, along with the associated math libraries. All, of course, required ports to the custom ShenWei architecture and instruction set, but presumably much of that development work had already been done for the previous-generation processors.
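The operating system, Sunway RaiseOS, is a fairly standard implementation based on Linux with platform-specific tailoring. It comes with C/C++ and Fortran compilers and math libraries. All of these had to be ported to the ShenWei architecture, of course, but presumably most of that work was already done on the previous-generation chips.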

According to TOP500 author Jack Dongarra, three scientific simulation codes run on TaihuLight have been chosen as Gordon Bell Prize finalists, two of which have managed to reach a sustained performance of 30 to 40 petaflops. The award is bestowed each year on the most noteworthy HPC application, based on “peak performance or special achievements in scalability and time-to-solution on important science and engineering problems.”
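According to TOP500 list author Jack Dongarra, three scientific simulation codes deployed on THL have made the Gordon Bell Prize shortlist, and two of them reach a sustained performance of 30 to 40 Pflops. The prize is awarded once a year to the most noteworthy HPC application, based on “peak performance or special achievements in scalability and time-to-solution on important science and engineering problems.”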

In a paper written by Dongarra and published on June 20, he describes these applications and also provides a deep dive into the TaihuLight architecture (upon which much of the information in this article was based). The paper also offers some interesting comparisons to other supercomputers. While Dongarra does have reservations about some elements of the new machine’s design, he concludes: “The fact that there are sizeable applications and Gordon Bell contender applications running on the system is impressive and shows that the system is capable of running real applications and [is] not just a stunt machine.”
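In the paper Dongarra published on June 20, he describes these applications and takes a deep dive into the THL architecture (most of the information in this article comes from it). The paper also offers some interesting comparisons with other HPC systems. While Dongarra has reservations about some aspects of the new machine's design, he concludes: “The fact that there are sizeable applications and Gordon Bell contender applications running on the system is impressive and shows that the system is capable of running real applications and [is] not just a stunt machine.”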
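The first “single” computer in history whose theoretical floating-point speed exceeds 100 quadrillion operations per second (100 Pflop/s)!
The memory and interconnect specs are still a bit low.

Memory especially; no idea why they skimped on it.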


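oldwatch posted on 2016-6-20 17:02
The memory and interconnect specs are still a bit low.

Memory especially; no idea why they skimped on it.
The interconnect is at about the level of EDR InfiniBand and 100G Ethernet.

The CPU's memory bandwidth really is a serious weakness: it still uses DDR3-2133. And there is surprisingly little memory.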


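The interconnect is roughly on par with mainstream international off-the-shelf products, but still short of top-tier solutions.
Given that it is fully homegrown, that's good enough.

As for the memory, all I can say is "heh". Surely nobody is going to claim the architecture is so clever that the memory wall doesn't matter.

Still, hitting 70-plus percent Linpack efficiency with 260 cores per chip is mind-blowing.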


Related info.
http://www.netlib.org/utk/people ... way-report-2016.pdf
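ShenWei 26010: 3 Tflops peak double-precision floating point, 260 compute cores. What level is that? Does anyone here know?
西门吸血 posted on 2016-6-20 17:08
Related info.
http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf
I'm more interested in this one.
robin4268 posted on 2016-6-20 17:10
ShenWei 26010: 3 Tflops peak double-precision floating point, 260 compute cores. What level is that? Does anyone here know?
Its floating-point performance is at the level of Intel's KNL.
For rankings like this, is the data self-reported, or does the “organization” send people to measure on site?

Or is it tested “remotely” over the network?
oldwatch posted on 2016-6-20 17:07
The interconnect is roughly on par with mainstream international off-the-shelf products, but still short of top-tier solutions.
Given that it is fully homegrown, that's good enough.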
Each CPE Cluster is composed of a Management Processing Element (MPE) which is a 64-bit
RISC core which is supporting both user and system modes, a 264-bit vector instructions, 32 KB
L1 instruction cache and 32 KB L1 data cache, and a 256KB L2 cache. The Computer
Processing Element (CPE) is composed of an 8x8 mesh of 62-bit RISC cores, supporting only
user mode, with a 264-bit vector instructions, 16 KB L1 instruction cache and 64 KB Scratch
Pad Memory (SPM).

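Are “with a 264-bit vector instructions” and “62-bit RISC cores” in this passage errors?
There is a lot in the paper.

Compute cores: single pipeline, 8 flops per cycle. Management cores: dual pipeline, 8 flops per cycle per pipeline.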

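Topology:
Two nodes (sockets) per card

Four cards (eight nodes) per board

32 boards (256 nodes) per super-node

Four super-nodes per cabinet

40 cabinets in total

The paper says the 32 nodes within a super-node are fully connected,
but judging from the diagram, there seems to be more than one hop between nodes.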
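The hierarchy can be multiplied out to check the totals. This sketch assumes eight nodes per board (four two-node cards), which is what the figure of 256 nodes per 32 boards implies, and 260 cores per node; the constant names are only illustrative.

```python
NODES_PER_CARD = 2
CARDS_PER_BOARD = 4          # assumption: 8 nodes per board = 256 nodes / 32 boards
BOARDS_PER_SUPERNODE = 32
SUPERNODES_PER_CABINET = 4
CABINETS = 40
CORES_PER_NODE = 260         # one SW26010 per node

nodes_per_board = NODES_PER_CARD * CARDS_PER_BOARD
nodes_per_supernode = nodes_per_board * BOARDS_PER_SUPERNODE
nodes_per_cabinet = nodes_per_supernode * SUPERNODES_PER_CABINET
total_nodes = nodes_per_cabinet * CABINETS
total_cores = total_nodes * CORES_PER_NODE

print(f"nodes per super-node: {nodes_per_supernode}")
print(f"nodes per cabinet:    {nodes_per_cabinet}")
print(f"total nodes:          {total_nodes:,}")
print(f"total cores:          {total_cores:,}")
```

The totals agree with the 40,960 nodes and 10,649,600 cores quoted elsewhere in the thread.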



The HPCG performance at only 0.3% of peak performance shows the weakness of the  Sunway TaihuLight architecture
with slow memory and modest interconnect performance.
==========
Sure enough, the memory and interconnect got criticized...

The HPCG list from the end of last year:
http://www.hpcg-benchmark.org/cu ... id=155&slid=282

This year's:

June 20, 2016
Table 7: HPCG Performance for Top 10 Systems (HPL and HPCG in Pflop/s)

Rank  Site               Cores       HPL    HPCG   HPCG/HPL  % of Peak  Platform
1     Tianhe             3,120,000   33.86  0.580  1.7%      1.1%       Tianhe 2 NUDT, Xeon 12C 2.2GHz + Intel Xeon Phi 57C + Custom
2     RIKEN K computer   705,024     10.51  0.550  5.2%      4.9%       K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect
3     Sunway TaihuLight  10,649,600  93.0   0.371  0.4%      0.3%       Sunway TaihuLight System 1.45 GHz + Custom
4     Titan              560,640     17.59  0.322  1.8%      1.2%       Titan - Cray XK7, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x
5     Trinity            301,056     8.10   0.182  2.3%      1.6%
6     Mira               786,432     8.58   0.167  1.9%      1.7%
7     Pleiades           185,344     4.08   0.156  3.8%      2.7%
8     Hazel Hen          185,088     5.64   0.138  2.4%      1.9%
9     Piz Daint          115,984     6.27   0.124  2.0%      1.6%
10    Shaheen II         196,608     5.53   0.113  2.1%      1.6%
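The HPCG/HPL column can be recomputed from the HPL and HPCG columns. A short sketch for the top four rows, assuming both columns are in Pflop/s as in the table; the helper name is made up for illustration.

```python
# (system, HPL in Pflop/s, HPCG in Pflop/s), copied from the first four table rows
rows = [
    ("Tianhe-2", 33.86, 0.580),
    ("K computer", 10.51, 0.550),
    ("Sunway TaihuLight", 93.0, 0.371),
    ("Titan", 17.59, 0.322),
]

def hpcg_over_hpl_percent(hpl: float, hpcg: float) -> float:
    """HPCG result as a percentage of the HPL result."""
    return 100.0 * hpcg / hpl

ratios = {name: hpcg_over_hpl_percent(hpl, hpcg) for name, hpl, hpcg in rows}

for name, pct in ratios.items():
    print(f"{name:18s} {pct:4.1f}%")
```

TaihuLight has by far the highest HPL number here but the lowest HPCG/HPL ratio, which is the imbalance the report is pointing at.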

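oldwatch posted on 2016-6-20 17:33
There is a lot in the paper.

Compute cores: single pipeline, 8 flops per cycle. Management cores: dual pipeline, 8 flops per cycle per pipeline.
Someone on Zhihu says it ranks 3rd on the Green500.

I wonder whether they ran the Graph500 test; that should say more about overall performance.
Domestic chip + insane performance + good power efficiency. How am I supposed to trash it? Waiting online, it's kind of urgent.
The software and applications can't keep up; it's just a project done for show...

Please send the five dollars via Alipay, thanks.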
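HhJjKcScS posted on 2016-6-20 17:19
For rankings like this, is the data self-reported, or does the “organization” send people to measure on site?

Or is it tested “remotely” over the network?
Linpack results are self-reported; machines that don't submit won't appear on the list.
In fact there have always been machines that don't bother with TOP500, either for secrecy or because they are special-purpose systems for particular fields.
Newbie here: can an expert explain whether this thing really is super awesome?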
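mips64el posted on 2016-6-20 17:42
Someone on Zhihu says it ranks 3rd on the Green500.

I wonder whether they ran the Graph500 test; that should say more about overall performance.
On Graph500, Fujitsu's K is already far out in front~~
The chip is 40 nm, right? And it's as awesome as what the US builds on a 14 nm process? The US can drop dead.
The US really is trash; I hear it will take them another 2 years to build a 100 Pflops machine.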
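1771964382 posted on 2016-6-20 19:40
The chip is 40 nm, right? And it's as awesome as what the US builds on a 14 nm process? The US can drop dead.
Loongson, whose efficiency and process are about the same as the ShenWei 1500's, only matches an Intel i5 from 5 years ago. This SW1500 is a mystery.
The memory and interconnect specs are still a bit low.

Memory especially; no idea why they skimped on it.
My guess: (1) it's enough for the job, (2) a design shortcoming...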
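失落的天堂 posted on 2016-6-20 19:07
Linpack results are self-reported; machines that don't submit won't appear on the list.
In fact there have always been machines that don't bother with TOP500, either for secrecy ...
Yeah, last year's list suddenly seemed to gain a whole bunch of Sugon (or was it Sunway?) machines.

All of them machines that had been online for a while without ever submitting Linpack results, ganging up to pad the list, haha.
neohaly posted on 2016-6-20 20:05
Loongson, whose efficiency and process are about the same as the ShenWei 1500's, only matches an Intel i5 from 5 years ago. This SW1500 is a mystery.
Didn't it say above that it's about on par with Intel's latest Knights Landing?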

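neohaly posted on 2016-6-20 20:05
Loongson, whose efficiency and process are about the same as the ShenWei 1500's, only matches an Intel i5 from 5 years ago. This SW1500 is a mystery.


What takes up the most die area and transistors isn't the logic; it's the cache hierarchy.

All the caches on the chip in the top post add up to only a few MB, and the clock speed is modest too.

Logic alone shouldn't demand much in the way of integration density or process technology.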
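oldwatch posted on 2016-6-20 17:02
The memory and interconnect specs are still a bit low.

Memory especially; no idea why they skimped on it.
They couldn't push the bandwidth any higher; as for capacity, maybe there was no room left on the board.
huorang posted on 2016-6-20 19:03
The software and applications can't keep up; it's just a project done for show...

Please send the five dollars via Alipay, thanks.
Alipay requires real-name registration, which is risky; I suggest using a WeChat account that isn't bound to a phone number.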