Details of Tianhe-2 have been revealed: it uses a solution supplied by Intel...

Source: 百度文库 (Baidu Wenku)  Editor: 超级军网  Time: 2024/04/29 15:50:34


Reposted from: www.hpcwire.com/hpcwire/2013-06- ... _supercomputer.html

At the end of May, an international group of high performance computing researchers gathered at the International HPC Forum in Changsha, China. One of the talks detailed the specs for the new Tianhe-2 system, which as we reported last week, is expected to rather dramatically top the Top500 list of the world’s fastest supercomputers.
Artist's rendering of the system as it will look once finally implemented at its final destination.

As noted previously, the system will be housed at the National Supercomputer Center in Guangzhou and is aimed at providing an open platform for research and education as well as a high performance computing service for southern China.

Dr. Jack Dongarra from Oak Ridge National Lab, one of the founders of the Top500, was on hand for the event in China and shared a draft document that offers deep detail on the full scope of the Tianhe-2, which will, barring any completely unexpected surprises, far surpass the Cray-built Titan.

The 16,000-node Inspur-built Tianhe-2 is based on Ivy Bridge (32,000 sockets) and 48,000 Xeon Phi boards, meaning a total of 3,120,000 cores. Each of the nodes sports 2 Ivy Bridge sockets and 3 Phi boards.
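
As a quick sanity check on those totals, here is a minimal Python sketch using the per-chip core counts given elsewhere in the article (12-core Ivy Bridge parts, 57-core Phi cards):

ivy_cores = 32000 * 12   # 12-core Ivy Bridge sockets
phi_cores = 48000 * 57   # 57-core Xeon Phi cards (see below)
print(ivy_cores + phi_cores)   # 3,120,000 -- matches the quoted core count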

According to Dongarra, there are some new and notable LINPACK results:

I was sent results showing a run of the HPL benchmark using 14,336 nodes. That run used 50 GB of memory on each node and achieved 30.65 petaflops out of a theoretical peak of 49.19 petaflops, an efficiency of 62.3% of theoretical peak, taking a little over 5 hours to complete. The fastest result shown used 90% of the machine. They are expecting to make improvements and increase the number of nodes used in the test.
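
Checking the efficiency figure from the quoted run (a minimal Python sketch using only the numbers above):

rmax, rpeak = 30.65, 49.19   # petaflops, from the run described above
print(rmax / rpeak)          # ~0.623, i.e. the 62.3% efficiency figure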

This certainly seems to confirm that this will indeed be the top system on this June's list. But let's take a closer look at some architectural elements to put those numbers in context...

Interestingly, each of the Phi boards has 57 cores instead of 61. This is because they were early in the production cycle at the time and yield was an issue. Still, each of the 57 cores supports 4 threads of execution, and each thread can hit 4 flops per cycle. By Dongarra’s estimate, the 1.1 GHz cycle time produces a theoretical peak of 1.003 teraflops for each Phi element.
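
A rough peak-flops check from these figures; the 211.2 GFLOPS per 12-core, 2.2 GHz Ivy Bridge socket is the value quoted later in the thread and is taken here as an assumption:

phi_peak  = 57 * 4 * 4 * 1.1          # 1003.2 GFLOPS per Xeon Phi card
node_peak = 2 * 211.2 + 3 * phi_peak  # 3432 GFLOPS per node (2 sockets + 3 Phis)
print(phi_peak, node_peak)
print(14336 * node_peak / 1e6)        # ~49.2 PF peak for the 14,336-node HPL run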

Each node is laden with 64 GB of memory, and each of the Phi elements comes with 8 GB, for a total of 88 GB per node and roughly 1.404 petabytes of memory across the full system. There is not a lot of detail about the storage infrastructure, but there is a global shared parallel storage system sporting 12.4 petabytes.
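
The per-node and system memory totals work out as follows (a minimal sketch using the figures above):

node_mem = 64 + 3 * 8            # GB: host DRAM plus three 8 GB Phi cards
print(node_mem)                  # 88 GB per node
print(16000 * node_mem / 1e6)    # ~1.41 PB, close to the quoted 1.404 PB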

According to Dongarra, there are “2 nodes per board, 16 boards per frame, 4 frames per rack, and 125 racks make up the system.” He says that the compute board has two compute nodes and is composed of two halves, the CPM and the APM. The CPM half of the board contains the 4 Ivy Bridge processors, memory and 1 Xeon Phi board, while the APM half contains the other 5 Xeon Phi boards.
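
The node count follows directly from that packaging hierarchy:

print(2 * 16 * 4 * 125)   # 2 nodes/board x 16 boards/frame x 4 frames/rack x 125 racks = 16,000 nodes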




There are also 5 horizontal blind push-pull connections on the edge; connections from the Ivy Bridges to each of the coprocessors are made via PCI-E 2.0, with 16 lanes at 10 Gbps each. Dongarra points out that the board is actually designed and implemented for PCI-E 3.0, but the Phi only supports PCI-E 2.0. There is also a PCI-E connection to the NIC.
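
Taking the quoted per-lane figure at face value, the raw bandwidth of each CPU-to-coprocessor link is:

lanes, gbps_per_lane = 16, 10      # per-lane figure as quoted above
print(lanes * gbps_per_lane / 8)   # 20 GB/s raw per link, before encoding overhead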



We already knew that this was a system from the Chinese IT company, Inspur. According to Dongarra, “Inspur contributed to the manufacturing of the printed circuit boards and is also contributing to the system installation and testing.” At this point, the system is still being assembled and tested at the National University of Defense Technology before being installed at its permanent home.

As we know from the original Tianhe-1A system, NUDT has been hard at work on their own interconnects. On the TH-2, they are using their TH Express-2 interconnect network, which taps a fat tree topology with 13 switches, each with 576 ports at the top level.

As Dongarra notes, “This is an optoelectronics hybrid transport technology and runs a proprietary network. The interconnect uses their own chip set. The high radix router ASIC called NRC has a 90 nm feature size with a 17.16x17.16 mm die and 2577 pins.”

He says that “the throughput of a single NRC is 2.56 Tbps. The network interface ASIC, called NIC, has the same feature size and package as the NRC; the die size is 10.76x10.76 mm with 675 pins, and it uses PCI-E G2 16X. A broadcast operation via MPI was running at 6.36 GB/s, and the latency measured with 64K of data across 12,000 nodes is about 85us.”



Dongarra says that the 720 square meter footprint makes for a rather confined space and the system isn’t optimally laid out. However, this is just temporary: once it arrives at its permanent home in Guangzhou it will be laid out more efficiently, as seen in the artist’s rendering of the system at the top of the article.

The peak power consumption under load for the system is 17.6 MW, but this is just for the processors, memory and interconnect network. When the closely coupled chilled-water system with its customized liquid cooling units is added in, the total consumption is 24 MW. Dongarra says that it has a high cooling capacity of 80 kW and, when installed at its home site, it will use city water as its source. Power load is monitored by a series of lights on the cabinet doors.
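
A rough look at the cooling overhead implied by those two figures:

it_power, total_power = 17.6, 24.0   # MW, from the figures above
print(total_power - it_power)        # 6.4 MW drawn by the cooling plant
print(total_power / it_power)        # ~1.36, a rough PUE-style overhead ratio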

For far more details about these and other aspects of the Tianhe-2 system, check out Dr. Dongarra's extensive report...

http://www.netlib.org/utk/people ... dongarra-report.pdf
This is the detailed report; anyone who needs it, please download it...


A power draw of 24 MW: energy efficiency still looks like a serious problem. Keep in mind that BG/Q reaches this performance level with as much as 10 MW less (and that is with POWER BQC built on a 45 nm process). By the time we get to 100 PFLOPS, power consumption will become the bottleneck that limits the whole architecture.

This article looks very credible so far, and the figures in it are quite detailed. There is still half a month to go; let's see whether they can get the remaining 10% of the machine into the benchmark run within that time.
How many people did Intel send over for the tuning?
FeiTeng still couldn't take the next step forward.
壮东风 posted on 2013-6-3 14:35
FeiTeng still couldn't take the next step forward.
In 2015 they will switch to NUDT's own accelerator card, which will probably be more like a GPU; its efficiency will likely be even lower than Xeon Phi's, never mind compared with the efficiency of the vector extensions in Loongson and SW... A pity. NUDT has gone off chasing rankings...
花落庭院 posted on 2013-6-3 14:48
In 2015 they will switch to NUDT's own accelerator card, which will probably be more like a GPU; its efficiency will likely be even lower than Xeon Phi's, never mind compared with the efficiency of the vector ...
I can live with Intel accelerator cards, but it galls me that the CPU still couldn't push out the Xeon.
Besides, a CPU's vector extensions and a dedicated accelerator card are still different things, aren't they?
壮东风 posted on 2013-6-3 14:52
I can live with Intel accelerator cards, but it galls me that the CPU still couldn't push out the Xeon.
Besides, a CPU's vector extensions and a dedicated accelerator card are still different ...

System CPUs are basically all Intel or AMD now; that is dictated by the software environment. Even the Loongson 3C supercomputer will use x86 for its system CPUs. As for LINPACK efficiency: CPU vector extensions are more efficient than Xeon Phi, and far more efficient than GPUs. Fujitsu's SPARC vector-extension system (the K computer) reached 93% efficiency...
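
For reference, HPL efficiency is simply Rmax divided by Rpeak; the K computer values below are the commonly cited Top500 figures (10.51 PF Rmax, 11.28 PF Rpeak), not numbers taken from this thread:

print(10.51 / 11.28)   # ~0.932 -> the ~93% cited for the K computer
print(30.65 / 49.19)   # ~0.623 for the Tianhe-2 test run quoted earlier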

壮东风 posted on 2013-6-3 14:35
FeiTeng still couldn't take the next step forward.


The front-end processors of the compute nodes are 4,096 FT-1500 chips. The FT-1500 was developed by NUDT for Tianhe-1 and can be considered the biggest payoff of the Tianhe-1 project: a 16-core SPARC V9 processor on a 40 nm process, running at 1.8 GHz with a peak of 144 GFLOPS at 65 W. Compared with Intel's 22 nm, 12-core, 2.2 GHz Ivy Bridge at 211 GFLOPS, however, there is still a clear gap.
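
The per-core flops per cycle implied by those peak figures (taking the Ivy Bridge peak as 211.2 GFLOPS):

print(144 / (16 * 1.8))     # FT-1500: 5 flops per core per cycle
print(211.2 / (12 * 2.2))   # 12-core 2.2 GHz Ivy Bridge: 8 flops per core per cycle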

For the interconnect, Tianhe-2 uses the indigenously developed Express-2 internal interconnect network, with 13 switches, each with 576 ports. The links are electro-optical hybrids. The controller is an application-specific integrated circuit (ASIC) called NRC, built on a 90 nm process with a package size of 17.16x17.16 mm and 2,577 pins; a single NRC has a throughput of 2.56 Tbps. On the endpoint side, the network interface uses a similarly built but smaller NIC chip: 10.76x10.76 mm with 675 pins, connected over PCIe 2.0 with a transfer rate of 6.36 GB/s. Latency is low as well, only about 85 us across 12,000 nodes.



Chinese version:
http://www.enet.com.cn/article/2013/0603/A20130603288400.shtml


EKW posted on 2013-6-3 15:53
The front-end processors of the compute nodes are 4,096 FT-1500 chips. The FT-1500 was developed by NUDT for Tianhe-1 and can be considered the ...


I don't care whether the material you posted is riddled with errors; just take that NRC chip: a 17x17 package bringing out 2,577 pins? Calling you ignorant would probably be flattering you... Go look up Intel's SNB processor: 2,011 pins, with a package measuring 52.5 mm x 45 mm...
hswz posted on 2013-6-3 16:05
Chinese version:
http://www.enet.com.cn/article/2013/0603/A20130603288
The link is dead...
By 2015 Xeon Phi will probably have moved to a 14 nm process. At that point, simply swapping out the accelerator cards could double the performance.
deam posted on 2013-6-3 16:41
By 2015 Xeon Phi will probably have moved to a 14 nm process. At that point, simply swapping out the accelerator cards could double the performance.
Does 14 nm get the clock up to 2.2 GHz?

花落庭院 posted on 2013-6-3 16:31
I don't care whether the material you posted is riddled with errors; just take that NRC chip: a 17x17 package bringing out 2,577 pins? Calling you ignorant would probably ...


http://www.enet.com.cn/article/2013/0603/A20130603288413_3.shtml
Go in and look for yourself. So apparently a certain someone thinks he is on a higher level than NUDT; not inviting that someone to design these chips must be the greatest blunder in NUDT's history.
花落庭院 posted on 2013-6-3 16:31
I don't care whether the material you posted is riddled with errors; just take that NRC chip: a 17x17 package bringing out 2,577 pins? Calling you ignorant would probably ...

In theory this is very hard to do even with BGA packaging and solder-ball contacts. The smallest BGA solder balls are 0.2 mm; 2,577 contacts would mean a ball-to-ball spacing of only about 0.25 mm, and the risk of soldering failure would be far too high. The smaller the balls, the higher the failure rate during ball attachment, because the amount of solder is so small that thermal expansion and contraction alone can break a joint, and a single bad contact turns the whole CPU into scrap.
17x17 mm is only about the size of a DRAM chip, and a DRAM package has only a hundred or so contacts on its underside; even that is hard to handle manually on a professional BGA rework station.

PS: Loongson has always used BGA packaging, with the CPU soldered directly onto the motherboard, but that is only a few hundred contacts.
EKW posted on 2013-6-3 16:48
http://www.enet.com.cn/article/2013/0603/A20130603288413_3.shtml
Go in and look for yourself. So apparently a certain someone thinks his own level ...
Heh, so NUDT really is that mighty: Intel needs 2,362 mm2 to bring out 2,011 pins, and even the whole package area is no more than 2,600 mm2, while NUDT brings out nearly 2,600 pins with one eighth of that area. Impressive...
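
For what it is worth, the area arithmetic in this post checks out, with the caveat raised later in the thread that 17.16 mm is the NRC die size rather than its package size:

snb_package = 52.5 * 45       # ~2362.5 mm^2 for the LGA2011 package cited above
nrc_die     = 17.16 * 17.16   # ~294.5 mm^2 -- the NRC figure, later identified as die size
print(snb_package / nrc_die)  # ~8.0, the "one eighth of the area" in the post
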
花落庭院 posted on 2013-6-3 16:45
Does 14 nm get the clock up to 2.2 GHz?

Not necessarily just a clock increase. 14 nm is a generation and a half beyond 22 nm, the first leapfrog node change at Intel's fabs in twenty years.
失落的天堂 posted on 2013-6-3 16:51
In theory this is very hard to do even with BGA packaging and solder-ball contacts. The smallest BGA solder balls are 0.2 mm; 2,577 contacts would mean a ball-to-ball ...
Even BGA packaging can't do it; nobody has that capability. To bring out 2,577 pins you need a 51x51 grid of balls, so 17/51 = 0.33 mm pitch, and realistically 0.5 mm is the minimum. The 0.2 mm BGA from the Institute of Microelectronics only fits 1,156 pins into a 40 mm x 40 mm package.
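
A minimal sketch of that pitch arithmetic:

import math
pins = 2577
grid = math.ceil(math.sqrt(pins))   # at least a 51 x 51 ball grid
print(grid, 17.16 / grid)           # 51 balls per side, ~0.34 mm pitch on a 17.16 mm edge
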
deam posted on 2013-6-3 16:57
Not necessarily just a clock increase. 14 nm is a generation and a half beyond 22 nm, the first leapfrog node change at Intel's fabs in twenty years.
More cores, then?
花落庭院 posted on 2013-6-3 16:53
Heh, so NUDT really is that mighty: Intel needs 2,362 mm2 to bring out 2,011 pins, and even the whole package area is no more than 2,600 mm2, while NUDT ...
Whether or not NUDT is "mighty", at least this chip has been running perfectly well in Tianhe-1, unlike a certain someone who only stands around here sneering.
花落庭院 posted on 2013-6-3 16:59
More cores, then?
Possibly.
The link is dead...
Link fixed.
EKW posted on 2013-6-3 15:53
The front-end processors of the compute nodes are 4,096 FT-1500 chips. The FT-1500 was developed by NUDT for Tianhe-1 and can be considered the ...
Impressive specs for the switch chip, but why give only an aggregate throughput figure?

No idea how much it can actually do for port-to-port switching.

失落的天堂 posted on 2013-6-3 16:51
In theory this is very hard to do even with BGA packaging and solder-ball contacts. The smallest BGA solder balls are 0.2 mm; 2,577 contacts would mean a ball-to-ball ...


The material states it plainly: Die size: 17.16x17.16 mm, Package: FC-PBGA, Pin: 2577

The silicon die size and the package size are not the same concept.

A power draw of 24 MW: energy efficiency still looks like a serious problem. Keep in mind that BG/Q reaches this performance level with as much as 10 MW less (and that is with POWER BQC built on a 45 nm process)...


The performance achieved was 30.65 Pflop/s or 1.935 Gflop/Watt.
Tianhe-2 phase one currently delivers a performance-per-watt of 1.935 Gflop/W.

The Top 5 systems on the Top 500 list have the following Gflops/Watt efficiency.
The Top 5 systems on the previous Top500 list have the following Gflops/Watt efficiency:





[Attachment tp5.GIF (31.29 KB, uploaded 2013-6-3 19:00): the Gflops/Watt table for the Top 5 systems]



Counted as a general-purpose CPU, the FT-1500 works out to 5 flops per cycle; add the mention of 4-wide SIMD and I can't quite make sense of it, impressive as it sounds. Is there heterogeneity inside the CPU?
html posted on 2013-6-3 17:49
The material states it plainly: Die size: 17.16x17.16 mm, Package: FC-PBGA, Pin: 2577

The silicon die size and the package ...
The English original does indeed say die area, but the post at floor 14 translated it as package size; I was only looking at the floor-14 post just now and missed that.
匿名用户 posted on 2013-6-3 19:00
The performance achieved was 30.65 Pflop/s or 1.935 Gflop/Watt.
Tianhe-2 phase one currently delivers a performance-per-watt of 1 ...
1.935 Gflop/W? That figure can't be right: Tianhe-2's LINPACK Rmax is 30 PF and the power draw is 24 MW.
IBM's BG/Q has far too big an advantage here: on a process 2-3 generations behind everyone else (45 nm vs 22/14 nm), it still reaches the same energy efficiency.
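
For comparison, Sequoia (a 96-rack BG/Q machine) is listed at roughly 17.17 PF Rmax and 7.89 MW on the Top500 lists of that period; these are outside figures, not numbers from this thread:

print(17.17 / 7.89)   # ~2.18 GFLOPS/W (petaflops divided by megawatts gives GFLOPS/W)
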
花落庭院 posted on 2013-6-3 16:58
Even BGA packaging can't do it; nobody has that capability. To bring out 2,577 pins you need a 51x51 grid of balls, so 17/51 = 0.33 mm, and even at the minimum you need 0. ...
The post at floor 14 got the translation wrong; the English original says die size.
失落的天堂 posted on 2013-6-3 20:33
1.935 Gflop/W? That figure can't be right: Tianhe-2's LINPACK Rmax is 30 PF and the power draw is 24 MW.
IBM's BG/Q has far too big an advantage here ...
I don't see what advantage BG/Q has. Xeon Phi has the x86 instruction set with 2-issue in-order cores and 512-bit vectors; BG/Q is POWER instructions(?) with 2-issue in-order cores and 256-bit vectors...

oldwatch posted on 2013-6-3 17:42
Impressive specs for the switch chip, but why give only an aggregate throughput figure?

No idea how much it can actually do for port-to-port switching.


Compared with earlier domestic products this chip is a big step forward, but it is still well behind the strongest chips internationally: several years ago the SwitchX chip used in Mellanox's InfiniBand FDR switches already had a switching capacity as high as 4-5 Tb/s, and reportedly it wasn't even the strongest in the world.
花落庭院 posted on 2013-6-3 20:47
I don't see what advantage BG/Q has. Xeon Phi has the x86 instruction set with 2-issue in-order cores and 512-bit vectors; BG/Q is POWER instructions(?) with 2-issue in-order cores and 256-bit vectors ...
I mean in energy efficiency. If both were on the same 22 nm process, one Xeon Phi would probably draw as much power as 10-plus POWER BQC processors, yet its performance would fall far short of those POWER BQC chips...
失落的天堂 posted on 2013-6-3 20:33
1.935 Gflop/W? That figure can't be right: Tianhe-2's LINPACK Rmax is 30 PF and the power draw is 24 MW.
IBM's BG/Q has far too big an advantage here ...
The 24 MW is the total after adding in the power of the water-cooling system.
Performance-per-watt is calculated from the system's own power only; for Tianhe-2 that is 17 MW.
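
A rough cross-check, noting that petaflops divided by megawatts gives GFLOPS per watt; the interpretation in the final comment is only a guess:

print(30.65 / 24.0)   # ~1.28 GFLOPS/W against the 24 MW total including cooling
print(30.65 / 17.6)   # ~1.74 GFLOPS/W against the 17.6 MW compute + interconnect figure
# Both come out below the 1.935 GFLOPS/W quoted earlier, so that number presumably
# reflects the power measured during the HPL run rather than the peak figure.
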
I don't see what advantage BG/Q has. Xeon Phi has the x86 instruction set with 2-issue in-order cores and 512-bit vectors; BG/Q is POWER instructions(?) with 2-issue in-order cores and 256-bit vectors ...

Xeon Phi has two drawbacks: its vector unit is not compatible with AVX/SSE2, and it has too little memory.

Also, it is a single 512-bit unit with FMA.
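
For context, the per-core double-precision peak of the two designs, using public chip documentation rather than anything stated in this thread: Knights Corner's 512-bit VPU is 8 DP lanes with FMA, while BG/Q's QPX unit is 4 DP lanes with FMA.

print(8 * 2)   # Knights Corner: 16 flops/cycle per core
print(4 * 2)   # BG/Q (PowerPC A2): 8 flops/cycle per core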

deam posted on 2013-6-3 22:59
Xeon Phi has two drawbacks: its vector unit is not compatible with AVX/SSE2, and it has too little memory.

Also, it is a single 512-bit unit with FMA.


By "2-way" I meant 2-issue. And what do you mean by "a single 512-bit unit"? Xeon Phi actually feels somewhat heterogeneous inside the core, which is different from BG/Q, where the two floating-point units are directly extended into 128-bit vectors. From the block diagram I posted you can see that in Xeon Phi the vector unit and the x87 unit are separate, whereas in the diagrams for BG/Q, Loongson, POWER7, Core, Fujitsu, SW and so on the FPU itself is vector-extended. Once you pin down the width of Phi's registers and data paths you know how many "ways" it is. Using 512-bit registers with a 512-bit data path is very hard; stacking two 256-bit halves is easier. I haven't found register or data-path figures for Phi. Loongson 3B is also described as having a 512-bit vector unit (I posted the paper), so until I see 512-bit figures for Phi's registers and data path I won't make claims about its parameters.
Now look at the block diagram and description of the BG/Q processor:

失落的天堂 posted on 2013-6-3 21:12
Compared with earlier domestic products this chip is a big step forward, but it is still well behind the strongest chips internationally: several years ago the SwitchX chip used in Mellanox's Infi ...
Is that 4 TB/s or 4 Tbps? Is there a link?
By "2-way" I meant 2-issue. And what do you mean by "a single 512-bit unit"? Xeon Phi actually feels somewhat heterogeneous inside the core, which is different from BG/Q's ...
Knights Corner has one 512-bit vector unit (by contrast, Haswell has two 256-bit ones).
deam posted on 2013-6-4 08:02
Knights Corner has one 512-bit vector unit (by contrast, Haswell has two 256-bit ones).
I'm not saying Knights Corner can't be a single 512-bit unit; I'm saying there isn't enough evidence that it is one, e.g. that it uses 512-bit vector registers and a 512-bit data path... By contrast, Loongson 3B and Haswell have 256-bit registers and 256-bit data paths, and in Haswell and Loongson it is the FPU that is vector-extended, whereas Knights Corner feels heterogeneous, because the core contains both a VPU and an x87 unit.
花落庭院 posted on 2013-6-4 07:09
Is that 4 TB/s or 4 Tbps? Is there a link?
A typo; it is 4 Tb/s, that is, 4 Tbps. Already corrected.
That said, the SwitchX IC dates back to 2009. Mellanox has since released SwitchX-2, which doesn't publish detailed figures but is certainly stronger (going by their past rate of improvement, throughput is probably 6 to 8 Tbps). Before that there was InfiniScale IV, with 2.88 Tbps of throughput.

Note the key phrase: Single-Chip Implementation
http://www.mellanox.com/page/pro ... mp;mtag=switchx_vpi