
Inside the custom chip of Tesla's Dojo supercomputer

2021-08-24 15:53:46 Shared by: Lu Kai

Tesla held their AI Day, revealing the inner workings of their software and hardware infrastructure. Part of this disclosure was the previously announced Dojo AI training chip. Tesla claims their D1 Dojo chip has GPU-level compute, CPU-level flexibility, and network-switch-level IO. A few weeks ago, we guessed that the system's package was TSMC's Integrated Fan-Out System on Wafer (InFO_SoW). We explained the benefits of that type of packaging, and the cooling and power delivery involved in such a large-scale training chip. We also estimated that the performance of this package would exceed Nvidia's systems. All of that speculation appears to have been validated by the disclosure. Today, we will dig deeper into the specific details of the semiconductors revealed.

Let's first talk about the evaluation infrastructure. Tesla continuously retrains and improves their neural networks, evaluating every code change to see whether it brings an improvement. Thousands of the same chips that ship in cars are deployed in servers, running millions of evaluations every week.

Tesla has been growing the size of its GPU clusters for years. If Tesla halted all real workloads, ran Linpack, and submitted the result to the Top500 list, their current training cluster would rank as the fifth-largest supercomputer. Even that scale of performance is not enough for Tesla's ambitions, so a few years ago they began developing their own chip: the Dojo project. Tesla needs more performance to train larger, more complex neural networks, and to do so efficiently and at low cost.

Tesla's architectural answer is a distributed compute architecture. As the details emerged, the architecture sounded very similar to Cerebras; here is our analysis of the Cerebras Wafer Scale Engine and its architecture. Every AI training architecture is laid out this way, but the compute elements, network, and fabric details vary enormously. The biggest problem with these kinds of networks is scaling bandwidth while keeping latency low. To scale to larger networks, Tesla paid special attention to both. This influenced every part of their design, from chip architecture to packaging.

Functional units are designed to be traversed in one clock cycle, yet to be large enough that synchronization overhead and software do not dominate the problem. The result is a design almost identical to Cerebras: a mesh of individual units connected by a high-speed fabric, with communication between functional units routed within one clock. Each individual unit has a large 1.25MB SRAM scratchpad and a CPU core with superscalar SIMD pipelines and matrix multiplication units supporting all common data types. They also introduced a new data type called CFP8, configurable floating point 8. Each unit achieves 1TFlop of BF16 or CFP8, 64GFlops of FP32, and 512GB/s of bandwidth in each direction.
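
As a quick back-of-envelope check (our own sketch; the MAC array shape below is an assumption, not a disclosed spec), the per-unit number is consistent with a modest matrix engine at the 2GHz clock Tesla mentions later:

```python
# Implied per-unit MAC throughput at 2 GHz (our back-of-envelope check).
clock_hz = 2e9                      # the 2 GHz lab clock Tesla quotes later
unit_bf16_flops = 1e12              # 1 TFlop of BF16/CFP8 per functional unit

flops_per_cycle = unit_bf16_flops / clock_hz   # 500 flops per cycle
macs_per_cycle = flops_per_cycle / 2           # 1 MAC = 2 flops -> ~250 MACs

print(flops_per_cycle, macs_per_cycle)
# ~250 MACs/cycle would be consistent with, say, a hypothetical 16x16 MAC
# array (256 MACs/cycle, i.e. 1.024 TFlops at 2 GHz). The array shape is
# our assumption, not something Tesla disclosed.
```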

The CPU is no slouch either: it is 4-wide scalar and 2-wide on the vector pipes. Each core carries 4 threads to maximize utilization. Unfortunately, Tesla uses a custom ISA rather than building on an open-source ISA such as RISC-V. This custom ISA introduces transpose, gather, broadcast, and link-traversal instructions.

The complete chip of 354 functional units reaches 362TFlops of BF16 or CFP8 and 22.6TFlops of FP32. It totals 645mm^2 and 50 billion transistors, and each chip's TDP reaches an astonishing 400W. That means the power density is higher than most configurations of the Nvidia A100 GPU. Interestingly, Tesla achieved an effective density of 77.5 million transistors per mm^2. This is higher than all other high-performance chips, beaten only by mobile chips and Apple's M1.
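
The quoted figures hang together; here is our quick arithmetic check:

```python
# Sanity-checking the D1 chip figures quoted above (our arithmetic).
units = 354
unit_bf16_tflops = 1.0              # ~1 TFlop BF16/CFP8 per functional unit
unit_fp32_gflops = 64.0
transistors = 50e9                  # 50 billion
die_area_mm2 = 645

print(units * unit_bf16_tflops)         # ~354 TFlops; Tesla quotes 362,
                                        # implying ~1.02 TFlops per unit
print(units * unit_fp32_gflops / 1e3)   # ~22.7 TFlops FP32 (quoted: 22.6)
print(transistors / die_area_mm2 / 1e6) # ~77.5M transistors per mm^2
```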

Another interesting aspect of the basic functional unit is the NOC router. It scales within and between chips in a way very similar to Tenstorrent; linked here is our analysis of that architecture. That Tesla arrived at a framework similar to a well-known AI startup is no surprise: Tenstorrent is laser-focused on large-scale training, and Tesla is paying close attention to the same problem.

On chip, Tesla has an astonishing 10TB/s of directional bandwidth, but that number doesn't mean much for real workloads. Compared with Tenstorrent, one big advantage of Tesla is dramatically higher chip-to-chip bandwidth: 576 SerDes lanes running at 112GT/s. That works out to 64Tb/s, or 8TB/s, of total bandwidth.

We're not sure where Tesla got its claimed 4TB/s per edge; it is more likely the X-axis and Y-axis numbers combined. Confusing slide aside, this chip's bandwidth is crazy. The highest-bandwidth external-IO chips known are 32Tb/s network switch chips; Tesla doubled that with a huge number of SerDes and advanced packaging.
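
Here is our arithmetic on those SerDes figures, assuming raw line rate with no encoding overhead and lanes split evenly across the four die edges:

```python
# Chip-to-chip bandwidth from the SerDes count (our arithmetic; assumes raw
# line rate and an even split of lanes across the four die edges).
lanes = 576
rate_gbps = 112                     # 112 GT/s, ~1 bit per transfer per lane

total_tbps = lanes * rate_gbps / 1e3
print(total_tbps)                   # ~64.5 Tb/s, i.e. ~8 TB/s total

per_edge_TBps = total_tbps / 4 / 8
print(per_edge_TBps)                # ~2 TB/s per edge, so the quoted 4 TB/s
                                    # per edge looks like the X and Y axis
                                    # numbers combined, as suggested above
```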

Tesla connects the Dojo chip's compute plane to interface processors, which connect to the host system over PCIe 4.0. These interface processors also enable higher-radix network connectivity, supplementing the compute plane's existing mesh fabric.

25 D1 chips are packaged together with a "fan-out wafer process" into what is called a training tile. Tesla did not confirm that the package is TSMC's Integrated Fan-Out System on Wafer (InFO_SoW), as we guessed a few weeks ago, but given the crazy inter-chip bandwidth and the fact that they specifically say fan-out wafer, it seems highly likely.

Tesla developed a proprietary high-bandwidth connector that preserves the off-chip bandwidth between these tiles. Each tile has an impressive 9PFlops of BF16/CFP8 and 36TB/s of off-tile bandwidth. This far exceeds Cerebras' off-chip bandwidth, and Tesla's scaling capability even exceeds Tenstorrent's architecture and other scale-out designs.
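
The tile math lines up with the per-chip numbers (our arithmetic; the per-edge breakdown in the final comment is our speculation, not a disclosed figure):

```python
# Training tile throughput from the per-chip figures (our arithmetic).
chips_per_tile = 25                 # a 5 x 5 grid of D1 chips
chip_bf16_tflops = 362

print(chips_per_tile * chip_bf16_tflops / 1e3)  # ~9.05 PFlops per tile

# The 36 TB/s off-tile figure is also roughly consistent with the 5 chips on
# each tile edge exposing ~2 TB/s apiece (5 x 2 TB/s x 4 edges = 40 TB/s raw,
# before overhead). That breakdown is our speculation, not a disclosed spec.
```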

Power delivery is unique, custom, and extremely impressive. With so much bandwidth and more than 10KW of power consumption, Tesla innovated in power delivery, bringing it in vertically: custom voltage regulator modules are reflowed directly onto the fan-out wafer. Power, thermal, and mechanical interfaces all connect directly to the tile.

Even though the chips themselves total only 10KW, the whole tile appears to draw 15KW; power delivery, IO, and the wafer routing consume considerable power as well. Power enters from the bottom and heat exits from the top. The chip is not Tesla's unit of scale; the 25-chip tile is. In per-unit performance and scalability, it far exceeds Nvidia, Graphcore, Cerebras, Groq, Tenstorrent, SambaNova, or any other AI training startup.
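
The implied overhead is easy to work out (our arithmetic):

```python
# Tile power budget implied by the figures above (our arithmetic).
chip_tdp_w = 400
chips_per_tile = 25
tile_power_w = 15_000

silicon_w = chip_tdp_w * chips_per_tile
print(silicon_w)                    # 10,000 W of D1 silicon
print(tile_power_w - silicon_w)     # ~5,000 W, roughly a third of the tile
                                    # budget, left for power delivery, IO,
                                    # and wafer routing
```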

All of this may sound like far-off technology, but Tesla claims they already have it running a real AI network at 2GHz in the lab.

The next step in scaling to thousands of chips is at the server level. Dojo scales to a 2 by 3 tile configuration, with two such configurations per server cabinet. For those keeping score at home, that is 12 tiles per cabinet, for a total of 108PFlops per cabinet. Each server cabinet holds more than 100,000 functional units, over 400,000 custom cores, and 132GB of SRAM: staggering numbers.

Tesla continues scaling its mesh up to the cabinet level. The chip-to-chip bandwidth is not subdivided: it is one homogeneous network of chips with crazy bandwidth. They plan to scale to 10 cabinets and 1.1 Exaflops: 1,062,000 functional units, 4,248,000 cores, and 1.33TB of SRAM.
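
Here is our arithmetic tying the cabinet and 10-cabinet figures back to the per-chip numbers:

```python
# Cabinet and 10-cabinet totals from the per-chip figures (our arithmetic).
units_per_chip = 354
chips_per_tile = 25
tiles_per_cabinet = 12              # two 2x3 tile configurations per cabinet
cabinets = 10
tile_pflops = 9.0

print(tiles_per_cabinet * tile_pflops)            # 108 PFlops per cabinet
units_per_cabinet = units_per_chip * chips_per_tile * tiles_per_cabinet
print(units_per_cabinet)                          # 106,200 functional units
print(units_per_cabinet * 1.25 / 1e3)             # ~132.75 GB SRAM (1.25MB each)
print(units_per_cabinet * cabinets)               # 1,062,000 units in 10 cabinets
print(units_per_cabinet * cabinets * 4)           # 4,248,000 cores; the factor
                                                  # of 4 matches the 4 threads
                                                  # per core noted earlier
print(cabinets * tiles_per_cabinet * tile_pflops / 1e3)  # ~1.08 Exaflops
```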

The software is interesting, but we won't dig into it too deeply today. They claim the machine can be virtually subdivided, and that regardless of cluster size, software scales seamlessly across Dojo Processing Units (DPUs). The Dojo compiler takes care of fine-grained parallelism and maps the network across the hardware compute plane. It can do so through data, model, and graph parallelism, and it can also optimize to reduce memory consumption.
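
To make that concrete, here is a purely illustrative toy of the kind of partitioning such a compiler automates. Tesla's actual compiler and APIs are not public; the function below is our own minimal sketch of column-wise model parallelism:

```python
import numpy as np

def sharded_matmul(x, w, n_devices):
    """Toy model parallelism: split the weight matrix column-wise so each
    'device' holds one shard, computes locally, and results are gathered."""
    w_shards = np.split(w, n_devices, axis=1)     # one weight shard per device
    partials = [x @ shard for shard in w_shards]  # each device's local matmul
    return np.concatenate(partials, axis=1)       # gather across the mesh

x = np.random.randn(8, 64)
w = np.random.randn(64, 128)
assert np.allclose(sharded_matmul(x, w, 4), x @ w)  # matches unsharded result
```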

Model parallelism can cross chip boundaries, easily unlocking the next level of AI models with trillions of parameters or more. Large batch sizes aren't even required, and they don't have to rely on hand-written code to run models on this massive cluster.

Tying it all together: at a cost comparable to Nvidia GPUs, Tesla claims they achieve 4x the performance, 1.3x the performance per watt, and a 5x smaller footprint. Tesla's claimed TCO advantage over Nvidia's AI solutions is nearly an order of magnitude. If what they say is true, Tesla has leapfrogged everyone in AI hardware and software. I'm skeptical, but it is also a hardware geek's dream.
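
For a rough sense of how those multipliers might combine, here is a toy TCO model; the cost-split weights are our own illustrative assumptions, not Tesla's numbers:

```python
# Toy TCO comparison using the multipliers Tesla quoted (the weights below
# are our illustrative assumptions, not Tesla's actual cost breakdown).
perf = 4.0            # 4x performance at comparable hardware cost
perf_per_watt = 1.3   # 1.3x performance per watt
footprint = 5.0       # 5x smaller footprint

# Assumed shares of total cost of ownership: hardware, power, facility space.
w_hw, w_power, w_space = 0.6, 0.3, 0.1

rel_cost = w_hw / perf + w_power / perf_per_watt + w_space / (perf * footprint)
print(1 / rel_cost)   # ~2.6x advantage under these assumed weights; the
                      # "order of magnitude" claim must rest on more than
                      # these three multipliers alone
```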

Copyright notice
Author: Lu Kai. Please include a link to the original when reprinting. Thank you.
https://caren.inotgo.com/2021/08/20210824155336973x.html
