Inside the custom CPU of Tesla's Dojo supercomputer
2021-08-24 15:53:46 【Shared by: Lu Kai】
Tesla held its AI Day and revealed the inner workings of its software and hardware infrastructure. Part of this disclosure was the previously announced Dojo AI training chip. Tesla claims its D1 Dojo chip has GPU-level compute, CPU-level flexibility, and network-switch-level IO. A few weeks ago, we guessed that the packaging for this system was TSMC's Integrated Fan-Out System on Wafer (InFO_SoW). We explained the benefits of this type of packaging, along with the cooling and power consumption involved in a training chip of this scale. We also estimated that the performance of this package would exceed Nvidia's systems. All of this appears to have been valid speculation based on the disclosure. Today, we dig deeper into the specifics of the semiconductors that were revealed.
Let's first talk about the evaluation infrastructure. Tesla continually retrains and improves its neural networks. It evaluates every code change to see whether there is an improvement. Thousands of identical chips are deployed in cars and in servers, and they run millions of evaluations every week.
Tesla has been expanding the size of its GPU clusters for years. If Tesla stopped all real workloads, ran Linpack, and submitted the result to the Top500 list, its current training cluster would be the fifth-largest supercomputer. That growth in performance is not enough for Tesla and its ambitions, so a few years ago it started developing its own chip: the Dojo project. Tesla needs more performance to enable larger and more complex neural networks in a power-efficient and cost-effective way.
Tesla's architectural solution is a distributed compute architecture. As we listened to the details, this architecture seemed very similar to Cerebras. We have previously analyzed the Cerebras Wafer Scale Engine and its architecture. Every AI training architecture is laid out like this, but the details of the compute elements, the network, and the fabric vary widely. The biggest problem with these types of networks is scaling up bandwidth while keeping latency low. To scale to larger networks, Tesla focused especially on those two things. This influenced every part of the design, from the chip fabric to the packaging.
Functional units are designed to be traversable in 1 clock cycle, yet large enough that synchronization overhead and software do not dominate the problem. As a result, they arrived at a design almost exactly like Cerebras's: a mesh of individual units connected by a high-speed fabric that routes communication between functional units in 1 clock. Each individual unit has a large 1.25MB SRAM scratchpad and multiple superscalar CPU cores with SIMD capabilities, plus matrix multiply units that support all common data types. In addition, they introduced a new data type called CFP8, configurable floating point 8. Each unit is capable of 1 TFLOPS of BF16 or CFP8, 64 GFLOPS of FP32, and 512GB/s of bandwidth in each direction.
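To get a feel for what one-clock hops mean on a mesh, here is a minimal sketch. It assumes dimension-order routing on a plain 2D grid and uses the 2GHz clock Tesla cites for its lab runs; the coordinates, the routing scheme, and the function names are illustrative assumptions, not details Tesla disclosed.

```python
# Hypothetical illustration: hop count and latency between two units on a 2D mesh.
# Assumes dimension-order (X then Y) routing and 1 clock per hop.
CLOCK_HZ = 2e9  # 2GHz, the clock Tesla claims for lab runs


def mesh_hops(src, dst):
    """Number of hops between two (row, col) units (Manhattan distance)."""
    (r1, c1), (r2, c2) = src, dst
    return abs(r1 - r2) + abs(c1 - c2)


def mesh_latency_ns(src, dst, cycles_per_hop=1, clock_hz=CLOCK_HZ):
    """Routing latency in nanoseconds, ignoring contention and serialization."""
    ns_per_cycle = 1e9 / clock_hz
    return mesh_hops(src, dst) * cycles_per_hop * ns_per_cycle


hops = mesh_hops((0, 0), (3, 5))
latency_ns = mesh_latency_ns((0, 0), (3, 5))
print(f"{hops} hops, {latency_ns} ns")
```

Under these assumptions latency grows linearly with distance across the mesh, which is why keeping each hop to a single clock matters as the mesh gets larger.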
The CPU is no slouch: it is 4-wide, with 2-wide vector pipelines. Each core can host 4 threads to maximize utilization. Unfortunately, Tesla went with a custom ISA rather than building on an open-source ISA such as RISC-V. This custom ISA introduces transpose, gather, broadcast, and link traversal instructions.
The full chip of 354 functional units reaches 362 TFLOPS of BF16 or CFP8 and 22.6 TFLOPS of FP32. It is 645mm^2 in total with 50 billion transistors. Each chip has a staggering 400W TDP. This means the power density is higher than most configurations of the Nvidia A100 GPU. Interestingly, Tesla achieved an effective transistor density of 77.5 million transistors per mm^2. This is higher than every other high-performance chip and is beaten only by mobile chips and the Apple M1.
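The chip-level figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted in the article (354 units, 64 GFLOPS FP32 per unit, 362 TFLOPS BF16/CFP8, 645mm^2, 50 billion transistors, 400W); the variable names are our own.

```python
# Cross-check of the D1 chip figures quoted in the article.
UNITS_PER_CHIP = 354
FP32_GFLOPS_PER_UNIT = 64
CHIP_BF16_TFLOPS = 362
DIE_AREA_MM2 = 645
TRANSISTORS = 50e9
TDP_W = 400

# FP32 throughput summed over all functional units.
fp32_tflops = UNITS_PER_CHIP * FP32_GFLOPS_PER_UNIT / 1000

# BF16/CFP8 throughput per unit implied by the chip total.
bf16_tflops_per_unit = CHIP_BF16_TFLOPS / UNITS_PER_CHIP

# Transistor density in millions per mm^2, and power density in W per mm^2.
density_m_per_mm2 = TRANSISTORS / DIE_AREA_MM2 / 1e6
power_density_w_per_mm2 = TDP_W / DIE_AREA_MM2

print(f"FP32: {fp32_tflops:.1f} TFLOPS")
print(f"BF16/CFP8 per unit: {bf16_tflops_per_unit:.3f} TFLOPS")
print(f"Density: {density_m_per_mm2:.1f} M transistors/mm^2")
print(f"Power density: {power_density_w_per_mm2:.2f} W/mm^2")
```

The FP32 and density results line up with the quoted 22.6 TFLOPS and 77.5 million per mm^2. The implied per-unit BF16 figure comes out slightly above 1 TFLOPS, which suggests the "1 TFLOPS per unit" number is rounded.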
Another interesting aspect of the basic functional unit is the NOC router. It scales within and between chips in a way very similar to Tenstorrent, whose architecture we have analyzed separately. It is no surprise that Tesla arrived at an architecture similar to that of other well-regarded AI startups. Tenstorrent is very focused on scale-out training, and Tesla is very focused on the same problem.
On chip, Tesla has a staggering 10TBps of directional bandwidth, but this number does not mean much in real workloads. One big advantage Tesla has over Tenstorrent is that its chip-to-chip bandwidth is significantly higher. They have 576 SerDes at 112GT/s. This yields a total of 64Tb/s, or 8TB/s, of bandwidth.
We are not sure where Tesla gets its claim of 4TB/s off each edge. It is more likely the figure for the X axis and the figure for the Y axis. Never mind the confusing slide, the bandwidth of this chip is insane. The highest external bandwidth chips known are 32Tb/s network switch chips. Tesla was able to double that with a huge number of SerDes and advanced packaging.
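The SerDes arithmetic behind these numbers is short enough to write out. This sketch treats 112GT/s as 112 Gb/s per lane, which is an assumption (one bit per transfer, no encoding overhead), and then splits the total two ways to show why a per-axis reading of the 4TB/s figure fits better than a per-edge one.

```python
# Off-chip bandwidth from the SerDes count quoted in the article.
SERDES_LANES = 576
GBIT_PER_LANE = 112  # assumption: 112GT/s carries 112 Gb/s per lane

total_tbit_s = SERDES_LANES * GBIT_PER_LANE / 1000   # terabits per second
total_tbyte_s = total_tbit_s / 8                     # terabytes per second

# Two ways to split the total: across 4 edges, or across 2 axes (X and Y).
per_edge_tbyte_s = total_tbyte_s / 4
per_axis_tbyte_s = total_tbyte_s / 2

print(f"Total: {total_tbit_s:.1f} Tb/s = {total_tbyte_s:.2f} TB/s")
print(f"Per edge (4 edges): {per_edge_tbyte_s:.2f} TB/s")
print(f"Per axis (X, Y):    {per_axis_tbyte_s:.2f} TB/s")
```

Split evenly over four edges, the total gives about 2TB/s per edge. Split over two axes, it gives about 4TB/s, matching the number on Tesla's slide.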
Tesla connects the compute plane of Dojo chips to interface processors, which connect to the host system over PCIe 4.0. These interface processors also enable higher-radix network connections that supplement the existing compute plane mesh.
25 D1 chips are packaged using a "fan out wafer process" into what is called a training tile. Tesla did not confirm that this packaging is TSMC's Integrated Fan-Out System on Wafer (InFO_SoW), as we guessed a few weeks ago, but given the insane chip-to-chip bandwidth and the fact that they specifically said fan-out wafer, it seems highly likely.
Tesla developed a proprietary high-bandwidth connector that preserves the off-chip bandwidth between these tiles. Each tile has an impressive 9 PFLOPS of BF16/CFP8 and 36 TB/s of off-tile bandwidth. This far exceeds the off-chip bandwidth of Cerebras, and it lets the Tesla system scale out even better than scale-out designs such as the Tenstorrent architecture.
The power delivery is unique, custom, and also extremely impressive. With this much bandwidth and more than 10kW of power consumption, Tesla innovated on power delivery and feeds power in vertically. The custom voltage regulator module is reflowed directly onto the fan-out wafer. Power, thermal, and mechanical all interface directly with the tile.
Even though the chips themselves total only 10kW, the power of the whole tile appears to be 15kW. Power delivery, IO, and wafer wiring also consume a lot of power. Power comes in from the bottom and heat goes out the top. The chip is not Tesla's unit of scale; the 25-chip tile is. This tile far exceeds anything from Nvidia, Graphcore, Cerebras, Groq, Tenstorrent, SambaNova, or any other AI training startup in per-unit performance and scalability.
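The tile-level compute and power figures follow directly from the per-chip numbers. The sketch below multiplies them out using only values quoted in the article; the overhead share at the end is our own derived figure, not something Tesla stated.

```python
# Tile-level totals derived from the per-chip figures in the article.
CHIPS_PER_TILE = 25
CHIP_BF16_TFLOPS = 362
CHIP_TDP_W = 400
TILE_POWER_KW = 15  # whole-tile power as reported

tile_pflops = CHIPS_PER_TILE * CHIP_BF16_TFLOPS / 1000
chip_power_kw = CHIPS_PER_TILE * CHIP_TDP_W / 1000

# Power not attributable to the chips: delivery losses, IO, wafer wiring.
overhead_kw = TILE_POWER_KW - chip_power_kw
overhead_share = overhead_kw / TILE_POWER_KW

print(f"Tile compute: {tile_pflops:.2f} PFLOPS")
print(f"Chip power: {chip_power_kw:.0f} kW, overhead: {overhead_kw:.0f} kW "
      f"({overhead_share:.0%} of tile power)")
```

By this count, roughly a third of the tile's power budget goes to something other than the 25 chips.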
All of this sounds like far-off technology, but Tesla claims it is already running at 2GHz on real AI networks in its lab.
The next step in scaling to thousands of chips is at the server level. Dojo is scaled to a 2 by 3 tile configuration, with two of these configurations in a server cabinet. For those counting at home, that is 12 tiles per cabinet, for a total of 108 PFLOPS per cabinet. Each server cabinet has more than 100,000 functional units, 400,000 custom cores, and 132GB of SRAM, which is a staggering number.
Tesla keeps scaling further, beyond the cabinet level, within its mesh. There is no breakdown in chip-to-chip bandwidth. It is a single homogeneous mesh of chips with insane bandwidth. They plan to scale up to 10 cabinets and 1.1 exaFLOPS: 1,062,000 functional units, 4,248,000 cores, and 1.33TB of SRAM.
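The cabinet and 10-cabinet totals can be reproduced from the building blocks. In this sketch the figure of 4 cores per functional unit is an assumption inferred from the article's ratio of cores to units, and the 9 PFLOPS per tile is the rounded number quoted earlier; the function and key names are illustrative.

```python
# Scaling the quoted building blocks up to a cabinet and to 10 cabinets.
UNITS_PER_CHIP = 354
CHIPS_PER_TILE = 25
TILES_PER_CABINET = 12   # two 2-by-3 tile arrangements
CORES_PER_UNIT = 4       # assumption inferred from the article's core/unit ratio
SRAM_MB_PER_UNIT = 1.25
TILE_PFLOPS = 9          # rounded per-tile figure from the article


def dojo_totals(cabinets):
    """Aggregate units, cores, SRAM, and BF16/CFP8 compute for a cabinet count."""
    tiles = TILES_PER_CABINET * cabinets
    units = UNITS_PER_CHIP * CHIPS_PER_TILE * tiles
    return {
        "units": units,
        "cores": units * CORES_PER_UNIT,
        "sram_gb": units * SRAM_MB_PER_UNIT / 1000,
        "pflops": tiles * TILE_PFLOPS,
    }


cabinet = dojo_totals(1)
pod = dojo_totals(10)
print(cabinet)
print(pod)
```

One cabinet works out to 106,200 units, 424,800 cores, 132.75GB of SRAM, and 108 PFLOPS. Ten cabinets give 1,062,000 units, 4,248,000 cores, about 1.33TB of SRAM, and 1,080 PFLOPS, which rounds to the quoted 1.1 exaFLOPS.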
The software is interesting, but we will not go too deep into it today. They claim they can virtually subdivide the system. They say software can scale seamlessly across Dojo Processing Units (DPUs) regardless of cluster size. The Dojo compiler takes care of fine-grained parallelism and maps networks across the hardware compute plane. It can do this with data, model, and graph parallelism, and it can also optimize to reduce memory footprint.
Model parallelism can span chip boundaries, easily unlocking the next level of AI models with trillions of parameters and beyond. Large batch sizes are not even required. They do not have to rely on handwritten code to run models on this massive cluster.
Putting it all together, at cost parity with an Nvidia GPU, Tesla claims it can achieve 4 times the performance, 1.3 times the performance per watt, and a 5 times smaller footprint. Tesla's TCO advantage over Nvidia's AI solutions is nearly an order of magnitude. If their claims are true, Tesla has one-upped everyone in AI hardware and software. I am skeptical, but this is also a hardware geek's dream.
Author: [Shared by: Lu Kai]. Please include the original link when reprinting, thank you.