The only AI chip unicorn in the West, confronting Nvidia head-on

Produced by Tiger Sniff Technology Team

Author | Utada

Header image | Visual China

Starting from the fountain beside the A38 in Bristol, it takes less than 20 minutes by bike to “ride” out of the city center of this southwestern English city and reach outskirts lined with rows of English bungalows, shrubs and rivers.

Even though Bristol is the undisputed center of southwest England, its modest size has earned it an affectionate nickname among many Chinese students: “Bu Village” (“Outside London, everywhere is a village.”)

Yet only after we began covering the chip industry did we discover that this old English city hides one of the most powerful semiconductor clusters in the UK.

Pictured: Nvidia’s R&D center in Bristol. After acquiring the British semiconductor company Icera in 2011, Nvidia put down roots in Bristol and invested tens of millions of pounds in new buildings and laboratories here.

In 1972, Silicon Valley’s famous Fairchild Semiconductor (the company the founders of Intel and AMD came from) made an important decision to enter the European market by setting up an office in Bristol. From then on, this small city in the west of England had a window onto the global semiconductor industry.

Six years later came Inmos, a microprocessor company born in Bristol that held 60% of the global SRAM market in the 1980s and received up to £200 million in investment from the Callaghan and Thatcher governments. Inmos ultimately built Britain’s semiconductor infrastructure and ecosystem with Bristol at its core, drawing in a host of semiconductor elites such as David May, the renowned British computer scientist who later founded XMOS Semiconductor.

“In fact, Bristol has always been a major IT hub in the UK. Together with nearby Swindon and Gloucester it forms a triangle known as the ‘Silicon Valley’ of Europe. When semiconductor companies set up R&D centers in Europe, Bristol is usually the first choice; world-class giants such as Nvidia, HP, Broadcom and Qualcomm all have offices there.”

A practitioner familiar with the European semiconductor industry told Tiger Sniff that many people associate British chips with Cambridge because of ARM, but historically Bristol has been the real center of British chip design.

“Huawei also has an R&D center in Bristol.”

In the 1950s, eight talented “traitors” walked out of Shockley Semiconductor to found Fairchild, whose alumni went on to start companies such as Intel and AMD and helped create today’s Silicon Valley. In the same spirit, Bristol’s talented engineers were unwilling to stay in the “past”. At this “critical point”, with the debate over the end of Moore’s Law at its height and AI reshaping computing architectures, every one of them aspires to lead the new era.

After graduating from Cambridge University, an engineer named Simon Knowles first arrived in Bristol in 1989 to take a chip design job at the memory company Inmos.

Over the next 20 years, from leading a dedicated processor team inside Inmos to co-founding two semiconductor companies, Element 14 and Icera, Knowles witnessed almost the entire arc of Moore’s Law, from its peak to its decline. Fortunately, both companies he helped found, with a combined valuation of more than $1 billion, were acquired, by Broadcom in 2000 and by Nvidia in 2011 respectively.

Unsurprisingly, this gifted chip designer and serial entrepreneur started over once again in 2016, co-founding a new semiconductor design company with Nigel Toon, another gifted semiconductor engineer, to seize the chip-architecture opportunities created by demand from the AI market.

That company is Graphcore, an AI accelerator designer that foreign media have called one of Nvidia’s biggest rivals. On December 29, 2020, it announced the close of a $222 million funding round, which left it with $440 million in cash on its balance sheet and lifted its valuation to $2.77 billion.

It is also the only unicorn among Western AI chip companies.

Pictured: Graphcore’s IPU processor

Western private equity and venture capital have long been wary of semiconductor projects, because they are highly capital-intensive and their early returns are hard to predict. As Knowles admitted in an interview: “In software you can try something on a small scale and, if it fails, dig another hole. If a chip design fails, the company has almost no choice but to burn through all its money.”

It was not until after 2018, as the prospects for commercializing AI kept growing and being amplified, that investors became convinced they could see returns in the trend of “large-scale AI computing driving changes in chip architecture”.

As a result, Graphcore, which raised more than US$80 million in 2017, went on to raise US$200 million in 2018 and US$150 million in 2020.

Beyond Bosch and Samsung, which took part in the Series A, Sequoia Capital led Graphcore’s Series C, while Microsoft and BMW i Ventures led its Series D.

The Series E was led by a non-industry investor, the Ontario Teachers’ Pension Plan Board of Canada, with Fidelity International and Schroders also joining the round.

As the investor list shows, Graphcore’s industry investors fall into three camps: cloud computing (data centers), mobile devices (phones) and cars (autonomous driving). These are precisely the three industries that AI technology “invaded” first.

Image from Crunchbase

The industry increasingly seems to agree that the AI era needs a foundational innovator like ARM was in the mobile era: a company that can not only sell hundreds of millions of chips but also drive the deep integration of AI with every industry, ultimately reaching billions of ordinary consumers.

In product terms, Graphcore delivered an eye-catching release in 2020: its second-generation IPU, shipped in a compute platform called the IPU-Machine (IPU-M2000). Poplar, the software stack that supports its chips, was updated at the same time.

“Teaching a computer how to learn and teaching a computer to do math problems are two completely different things. To improve a machine’s ‘understanding’, what matters at the lowest level is efficiency, not speed.” Graphcore CEO Nigel Toon regards developing a new generation of AI chips as “a once-in-a-lifetime opportunity.”

“Any company that can pull this off will share the power to decide how AI technology is innovated and commercialized over the coming decades.”

Focusing on Nvidia’s “soft underbelly”

Every AI chip designer wants to take down Nvidia, now worth $339.4 billion. Put another way, every one of them wants to build a better AI accelerator than the GPU.

So over the past five years, chip design companies of all sizes have filled their slide decks with comparisons between their own chips and Nvidia’s T4, V100 and even the recently released A100, billed as its “strongest product”, to prove that their processors compute more efficiently.

Graphcore is no exception.

It, too, argues that because earlier processors such as the central processing unit (CPU) and graphics processing unit (GPU) were not designed specifically for AI workloads, the industry needs a new chip architecture suited to new ways of processing data.

Nor is this merely the self-serving view of interested parties.

Criticism of GPUs from academia and industry is growing louder and cannot be ignored: as AI training and inference models rapidly diversify, the fact that GPUs were never designed for AI is exposing the areas they are “not good at”.

“If you are only doing convolutional neural networks (CNNs) in deep learning, the GPU is a good solution. But as networks grow longer and more complex, GPUs can no longer satisfy AI developers’ growing appetite.”

An algorithm engineer told Tiger Sniff that GPUs are fast because they are inherently built to process tasks in parallel (for an explanation of GPUs and their characteristics, see the article “Kill Nvidia”). If the data is sequential and cannot be parallelized, a CPU has to be used instead.
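To make the parallel versus sequential distinction concrete, here is a minimal Python sketch; the function names and the update rule are hypothetical, chosen only for illustration. In the first function every output is independent, so in principle each element could go to a separate core. The second is an RNN-style update, assumed here to be `h = tanh(w*h + x)`, where each step needs the previous step's result.

```python
import math
from typing import List


def scale_all(xs: List[float], w: float) -> List[float]:
    # Each output depends only on its own input, so every element could be
    # handed to a different core and computed at the same time.
    return [w * x for x in xs]


def rnn_states(xs: List[float], w: float) -> List[float]:
    # Hypothetical RNN-style update h[t] = tanh(w * h[t-1] + x[t]):
    # step t cannot start until step t-1 has finished.
    h = 0.0
    states = []
    for x in xs:
        h = math.tanh(w * h + x)
        states.append(h)
    return states


xs = [1.0, 2.0, 3.0, 4.0]
print(scale_all(xs, 0.5))   # [0.5, 1.0, 1.5, 2.0]
print(rnn_states(xs, 0.5))  # four values, each computed from the one before
```

Purely linear recurrences can sometimes be rewritten as a parallel prefix scan; the nonlinearity in the second function is what rules that trick out here.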

“In many cases, since the hardware is fixed, we find ways at the software layer to turn sequential data into parallel data. In language models, for example, text is sequential, but a training mode called ‘teacher forcing’ allows it to be converted into parallel training.

But certainly not every model can do this. Deep reinforcement learning, for example, is not well suited to GPUs, because it is hard to find a way to parallelize it.”
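The teacher-forcing idea can be sketched in a few lines of Python. This is an illustrative assumption, not any framework's API: `BOS`, `teacher_forcing_pairs` and `free_running` are hypothetical names. Under teacher forcing, the input at each step is the ground-truth prefix, so all (input, target) pairs exist before training begins. Whether those steps then actually run in parallel depends on the model: a Transformer with a causal mask can process them all at once, while an RNN's hidden state still has to advance step by step.

```python
from typing import Callable, List, Tuple

BOS = "<s>"  # hypothetical start-of-sequence marker


def teacher_forcing_pairs(tokens: List[str]) -> List[Tuple[List[str], str]]:
    # Teacher forcing: the input at step t is the ground-truth prefix, not the
    # model's own earlier prediction, so every (input, target) pair is known
    # before training starts and all steps can be placed in a single batch.
    prefixed = [BOS] + tokens
    return [(prefixed[: t + 1], tokens[t]) for t in range(len(tokens))]


def free_running(predict: Callable[[List[str]], str], steps: int) -> List[str]:
    # Without teacher forcing, step t consumes the model's output from step
    # t-1, so the steps have to run strictly one after another.
    seq = [BOS]
    for _ in range(steps):
        seq.append(predict(seq))
    return seq[1:]


for prefix, target in teacher_forcing_pairs(["the", "cat", "sat"]):
    print(prefix, "->", target)
# ['<s>'] -> the
# ['<s>', 'the'] -> cat
# ['<s>', 'the', 'cat'] -> sat
```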

Seen this way, the claim heard in some academic circles that “GPUs are holding back AI innovation” is not mere sensationalism.

The four development threads of deep learning. Chart: Utada

Deep learning has been one of the fastest-growing branches of machine learning over the past 10 years. As neural network models evolve rapidly and multiply in type, GPU hardware alone can hardly keep pace with such complex computation.

Graphcore gave a more detailed answer: in the branches of deep learning beyond CNNs, especially recurrent neural networks (RNNs) and reinforcement learning (RL), the limits of the GPU have constrained many developers’ research.

DeepMind, for example, the British AI company that used reinforcement learning to build AlphaGo, took notice of Graphcore early on because of the limits of GPU computing. Its co-founder Demis Hassabis eventually became a Graphcore investor.

“When product developers at a company hand their requirements (especially latency and throughput targets) to the computing platform team, the answer is usually no: ‘GPUs currently can’t deliver such low latency and such high throughput.’

The main reason is that the GPU architecture suits computer vision (CV) tasks with dense data, such as ‘static image classification and recognition’, but it is not the best choice for training models on sparse data.

Text-related algorithms, such as those in natural language processing (NLP), have on the one hand relatively little data (it is sparse); on the other hand, during training they need to pass over the data many times and quickly return periodic feedback, so that the next round of training has context that aids understanding.”

In other words, this is a training process in which data continues to flow and circulate.

Think of the “Guess You Like” feature in the Taobao app. After “learning” from your browsing and order data on day one, it feeds a small amount of experience back to the algorithm to correct it; on day two, day three and every day after, it keeps learning and feeding back, and comes to understand your product preferences better and better.
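The daily learn-and-feed-back loop can be illustrated with a toy Python sketch. This is not Taobao's algorithm, which the article does not describe; it swaps in a simple exponential moving average, and `update_preferences` and the learning rate `lr` are hypothetical. Each day, every category's score moves part of the way toward the share of that day's clicks it received.

```python
from typing import Dict, List


def update_preferences(prefs: Dict[str, float], clicks: List[str],
                       lr: float = 0.3) -> Dict[str, float]:
    # Hypothetical daily update: move each category's score a fraction `lr`
    # of the way toward the share of today's clicks it received.
    if not clicks:
        return dict(prefs)  # no new feedback today, keep yesterday's view
    counts: Dict[str, int] = {}
    for category in clicks:
        counts[category] = counts.get(category, 0) + 1
    updated = {}
    for category in set(prefs) | set(counts):
        observed = counts.get(category, 0) / len(clicks)
        updated[category] = (1 - lr) * prefs.get(category, 0.0) + lr * observed
    return updated


prefs: Dict[str, float] = {}
daily_clicks = [["shoes", "shoes", "books"], ["shoes"], ["shoes", "books"]]
for day, clicks in enumerate(daily_clicks, start=1):
    prefs = update_preferences(prefs, clicks)
    print(day, max(prefs, key=prefs.get))  # top category so far: shoes each day
```

The point is the shape of the workload: each day's update depends on the previous day's state, which is the "data keeps flowing and cycling" pattern described next.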

Tasks of this kind include BERT, proposed by Google in 2018 and later used to improve search. BERT is one of the most influential NLP models of recent years (built on the Transformer architecture rather than as an RNN), and it is also the kind of task Graphcore says GPUs are “very bad at”. To handle such workloads, many companies still train on large numbers of CPUs.

CPU and GPU architecture comparison

Fundamentally, this comes down to one of the biggest bottlenecks in today’s chip systems: how to move data from memory to the logic units without consuming so much power. In the era of data explosion, breaking this bottleneck has become ever more urgent.
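One standard way to reason about this bottleneck is the roofline model, which the article does not itself use: attainable performance is the smaller of the chip's peak compute and memory bandwidth multiplied by arithmetic intensity (FLOPs per byte moved). The sketch below assumes a hypothetical accelerator with 100 TFLOP/s peak and 1 TB/s bandwidth; these are illustrative numbers, not specs of any product named here.

```python
def attainable_flops(peak_flops: float, mem_bandwidth: float, intensity: float) -> float:
    # Roofline model: performance is capped either by compute or by how fast
    # memory can feed the compute units (bandwidth x FLOPs-per-byte).
    return min(peak_flops, mem_bandwidth * intensity)


PEAK = 100e12  # hypothetical 100 TFLOP/s accelerator
BW = 1e12      # hypothetical 1 TB/s memory bandwidth

# Element-wise add of two fp32 vectors: 1 FLOP per element,
# 12 bytes moved (two 4-byte reads, one 4-byte write).
add_intensity = 1 / 12
# Square fp32 matmul, n x n: 2n^3 FLOPs over 3 * n^2 * 4 bytes (ideal reuse).
n = 4096
matmul_intensity = (2 * n**3) / (3 * n**2 * 4)

for name, ai in [("vector add", add_intensity), ("matmul", matmul_intensity)]:
    perf = attainable_flops(PEAK, BW, ai)
    bound = "memory-bound" if perf < PEAK else "compute-bound"
    print(f"{name}: {perf / 1e12:.2f} TFLOP/s ({bound})")
# vector add: 0.08 TFLOP/s (memory-bound)
# matmul: 100.00 TFLOP/s (compute-bound)
```

Operations with little reuse per byte leave most of the compute idle waiting on memory, which is why moving data cheaply matters as much as raw arithmetic.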

Model sizes make the point: in October 2018, BERT-Large had 330 million parameters; by 2019, GPT-2 had reached 1.55 billion (both are natural language processing models). The impact of this growth, from the underlying hardware all the way up to SaaS services, cannot be underestimated.
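A quick back-of-the-envelope calculation shows what these parameter counts mean in bytes. The sketch below uses the article's figures and assumes 4 bytes per parameter in fp32 and 2 in fp16; it counts weights only, while training also needs memory for activations, gradients and optimizer state, often several times more.

```python
def weight_bytes(num_params: int, bytes_per_param: int) -> int:
    # Memory needed just to hold the weights, nothing else.
    return num_params * bytes_per_param


models = {"BERT-Large": 330_000_000, "GPT-2": 1_550_000_000}
for name, params in models.items():
    for dtype, size in [("fp32", 4), ("fp16", 2)]:
        gb = weight_bytes(params, size) / 1e9
        print(f"{name} {dtype}: {gb:.2f} GB")
# BERT-Large fp32: 1.32 GB
# BERT-Large fp16: 0.66 GB
# GPT-2 fp32: 6.20 GB
# GPT-2 fp16: 3.10 GB
```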

A traditional GPU or CPU can of course perform a chain of operations, but it must “first access registers or shared memory, then read and store intermediate results.” It is like fetching stored ingredients from an outdoor cellar and carrying them back to the indoor kitchen to cook: all that back and forth inevitably hurts the system’s overall efficiency and raises its power consumption.
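The cellar analogy maps onto a familiar compiler technique, kernel fusion, which we use here only as an illustration (the article does not say any vendor does exactly this). The Python sketch models memory traffic by counting element loads and stores rather than measuring real hardware: the unfused version writes every intermediate result back to "memory", while the fused version keeps it in a local variable standing in for a register.

```python
from typing import Callable, List, Tuple

Op = Callable[[float], float]


def run_unfused(xs: List[float], ops: List[Op]) -> Tuple[List[float], int]:
    # Each op reads the whole buffer from memory and writes its intermediate
    # result back: the "trip to the cellar" happens once per op.
    traffic = 0
    buf = xs
    for op in ops:
        traffic += len(buf)          # load the previous result
        buf = [op(x) for x in buf]
        traffic += len(buf)          # store the new intermediate
    return buf, traffic


def run_fused(xs: List[float], ops: List[Op]) -> Tuple[List[float], int]:
    # All ops are applied while the value sits in a local variable (standing
    # in for a register), so memory is touched once to load, once to store.
    out = []
    for x in xs:
        for op in ops:
            x = op(x)
        out.append(x)
    return out, 2 * len(xs)


ops: List[Op] = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
print(run_unfused([1.0, 2.0, 3.0], ops))  # ([1.0, 3.0, 5.0], 18)
print(run_fused([1.0, 2.0, 3.0], ops))    # ([1.0, 3.0, 5.0], 6)
```

Same results, one third of the modeled traffic; the gap grows with the number of chained operations.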

That is why the core architectural idea at many semiconductor startups is to “bring memory closer to processing to speed up the system”, that is, near-memory computing. The concept is not new, but very few companies have actually built working products.

So what exactly did Graphcore do? Put simply, it “changed the way memory is deployed on the processor.”

On an IPU about the size of a small soda cracker, besides the 1,216 processing units called IPU-Cores that it integrates, the biggest difference from GPUs and CPUs is its large-scale deployment of “on-chip memory”.

In short, SRAM (static random access memory) is distributed right next to each compute unit and external memory is dropped, minimizing data movement. The goal is to break through the memory bandwidth bottleneck by cutting the number of loads and stores, greatly reducing data-transfer latency and power consumption.
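A rough way to see when "the whole model lives on chip" is possible is to compare model size with total distributed SRAM. In the sketch below, the 1,216-core count comes from the article; the 256 KiB per core and the 50% reserved for code and activations are assumptions for illustration only, not published Graphcore specifications.

```python
N_TILES = 1216                 # IPU-Core count quoted in the article
SRAM_PER_TILE = 256 * 1024     # assumed per-core SRAM in bytes (illustrative)
RESERVED_FRACTION = 0.5        # hypothetical share kept for code and activations


def fits_on_chip(model_bytes: int) -> bool:
    # Weights are spread evenly across all cores; the model "fits" only if
    # it stays within the SRAM left over after the reserved share.
    usable = N_TILES * SRAM_PER_TILE * (1 - RESERVED_FRACTION)
    return model_bytes <= usable


print(f"total on-chip SRAM: {N_TILES * SRAM_PER_TILE / 2**20:.0f} MiB")  # 304 MiB
print(fits_on_chip(20_000_000 * 2))   # hypothetical 20M-param fp16 model: True
print(fits_on_chip(330_000_000 * 2))  # BERT-Large weights in fp16: False
```

Under these assumptions a small model fits comfortably while BERT-Large does not, which is consistent with the article's qualifier that the big speedups apply to "some specific algorithms".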

IPU architecture

Because of this, when training certain specific algorithms whose entire model fits on the processor, tests show the IPU can indeed run 20 to 30 times faster than a GPU.

In computer vision, for example, besides the well-known and widely used residual networks (ResNets), which suit GPUs well, image classification models built on grouped and depthwise convolutions, such as EfficientNet and ResNeXt, are gradually emerging in research.

One characteristic of grouped convolution is that its data is not dense.
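Why grouped convolution is "less dense" can be shown by counting. The Python sketch below is an idealized model, assuming stride 1, "same" padding, fp32 data, and that each tensor is moved to or from memory exactly once; `conv_stats` is a hypothetical helper. Splitting channels into groups shrinks the weights and the multiply-accumulates (MACs) by the group count, but the input and output activations stay the same size, so work per byte moved collapses.

```python
from typing import Tuple


def conv_stats(c_in: int, c_out: int, k: int, h: int, w: int,
               groups: int = 1, bytes_per_elem: int = 4) -> Tuple[int, int, float]:
    # Assumes stride 1 and "same" padding, so the output is h x w.
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    weights = k * k * (c_in // groups) * c_out
    macs = weights * h * w
    # Idealized traffic: read input once, read weights once, write output once.
    moved = (h * w * c_in + weights + h * w * c_out) * bytes_per_elem
    return weights, macs, macs / moved


for g in (1, 32, 256):  # standard, grouped (ResNeXt-style), depthwise
    weights, macs, intensity = conv_stats(256, 256, 3, 56, 56, groups=g)
    print(f"groups={g}: {weights} weights, {intensity:.1f} MACs/byte")
```

For this 3x3, 256-channel layer on a 56x56 map, intensity drops from roughly 210 MACs per byte (standard) to about 9 (32 groups) and about 1 (depthwise): the chip spends proportionally more time moving data than computing.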

Sujeeth, a machine learning scientist at Microsoft, used Graphcore’s IPU to train an image classification model based on EfficientNet. The IPU completed analysis of chest X-ray samples from COVID-19 patients in 30 minutes, a workload that usually takes a traditional GPU 5 hours.

Test results

However, just as the popularity of GPUs and the wide adoption of ResNets, the mainstream model in computer vision, reinforced each other, the key to whether Graphcore succeeds or fails also lies in the word “specific”.

As Graphcore’s vice president of sales and general manager for China pointed out in an interview with Tiger Sniff:

On the one hand, its products are indeed better suited to deep learning training tasks with sparse data and high precision requirements, such as NLP-related recommendation tasks. This is one of the important reasons Alibaba Cloud and Baidu are willing to work with it.

On the other hand, the newer models just gaining popularity in computer vision are what the IPU is trying to “conquer”, while many earlier models still belong to the GPU.

In addition, CUDA, the powerful software ecosystem built around the GPU, is even harder to topple than the hardware (CUDA is also explained in detail in the article “Kill Nvidia”), and this wall is the key to building industry influence.

There is no doubt that Graphcore’s foundation here is still relatively shallow, so besides the usual moves, it has chosen to make some relatively bold attempts built on its programming software, Poplar.

For example, it opened the source code of its PopLibs computing library to its developer community, allowing developers to try describing new convolutional network layers. PopLibs is aimed squarely at Nvidia’s cuDNN and cuBLAS, which Nvidia has not open-sourced.

As a nod to the open-source community, Poplar v1.4 adds full support for PyTorch. This smart move should make the platform easier to adopt and help attract wider community participation.

In addition, to open up the market as quickly as possible, Graphcore did not take the lab-style route of “entering competitions to raise industry visibility”. Instead, it pushed the IPU directly into industry, knocking on the doors of server integrators, cloud vendors and other customers one by one.

“The AI industry itself moves very fast, whether in algorithm iteration or model changes. Some cloud vendors have complained that a processor ran a certain model with excellent performance, but once the model changed even slightly, the resulting performance was dismal.”

Luo Xu, CTO of Graphcore China, believes that although the market is heavily promoting ASICs (application-specific chips) and FPGAs (programmable chips), versatility is still the first thing the industry looks for in a chip, especially Internet companies.

“Internet companies run many applications, and each application suits a different model. If a processor can only handle one model, customers cannot roll it out at scale.”

Whether the programming environment is friendly, the kind of strength CUDA gives Nvidia, is the second key procurement criterion.

“Customers now generally design models with AI frameworks such as Google’s TensorFlow and Facebook’s PyTorch. They consider whether the processor’s upper-level SDK integrates easily into the framework and whether the programming model is easy to use.

Customers may also need operator-level optimizations and have to write custom operators. How convenient it is to develop those custom operators also depends on how friendly the programming is.”

Beyond that, what customers care about is, of course, product performance.

Cloud vendors, server vendors and developers who buy computing power through cloud services all test how a range of models perform on the chip.

“If they mainly value NLP (natural language processing) models, they may focus on BERT in performance testing. If they value computer vision, they may focus on some classic computer vision models.

Overall, customers must make a comprehensive assessment across these dimensions before deciding whether to use a processor; in other words, they must determine how much benefit it can bring them.”
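The performance testing described above usually boils down to a timing harness run once per candidate model. Below is a minimal, generic sketch in Python, not any vendor's benchmark suite; `benchmark` and the two stand-in workloads are hypothetical. It discards warm-up runs, reports the median latency over several iterations, and converts that into samples per second for a given batch size.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(step: Callable[[], object], batch_size: int,
              warmup: int = 3, iters: int = 10) -> Dict[str, float]:
    # Warm-up runs absorb one-off costs (compilation, cache fills, clock ramp-up).
    for _ in range(warmup):
        step()
    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        step()  # with an asynchronous accelerator API, this must block until the device finishes
        timings.append(time.perf_counter() - start)
    median = max(statistics.median(timings), 1e-12)  # guard against a zero reading
    return {"median_latency_s": median, "samples_per_s": batch_size / median}


# Hypothetical stand-ins; a real evaluation would run e.g. BERT or a classic CV model.
workloads = {
    "nlp_stand_in": lambda: sum(i * i for i in range(50_000)),
    "cv_stand_in": lambda: [x * 0.5 for x in range(50_000)],
}
for name, step in workloads.items():
    result = benchmark(step, batch_size=32)
    print(f"{name}: {result['samples_per_s']:.0f} samples/s")
```

The median is used instead of the mean so that a single slow outlier run does not distort the comparison between chips.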

In this respect, Nvidia’s GPUs, Graphcore’s IPU and other vendors’ dedicated chips each have the models they run best. Each has its own strengths, and no blanket verdict applies.

Winner-takes-all no longer applies

Judging by the benchmark metrics and promotional highlights Graphcore presents, the company is a hammer looking for nails: it is working hard to expand the application scenarios the IPU excels at, so that the IPU architecture can deliver maximum efficiency.

In other words, Graphcore may take a share of Nvidia’s market, but it will never replace Nvidia.

Just as the word “specific” implies limits, the market for AI training and inference chips, given the diversity and complexity of models, will surely have room for more chip companies, Nvidia and Graphcore included.

Nigel Toon has likewise said that AI computing will give rise to three vertical chip markets:

Relatively simple, small dedicated accelerators, such as IP cores in phones, cameras and other smart devices;

ASICs built for specific functions in the data center, which solve particular problems in depth; hyperscale data center operators (cloud vendors) will find plenty of opportunity in this market;

Finally, programmable AI processors, the market where the GPU sits. More companies will enter this market, and future innovation will surely give it a larger share.

CPUs will continue to exist and GPUs will keep innovating; for certain AI computing tasks they are indispensable or the best choice. But the new markets spawned by trends such as the end of Moore’s Law, AI computing and the data explosion are bound to be huge and diverse, and it is precisely this diversity that gives more specialized chip companies new opportunities.

That is why chip startups such as Cerebras, Groq, SambaNova Systems and Mythic AI have been able to raise hundreds of millions of dollars, and why Intel this year invested in Untether AI, which is reinventing AI chip architecture. Many have already predicted that a new generation of “Apple” and “Intel” may be born in the AI computing market.

And with software still lagging behind hardware, the fierce competition has only just begun.
