Chapter -1-

The Transition from General-Purpose to Specialized Chips

A Game Changer for The Semiconductor Industry


2020.04.11

The Eras of General-Purpose and Specialized Chips

General-purpose products by nature are used in many different applications. Consequently, they are produced in large quantity resulting in low cost, which in turn leads to wide adoption. Meanwhile, specialized products, although expensive, offer better performance, quality, and reliability.

In the semiconductor industry, general-purpose chips dominate. With annual revenue of approximately $500 billion and a production volume of 2 trillion chips, the average unit price is only around 25 cents.

Even the most advanced chips, made in state-of-the-art factories that cost tens of billions of dollars to build, sell for only a few dollars. It is therefore a low-margin business that relies on volume for profit.

Thanks to the adoption of the von Neumann computer architecture, the industry has been able to generate demand for large volumes of general-purpose chips.

Under the von Neumann architecture, a processor reads the processing procedure and data from memory, processes the data following the processing procedure, and writes the results back to memory. By looping through this process repeatedly in sequence, it is possible to implement processing of any complexity. Meanwhile, by changing the processing procedure (or program), it is possible to perform any kind of processing.
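The loop described above can be sketched in a few lines of Python. This is only an illustration: the four opcodes (LOAD, ADD, STORE, HALT), the single accumulator, and the memory layout are hypothetical choices made for this sketch, not a description of any real processor.

```python
# Minimal stored-program machine: program and data share one memory.
# The instruction set below is hypothetical and chosen only for illustration.
def run(memory):
    """Fetch and execute instructions until HALT, then return the memory."""
    pc = 0    # program counter: address of the next instruction
    acc = 0   # single accumulator register
    while True:
        op, arg = memory[pc]      # read the next step of the procedure from memory
        pc += 1
        if op == "LOAD":          # read data from memory
            acc = memory[arg]
        elif op == "ADD":         # process the data
            acc += memory[arg]
        elif op == "STORE":       # write the result back to memory
            memory[arg] = acc
        elif op == "HALT":
            return memory
        else:
            raise ValueError(f"unknown opcode: {op}")

# Addresses 0-3 hold the program, addresses 4-6 hold the data.
memory = [
    ("LOAD", 4),
    ("ADD", 5),
    ("STORE", 6),
    ("HALT", None),
    2, 3, 0,
]
result = run(memory)
print(result[6])  # 5
```

Replacing the four instructions at the start of memory changes what the machine computes while the hardware, here the run function, stays the same.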

By adopting this architecture, the evolution of the computer has followed a scenario where general-purpose hardware – processors and memory – are produced in large volumes to drive adoption, and software is used to tailor the hardware for various applications. As a result, the focus of the semiconductor business has been on the production of large volumes of processors and memory at low costs. More recently, the rise of big data has resulted in sensors being added as another mass-produced device type.

Competition in such a business is in terms of capital investment. When the business potential of a newly invented device such as DRAM, flash memory, CPU, or GPU is recognized, large capital investments follow. This quickly leads to fierce competition, resulting in industry realignment and eventually consolidation.

Japanese companies won the competition in device innovation but lost the competition in capital investment.

On the other hand, specialized chips had their share of success too - ASIC (Application Specific Integrated Circuit) had a sizable market between 1985 and 2000.

Glue logic, which interconnects processors and memory, varies from system to system. While it was initially implemented using a combination of standard logic chips, it was later integrated into ASICs to reduce system cost and area.

Another important reason that turned ASIC into a profitable business was the adoption of computer-aided design (CAD) to dramatically reduce both development cost and time. Using CAD, a complicated chip that previously would take 100 engineers a year to design could be designed by 1 engineer in one month.

In the 1980s, research and development efforts led by the University of California at Berkeley resulted in the creation of automatic layout and logic synthesis technology as well as the birth of chip design tool vendors. Furthermore, a semi-custom manufacturing process was developed in which a semi-finished chip was first made like a standard product and then tailored to different applications by customizing the interconnect layers.

Using these design methodology innovations, chip development productivity was improved by three orders-of-magnitude in total.

Nevertheless, with Moore’s Law increasing integration density by three orders-of-magnitude in 15 years, even with computer-aided design, it took more man-hours than ever to develop specialized chips. This contributed to the profit erosion of the ASIC business, which eventually resulted in its demise.

In this manner, an era of general-purpose chips is started by device innovations and ended after fierce competition in capital investment. Meanwhile, an era of specialized chips is started by design methodology innovations and ended by Moore’s Law.

Fig. The energy crisis created by the explosion of data and the slowing of Moore’s Law are driving the resurgence of specialized chips in the form of domain-specific ICs.

A Game Changer: In-house Development of Specialized Chips at GAFA

However, we are now in the middle of a game-changing trend – IT giants such as GAFA have embarked on in-house development of specialized chips, since they find it difficult to compete by relying on general-purpose chips procured from dedicated chipmakers such as Intel and Qualcomm.

There are three reasons for that.

The first reason is the unique energy crisis faced by data companies. The explosive growth of data and the rising sophistication of AI processing have fueled the energy crisis.

Without further advances in low-power technology, by 2030 IT machines alone would consume about twice the total power output of today, increasing to two hundred times by 2050.

If digital transformation consumes so much energy as to destroy the environment, it will make a sustainable future impossible.

At the beginning, a chip consumed just about 0.1W of power. Under the ideal scaling scenario its price-performance could be improved while its power density remains constant.

In reality, however, as a result of prioritizing price-performance over power, power was allowed to increase to achieve price-performance improvement beyond what was possible under the ideal scaling scenario, resulting in a 1000-fold increase in 15 years and reaching 100W in 2000. Chip power density is now more than 30 times that of a hotplate used for cooking. As a result, it requires a tremendous amount of power to cool a cloud server.

When chip power exceeds the cooling limits, even if we can increase integration density, we cannot power up and use all the transistors at the same time. The more chip power exceeds the cooling limits, the more there are unused transistors. For example, while unused transistors account for about 75% of the total in the 7nm generation, the ratio is expected to grow to 80% in the 5nm generation.

Under such constraints, only those who can improve energy efficiency 10-fold can achieve 10-fold increase in computing performance, or 10-fold increase in smartphone battery life.

Compared to general-purpose chips which can be used to perform every function, specialized chips can achieve a more than 10-fold increase in energy efficiency by eliminating unnecessary circuits.

The second reason for the shift to specialized chips development is the rise of AI. AI in the form of neural networks and deep learning offers a new way of information processing to owners of data.

Similar to our brain, neural networks are based on wired logic where functionality is defined by how the components are wired together. Furthermore, data is processed in a parallel fashion as it flows through the network. Since parallel processing enables lower operating frequency and hence lower chip voltage, wired logic can improve energy efficiency by more than 10-fold compared to von Neumann architecture where data is processed sequentially.

The third reason for the shift to specialized chips development is the adoption of the fabless model by the semiconductor industry. Under this model, pure-play foundries such as TSMC offer manufacturing services to the world, which enables any user adopting a business model of providing superior AI performance to develop their own chips.

For companies offering a hardware platform solution that drives the demand for sufficiently large chip volumes, the fabless model allows them to design specialized chips to more quickly realize chips with higher performance and at lower cost than procuring from dedicated chipmakers.

The Manufacturing Industry in A Knowledge-Based Society

As Alan Kay once pointed out, “People who are really serious about software should make their own hardware.” In system development, it is necessary to think about both hardware and software.

The choice of architecture depends on the type of processing performed. Logical and arithmetic processing which requires versatile controllability is better performed using the conventional solution of general-purpose chips implementing the von Neumann architecture. Meanwhile, intuitive and spatial information processing which requires sophisticated AI computation is better performed using neural networks implemented in specialized chips, which, for reasons previously explained, achieve high energy efficiency. With the shift of chip application from products to services, the quest for a new, matching architecture has commenced.

The fact remains that the choice between general-purpose and specialized chips involves tradeoffs between low cost and high performance.

For illustration, let’s look at data communication. Communication infrastructure which does not drive large volumes can be created by adopting general-purpose hardware and implementing unique functionalities using virtualization technology. However, at the edge where there are comparatively large device volumes, specialized chips can be utilized to boost performance to enable distributed processing local to where the data is generated.

The development of specialized chips is knowledge-intensive, not capital-intensive. Development of automatic layout and logic synthesis was previously driven by the University of California at Berkeley. Similarly, university research will play an integral role in creating the fundamental knowledge required for specialized chips development, including knowledge for automatic generation of functionalities and systems.

The 20th century was the century of “general-purpose”. After the war, in a quest for material gratification and economic efficiency, economic growth was driven by mass production of standardized products.

However, as modern society matures, our emphasis has shifted from collective growth to personal fulfillment. This has resulted in the transition from an industrial to a knowledge-based society.

While this transition was spreading from developed to developing countries, Japan was able to enjoy a period of prosperity by continuing to mass-produce standardized products. But eventually when the transition was complete, Japan fell behind other Asian nations due to its focus on manufacturing and hence slow adaptation to the transition.

The 21st century looks to be the century of “specialization”. The center of our value is shifting from being capital-intensive to being knowledge-intensive, from scale to knowledge, from increase in quantity to increase in quality, from material to spirit, from convenience to joy, from products to services, from large volume to large variety, from standardization to individualization, from what everyone can do to what no one else can do.

How should the manufacturing industry adapt to such a shift? It is d.lab’s mission to search for an answer.

Chapter -2-

A Short History of the Brain, Computer, and IC

And One Scenario of the Future


2020.04.11

The Birth of the Brain, Computer, and IC

13.8 billion years ago, a gigantic mass of energy suddenly came into being. That is the basis of the Big Bang theory.

Energy and matter interacted (E=mc²), and the universe rapidly expanded. What started as a little tremor created the Galaxy, and the Earth was born 4.6 billion years ago.

While matter transformed following physical laws, life came into being 4 billion years ago, which replicates itself by coding and storing its structure in the form of DNA.

Life, using mutation and survival of the fittest as strategy, survived in an uncertain environment and diversified by evolving from single-celled to multicellular organisms, to plants and animals.

As animals continued to evolve, they eventually developed the brain, the central nervous system that determines their actions based on information they collect from the outside world. Then 7 million years ago, the human brain advanced further, differentiating humans from other mammals.

In order to survive, humans learned the importance of working together. In other words, it is the brain that created society and gave birth to the mind. Furthermore, we developed languages to communicate our intentions, in addition to acquiring the ability to think logically.

Mathematics was born 3000 years ago.

Mathematics expanded our cognitive capacity. The four great ancient civilizations used computing tools and principles such as Pythagoras' theorem for tasks like calculating taxes and surveying land. Later, in the 5th century B.C. in ancient Greece, the inner world of mathematics, rather than its use for computation, became the subject of academic pursuit, and mathematics evolved from being a tool into a way of thinking.

With developments such as the advance of algebra in Arabia in the 7th Century and the invention of symbolic algebra during the Renaissance in the 15th Century, mathematics spread without constraints and became ubiquitous. Then in the 17th Century, the development of calculus enabled inquiry into the world of the infinite. As a result of close examination of concepts such as limits and continuity, an abstract symbolism was born that reaches beyond subjective intuition.

By the 20th Century, efforts began to turn mathematics on itself, creating a mathematics of how mathematics is performed. By completely shedding ambiguous ideas such as physical intuition and subjective feeling, the mathematics that flowed out of our brain gave birth to the computer, a machine that performs computation.

Early computers were wired logic machines where programming was achieved by changing the wiring between computational units.

This architecture faced two challenges. The first was the challenge of scale where the scale of problems that could be solved was limited by the scale of the hardware. The second was the challenge of wiring where the number of interconnections exploded as the scale of the system rose.

Von Neumann devised the stored-program architecture that bears his name. Both the data to be processed and the commands that govern its movement and processing are stored in memory in advance; the processor retrieves them, then interprets and executes the commands one after another. This was a revolutionary architecture: instead of multiple computational units physically connected in a particular way, it uses a single computational unit that executes a different command in each cycle. It overcame the challenge of scale.

Meanwhile, after approaching the challenge of wiring from different angles, the revolutionary solution that emerged was the integrated circuit (IC) invented by Jack Kilby. It overcame the challenge of wiring by using photolithography to integrate multiple building block elements onto the same chip and wiring them together all at once.

The integration and parallelization of simplified and miniaturized computing resources onto an IC resulted in quantum leaps in computational performance. High performance computers in turn enabled the development of even larger scale IC. In this manner, computer and IC performance advanced together driven by Moore’s Law.

Fig. Chip scaling enables computer downsizing, which in turn enables further scaling, resulting in the two advancing in tandem.

The Growth of the IC and Its Limits

The cost-performance of the IC can be exponentially improved by process scaling. The growth scenario of the IC has been driven by the rule of thumb that is known as Moore’s Law.

Since cost is determined by lithography, as lithography approaches its limits, cost per transistor starts to rise. Specifically, cost per transistor started to rise in 2015 in the 16nm generation.

However, with the introduction of EUV in 2019 in the 7nm generation, transistor’s unit cost is expected to fall once again.

But the challenge in overcoming the limits on performance improvement remains. The cause is power consumption and hence heat generation reaching the thermal limits, which prevents further increase in integration density.

It is computational performance per unit power consumed, or power efficiency (GFLOPS/W), which controls the destiny of Moore’s Law. In other words, performance cannot be improved without improving power efficiency.

Power has been increasing as a consequence of scaling. Since transistors operate by the electric field effect, if a device is scaled while the electric field is kept constant, power should remain constant.

But in reality, during the 1980s and 1990s, devices were scaled without a proportional reduction in supply voltage in order to achieve additional circuit speedup. Consequently, power increased 4-fold every 3 years, resulting in a 3 orders-of-magnitude increase in 15 years.

Although voltage finally started to drop in 1995 when power became exceedingly high, the electric field inside the device has become so strong that current is not decreasing adequately. As a result, power has continued to double every 6 years.
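The two growth rates quoted above can be checked with a short calculation. The function name and the choice to compare both rates over the same 15-year span are assumptions made for this sketch.

```python
def growth(factor, period_years, total_years):
    """Total growth when a quantity multiplies by `factor` every `period_years`."""
    return factor ** (total_years / period_years)

before_1995 = growth(4, 3, 15)   # 4-fold every 3 years, over 15 years
after_1995 = growth(2, 6, 15)    # doubling every 6 years, over the same span

print(before_1995)            # 1024.0, about 3 orders of magnitude
print(round(after_1995, 2))   # 5.66
```

Five quadruplings give 4^5 = 1024, which matches the three orders of magnitude in the text, while doubling every 6 years yields less than a 6-fold increase over the same period.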

Since the power increase is a consequence of scaling, it is not easy to resolve. It requires thinking from first principles.

In electronic devices, information is carried by electrons. In CMOS circuits, the charge used in information processing is expressed as Q=CV, where C is the circuit capacitance and V its supply voltage. The energy associated with the charge is given by E=QV=CV². Since power is the energy dissipated per second, it is computed by multiplying energy with the switching frequency, resulting in P=fαCV², where f is the clock frequency and α the switching probability.

Therefore, there are three ways to reduce power – lowering voltage V, capacitance C, and switching frequency fα.
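A minimal sketch of the power expression above shows how each of the three factors acts. The baseline values (1 GHz clock, switching probability 0.1, 1 nF of switched capacitance, 1 V supply) are hypothetical numbers chosen only to make the ratios easy to read.

```python
def dynamic_power(f_hz, alpha, c_farad, v_volt):
    """Switching power P = f * alpha * C * V**2, in watts."""
    return f_hz * alpha * c_farad * v_volt ** 2

# Hypothetical baseline: 1 GHz, alpha = 0.1, 1 nF switched capacitance, 1 V.
base = dynamic_power(1e9, 0.1, 1e-9, 1.0)
half_voltage = dynamic_power(1e9, 0.1, 1e-9, 0.5)
half_capacitance = dynamic_power(1e9, 0.1, 0.5e-9, 1.0)
half_switching = dynamic_power(0.5e9, 0.1, 1e-9, 1.0)

print(round(half_voltage / base, 3))      # 0.25
print(round(half_capacitance / base, 3))  # 0.5
print(round(half_switching / base, 3))    # 0.5
```

Because voltage enters as a square, halving V cuts power to a quarter, whereas halving C or the switching frequency only halves it.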

While lowering voltage is effective in reducing power, it has its limits, imposed by leakage.

Due to quantum effects, current leaks through the insulating gate oxide. Consequently, there is a limit on how much the oxide layer can be thinned. Scaling the transistor without thinning the oxide layer accordingly results in the gate not being able to completely turn off the transistor.

As a result, further reduction in voltage leads to increase in leakage, making leakage current the dominant component that increases overall power. The maximum power efficiency of today’s processors is achieved at an optimal supply voltage of around 0.45V.

To reduce leakage, materials, processes, and structures have been modified. For instance, gate control can be strengthened by wrapping the gate around a transistor built in 3D. Such a structure in the form of FinFET has achieved better than expected leakage reduction in the 7nm generation.

The Transition from General-Purpose to Specialized, from 2D to 3D

At room temperature, the theoretical operating voltage limit for cascaded CMOS gate connection is 0.036V. In other words, there is room for another order-of-magnitude reduction in voltage, which translates into another two orders-of-magnitude reduction in power.
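The headroom claimed above follows from the two voltages given in the text, assuming that power scales with the square of the supply voltage (the P=fαCV² relation) and that frequency and capacitance are held fixed. The variable names are illustrative.

```python
V_OPTIMAL = 0.45   # supply voltage of best power efficiency today [V], from the text
V_LIMIT = 0.036    # theoretical limit for cascaded CMOS gates [V], from the text

voltage_headroom = V_OPTIMAL / V_LIMIT   # how much further voltage could drop
power_headroom = voltage_headroom ** 2   # power is proportional to V**2

print(round(voltage_headroom, 1))  # 12.5, about one order of magnitude
print(round(power_headroom))       # 156, about two orders of magnitude
```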

Another way to improve power efficiency is to reduce capacitance. Compared to general-purpose chips such as CPU and GPU, specialized chips like ASIC (Application Specific Integrated Circuit) and SoC (System-on-Chip) can achieve 10-fold improvement in power efficiency by reducing capacitance through the elimination of unnecessary circuits.

Meanwhile, data movement can consume more power than computation. If data needs to be moved in and out of the chip, power consumption can be 3 orders-of-magnitude larger. Therefore, DRAM access which is required by the von Neumann architecture represents a bottleneck in power reduction.

To improve the chip data interface it is important to shift from a peripheral to an array structure. Increase in integration density inside the chip is proportional to the square of the scaling factor. On the other hand, when external I/Os are mainly placed along the perimeter of the chip, increase in I/O density is only linearly proportional to the scaling factor. As a result, increase in data communication performance cannot keep up with the demand of internal processing. An effective solution is to stack chips on each other so they can be connected across their entire surface. In other words, power efficiency can be greatly improved by moving chip integration from 2D to 3D.
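The mismatch between internal density and perimeter I/O can be made concrete with a small calculation. The scaling factor of 1.25 per generation and the five-generation horizon are assumptions for this sketch.

```python
def density_gains(s):
    """Relative gains for a linear scaling factor s (s > 1 means smaller features)."""
    return {
        "internal": s ** 2,       # devices fill the chip area
        "perimeter_io": s,        # I/Os along the chip edge
        "area_array_io": s ** 2,  # I/Os spread across the whole surface
    }

s = 1.25                  # hypothetical scaling factor per generation
one_generation = density_gains(s)
gap_per_generation = one_generation["internal"] / one_generation["perimeter_io"]
gap_after_5 = gap_per_generation ** 5

print(one_generation["internal"])      # 1.5625
print(one_generation["perimeter_io"])  # 1.25
print(round(gap_after_5, 2))           # 3.05
```

With perimeter I/O the gap widens by the factor s in every generation, while an area array keeps pace with internal density, which is the argument for stacked, surface-connected chips.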

This illustrates how the slowing of Moore’s Law offers more and more opportunities for the adoption of disruptive technologies.

Chapter -3-

Scaling Scenario

The Astounding Power of Exponential Growth


2020.04.18

The Ideal Scaling Scenario

The fundamental principle driving the evolution of the IC is scaling, which is the miniaturization of semiconductor devices. It improves chip performance while lowering manufacturing cost by increasing integration density.

Integration density has been increasing 4-fold every 3 years for DRAM, and 2-fold every 2 years for processors. Such growth follows the rule of thumb that is widely known as Moore’s Law.

The manufacturing cost of a chip is computed by dividing the cost to manufacture a wafer by the number of good chips on the wafer.

Device scaling is achieved through improvement to both lithography and process technology. At the same time, wafer size is increased while manufacturing technology is improved to increase yield, resulting in an increase in the number of good chips per wafer.

In the last 50 years, devices shrank by 20% while chip size increased by 14% every two years. Together they resulted in a doubling (= 1.14²/0.8²) of the number of devices integrated per chip.
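The doubling can be verified directly. Both percentages are linear dimensions, so each enters the device count as a square; the variable names are illustrative.

```python
device_shrink = 0.80   # device linear dimension after a 20% shrink
chip_growth = 1.14     # chip linear dimension after a 14% increase

# Devices per chip scale with chip area divided by device area.
devices_per_chip_gain = chip_growth ** 2 / device_shrink ** 2

print(round(devices_per_chip_gain, 2))  # 2.03
```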

For DRAM, additional efforts including adoption of 3D device structure and circuit improvement led to a total of 4-fold increase in integration density every 3 years. Nevertheless, such efforts are approaching their limits. As a result, DRAM scaling is expected to stop in the near future.

Let’s take a closer look at how performance scales. If we reduce the supply voltage V [V] by the same factor of 1/α as is used in shrinking the device dimension x [m] (a 20% shrink corresponds to α=1.25), the electric field inside the transistor [V/m] remains unchanged. Since a transistor operates by the electric field effect, this “constant electric field scaling” ensures that the transistor performs the same way before and after scaling.

Under this scaling scenario, the current I [A] flowing through the transistor and its capacitance C [F] are also reduced by a factor of 1/α, as explained next.

Since current I is equal to the rate of charge flow [C/s], it is computed by multiplying two parameters - the density of charge across the depth of the channel [C/m] induced by the electric field of the gate voltage, and the speed [m/s] with which charge is driven through the channel by the electric field between drain and source. The charge density in turn is equal to channel capacitance associated with the gate multiplied by the gate voltage [V], where channel capacitance is determined from channel width [m] ÷ gate oxide thickness [m]. Meanwhile, charge moves across the channel with speed determined from the electric field between drain and source, which is equal to drain-to-source voltage [V] ÷ channel length [m].

As a result, current I is proportional to V²/x and hence it scales down by a factor of 1/α. On the other hand, capacitance C is computed from area ÷ thickness and is therefore proportional to x. Consequently, it also scales down by a factor of 1/α.

In summary, each of voltage V [V], current I [A], and capacitance C [F] scales down by the same factor of 1/α. Resistance R (= V/I) therefore remains unchanged, and so circuit delay determined from its RC time constant decreases by a factor of 1/α. The fact that RC has a dimension of time can also be derived by combining Q = CV and Q = It (where t = time) to solve for t = Q/I = CV/I = RC.

Now if we compute power density [W/mm²] by multiplying voltage with current and then dividing by area, we can conclude that it remains unchanged through scaling. It may seem that increasing integration density would make it difficult for heat to dissipate. But since there is no change to power density, there is no heat dissipation problem. This is truly an ideal scenario.
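The proportionalities above can be collected into a small sketch that returns each quantity relative to its value before scaling. It uses only the relations stated in the text (current proportional to V²/x, capacitance proportional to x); the function and key names are illustrative.

```python
def constant_field_scaling(a):
    """Relative values after one scaling step with factor a (> 1)."""
    x = 1 / a                        # device dimension shrinks by 1/a
    v = 1 / a                        # supply voltage shrinks by the same factor
    i = v ** 2 / x                   # current is proportional to V**2 / x
    c = x                            # capacitance is proportional to x
    r = v / i                        # resistance R = V / I
    delay = r * c                    # RC time constant
    power_density = v * i / x ** 2   # power (V * I) divided by area
    return {"V": v, "I": i, "C": c, "R": r,
            "delay": delay, "power_density": power_density}

scaled = constant_field_scaling(1.25)   # a 20% shrink
print({name: round(value, 3) for name, value in scaled.items()})
# {'V': 0.8, 'I': 0.8, 'C': 0.8, 'R': 1.0, 'delay': 0.8, 'power_density': 1.0}
```

Voltage, current, capacitance and delay all fall by 1/α, while resistance and power density stay at 1.0.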

The Reality and Its Consequence

However, in reality scaling did not follow this ideal scenario.

Microprocessor operating frequency increased 50-fold in 10 years. Scaling contributed a 13-fold increase, while the remaining 4-fold increase was achieved through architectural improvement.

This translates into a 1.6-fold increase in operating speed every 2 years, which exceeds the 1.2-fold increase expected from constant electric field scaling.

Device scaling was implemented without lowering voltage up until 1995. In other words, instead of “constant electric field” scaling, “constant voltage” scaling was the reality.

Under this scenario, current I increases by a factor of α. Combined with a C reduced by a factor of 1/α, circuit delay decreases by a factor of 1/α². As a result, there is additional increase in circuit speed. However, power density increases rapidly as a function of α³.
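A sketch of constant voltage scaling, using the same proportionalities as the ideal scenario (current proportional to V²/x, capacitance proportional to x) but with the voltage held fixed, reproduces these factors. The function and key names are illustrative.

```python
def constant_voltage_scaling(a):
    """Relative values after one scaling step with factor a (> 1), voltage unchanged."""
    x = 1 / a                        # device dimension shrinks by 1/a
    v = 1.0                          # supply voltage is not reduced
    i = v ** 2 / x                   # current is proportional to V**2 / x
    c = x                            # capacitance is proportional to x
    delay = (v / i) * c              # RC time constant with R = V / I
    power_density = v * i / x ** 2   # power (V * I) divided by area
    return {"I": i, "C": c, "delay": delay, "power_density": power_density}

a = 1.25
scaled = constant_voltage_scaling(a)
print(round(scaled["I"], 3))              # 1.25, a factor of a
print(round(scaled["delay"], 3))          # 0.64, a factor of 1/a**2
print(round(scaled["power_density"], 3))  # 1.953, a factor of a**3
```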

This actual scenario was driven by the desire to generate more revenues through offering even higher performance chips. The increase in power was not too much of a problem since it was very small to begin with.

Consequently, chip power increased 1000-fold in the 15 years between 1980 and 1995. As a result, the amount of heat dissipated in a unit area of the chip reached 30 times that of a hotplate used for cooking.

If the dissipated heat is not completely removed, internal device temperature will rise, resulting in degraded reliability. As scaling hits this power wall, it prevents further increase in integration density.

The power wall was the consequence of overly aggressive scaling.

Eventually, supply voltage started to decrease gradually from 1995.

Furthermore, additional, incremental efforts were made to conserve power, including aggressively turning off power to circuits not in use and lowering supply voltage when high performance is not required.

While these are obvious actions to conserve power in our daily life, it is not easy to identify waste in a large-scale integrated circuit that consists of more than 100 million transistors.

The theoretical lower limit on supply voltage is 0.036V at room temperature. Below this limit, CMOS circuit gain falls below 1, and it becomes impossible to cascade digital circuits.
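The text does not say where the 0.036V figure comes from. One expression often quoted for the minimum supply voltage of an ideal CMOS inverter is 2·ln2·kT/q; the sketch below assumes that expression and a room temperature of 300 K, and it happens to reproduce the value given here.

```python
import math

BOLTZMANN = 1.380649e-23             # Boltzmann constant k [J/K]
ELEMENTARY_CHARGE = 1.602176634e-19  # electron charge q [C]

def min_supply_voltage(temperature_kelvin):
    """Assumed ideal limit 2 * ln(2) * kT/q, in volts."""
    thermal_voltage = BOLTZMANN * temperature_kelvin / ELEMENTARY_CHARGE
    return 2 * math.log(2) * thermal_voltage

print(round(min_supply_voltage(300), 3))  # 0.036
```

Under this assumption the limit is proportional to absolute temperature, so it is set by thermal energy rather than by a particular process technology.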

However, in actuality it is difficult to lower voltage beyond 0.45V for multiple reasons, including transistor leakage current in off state, device variations, and noise.

From the 28nm generation onwards, increase in integration density is accompanied by increase in the number of unusable transistors. In other words, dark silicon (transistors that remain “dark” due to their power being shut off) is increasing rapidly. This means that even though additional functionality can be integrated, it is difficult to actually deliver the added performance.

Therefore, only designers who can improve power efficiency can achieve higher performance. Put differently, there is “no performance improvement without improvement in power efficiency.”

Besides lowering supply voltage, power efficiency can also be improved by reducing capacitance C. Therefore, technology to stack and integrate chips in 3D is key to the future of IC. In other words, we need to raise integration level from 2D to 3D. Since chip thickness is 3 orders-of-magnitude smaller than chip surface dimensions, the shift to 3D integration can dramatically shorten inter-chip connections, thereby reducing their capacitance.

The Power of Exponential Growth That Is Beyond Our Intuition

There is a story about an elderly man taking care of the koi fish in a pond. His responsibility was to periodically remove lotus leaves from the pond to ensure that it got enough oxygen for the fish. Since the leaves did not grow that rapidly, he thought it would be fine for him to be away for a week. But when he returned, to his surprise the entire pond was covered in lotus leaves.

This is a story that illustrates the characteristics of exponential growth. (The number of coronavirus infections grows in the same way.)

Our intuition predicts changes by linear extrapolation. We developed this and had it encoded in our DNA in ancient times when we had to protect ourselves from wild animals in the jungle which moved at constant speed. Even in modern days, we frequently predict the future by linear extrapolation of changes that have occurred.
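The gap between the two kinds of prediction can be sketched with the pond story. All numbers here are hypothetical: the leaves are assumed to cover 1% of the pond at the start and to double every day, and the linear forecast simply extends the first day's increase.

```python
def exponential_cover(start, days, daily_factor=2.0):
    """Fraction of the pond covered if coverage multiplies every day (capped at 1.0)."""
    return min(1.0, start * daily_factor ** days)

def linear_forecast(start, days, daily_factor=2.0):
    """Forecast that extends the first day's increase in a straight line."""
    first_day_increase = start * (daily_factor - 1)
    return min(1.0, start + first_day_increase * days)

start = 0.01   # hypothetical: 1% of the pond covered on day 0
print(round(linear_forecast(start, 7), 2))   # 0.08
print(exponential_cover(start, 7))           # 1.0
```

After one week the straight-line forecast expects 8% coverage, while doubling has already filled the pond.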

However, the world created by the IC grows at an exponential rate. AI is one such example. It is why AI adoption has been skyrocketing ever since it suddenly burst into existence.

The growth in data volume generated by the IC has also been exponential. The volume of internet traffic is quadrupling every year (Gilder's Law).

By the second half of the 21st Century, the number of transistors that can be integrated on a single chip may rival the total number of neurons in all human beings combined. Furthermore, building a giant brain by wirelessly connecting together all the chips in the world may not be just a dream anymore.

The world has undergone dramatic changes in the less than 100 years since the invention of the IC.

Fig. Since technology improves exponentially, it changes at a much faster pace than the linear prediction of human intuition.

Chapter -4-

The Transition from 2D to 3D

IC in The Next Half Century

The Connection Problem in Large-Scale Systems

The integrated circuit (IC) was invented against the backdrop of a connection problem - the challenge of wiring - in large-scale systems.

ENIAC, one of the earliest computers developed in 1946, contained about 5 million handmade connections. As systems scaled up, the number of connections exploded.

This problem was known as the tyranny of numbers. After trying to solve the problem using various approaches, the definitive solution that emerged was the IC.

Since then, the IC underwent exponential improvement following Moore’s Law, with computer performance increasing dramatically in lockstep.

However, due to large data movement between memory and processor, inter-chip communication has become the main reason for the erosion of energy efficiency. This is known as the von Neumann bottleneck.

Combined with the explosion of data, this has put the industry in a situation where there is “no computing performance improvement without improvement in energy efficiency.” The problem has persisted to this day.

CMOS circuit energy consumption is proportional to its loading capacitance. The loading capacitance of computational circuits can be reduced through device scaling.

However, since data movement involves charging and discharging the capacitance of the entire communication path, even if devices are scaled down, energy consumption cannot be reduced without shortening the communication distance.

Data movement can consume a lot more energy than computation.

For instance, compared to 64-bit data processing, moving the data to the edge of the chip and out of the chip to DRAM require 50 and 200 times more energy respectively.

Another reason that has contributed to the large energy consumption of inter-chip communication is the aggressive increase in transmission data rates. This was due to the placement of communication channels along the chip perimeter only, making it difficult to increase the number of channels.

IC computing performance has been increasing 70% annually, as a result of a 15% transistor speedup combined with a 49% increase in functional integration density.

To take full advantage of this chip functional improvement, the speed of data movement in and out of the chip must increase accordingly.

Using the rule of thumb known as Rent’s Rule, which estimates the number of communication channels (chip I/Os) required to keep up with an increase in the scale of computing logic, the inter-chip communication speed needs to increase at a 44% annual rate.

However, device scaling can yield only a 28% annual increase in inter-chip communication speed. This is because transistors speed up by 15% while the number of I/Os increases only 11% since I/Os are placed only along the chip perimeter.
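The percentages above are compounded annual rates, so they multiply rather than add. A minimal sketch of the arithmetic, using only the figures quoted in the text (the function names are made up for this illustration):

```python
def combined_growth(*annual_rates: float) -> float:
    """Compound several independent annual growth rates into one rate."""
    factor = 1.0
    for rate in annual_rates:
        factor *= 1.0 + rate
    return factor - 1.0


def relative_shortfall(required_rate: float, achieved_rate: float, years: int) -> float:
    """How many times larger the required speed is than the achieved one after `years`."""
    return ((1.0 + required_rate) / (1.0 + achieved_rate)) ** years


compute_growth = combined_growth(0.15, 0.49)  # transistor speed x integration density
io_growth = combined_growth(0.15, 0.11)       # transistor speed x perimeter I/O count
REQUIRED_IO_GROWTH = 0.44                     # requirement quoted in the text

print(f"computing performance:   {compute_growth:.0%} per year")  # about 71%
print(f"scaled inter-chip speed: {io_growth:.0%} per year")       # about 28%
gap = relative_shortfall(REQUIRED_IO_GROWTH, io_growth, 10)
print(f"shortfall after 10 years: {gap:.1f}x")
```

The first two lines reproduce the roughly 70% and 28% figures; the last line shows how quickly the gap between the required and the scaled communication speed widens if nothing else is done.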

Even if I/Os are placed across the entire surface of the chip, signal trace congestion will make it difficult to route all the signals out of the chip on the printed circuit board without using a lot of board layers.

As a result, innovative circuit technologies have been used to raise inter-chip data rates to close the gap. Unfortunately, as is generally true, this aggressive push of transistor performance to its limits requires consumption of a lot of energy.

The energy consumed by inter-chip communication started to rise with the 130nm generation (around year 2000). We are now approaching the limits of how much more we can increase communication speed.

We can thus conclude that in order to increase the energy efficiency of computing, we need to shorten the distance between memory and processor, and to increase the number of connections so that data rates do not have to be pushed up too aggressively.

In other words, we should stack chips in 3D to minimize inter-chip distances and to allow the entire chip surface to be used for interconnections to enable a moderate data rate. This is the reason why we are transitioning from 2D to 3D chip integration.

Moving away from relying solely on on-chip integration, we have been evolving from 2D to 3D chip integration, which calls for a breakthrough solution to the connection problem.

TSV and Communication through Magnetic Coupling

Against this backdrop, the research and development of TSV (Through Silicon Via) for vertically connecting stacked chips was started in 1990. Compared to conventional processing which requires penetration of only a few microns from the chip surface, TSV processing requires penetration of a few tens of microns, and is therefore challenging.

Furthermore, solder connections are difficult to shrink, and there are stress and reliability problems arising from the mismatch in coefficient of thermal expansion between materials.

To this day TSV incurs high cost and has low reliability. More than a quarter century later there is still no solution on the horizon.

The alternative is to use magnetic coupling for inter-chip communication (TCI; ThruChip Interface) instead. With TCI, the digital signal to be transmitted controls the direction of current flow in coils formed on the interconnect layers of the transmitting chip, which sets the direction of the associated magnetic field. The field in turn determines the polarity of the signal induced in matching coils on the receiving chip, which is detected and used to regenerate the original digital signal in the receiver. In other words, communication is achieved through magnetic coupling between coil pairs.

Since all materials used in semiconductor chips have a relative permeability of 1, magnetic fields can readily penetrate through chips. Furthermore, since CMOS circuits operate on the electric field effect, there is no need to worry about interference.

However, TCI’s biggest merit is that its electrical connections are formed using wafer process and standard CMOS circuits, as opposed to the mechanical connections of TSV which are formed in packaging and assembly process.

Since TCI is implemented using digital circuits without modification to the chip manufacturing process, it can be realized at low cost by anyone. While TSV may increase DRAM cost by 50%, with TCI, the cost adder can be kept below 10%.

Moreover, as the chip thickness is reduced, the cost-performance of TCI can be dramatically improved.

For instance, if the chip process is scaled by 1/2, and in addition the chip thickness is reduced by 1/2, TCI’s data rate can be increased 8-fold while its energy consumption is reduced to 1/8.

However, TCI cannot be used to deliver power. The current solution is to use TSV for power delivery and TCI for signal connections. You may wonder if it is not easier to just use TSV for signal connections as well. The reason is that TSV failures are mainly open failures. As a result, TSV failures can be solved with redundant connections. While it is difficult to add redundancy to signal connections, redundancy is built into the power delivery network since it consists of highly parallel connections.
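The redundancy argument can be illustrated with a small calculation. The sketch below assumes that each via fails open independently with the same probability; both that independence assumption and the numerical failure rate are hypothetical and chosen only for illustration, not measured TSV data.

```python
def open_probability(p_via_open: float, parallel_vias: int) -> float:
    """Probability that every via in a parallel group is open.

    Assumes each via fails open independently with probability p_via_open,
    which is a simplification made for this illustration.
    """
    if parallel_vias < 1:
        raise ValueError("a connection needs at least one via")
    return p_via_open ** parallel_vias


P_VIA_OPEN = 1e-3  # hypothetical per-via open-failure probability

for vias in (1, 2, 4):
    p = open_probability(P_VIA_OPEN, vias)
    print(f"{vias} via(s) in parallel -> connection open with probability {p:.1e}")
```

A signal that relies on a single via inherits that via's failure rate directly, whereas a power net that is already built from many parallel vias stays connected unless all of them fail.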

Research and development of a new power delivery technology using highly doped regions of impurities (HDSV; Highly Doped Silicon Via) has also commenced as a replacement for TSV.

The transition from 2D to 3D integration results in higher chip energy efficiency. However, as chips are stacked on top of each other, power density of the stack rises. Consequently, it is necessary to further improve power efficiency to avoid generating more heat in the stack than can be removed.

Fig. In-package 3D integration of memory and processor improves energy efficiency.

Time is Right for Disruptive Technology Adoption

In startup jargon, there is a valley of death stretched between the research and adoption stages. It is difficult for a disruptive technology to cross the valley to achieve adoption.

For a connection technology, both sides of the connection must agree to adopt the technology.

When introduced to a processor company, even if TCI generates interest, the next question that arises is when memory companies will adopt it.

When the fact that the processor company shows a lot of interest in TCI is communicated to the memory companies, they respond that until all their major customers ask for it, it is difficult to adopt the technology since it requires major investment. Given that memory is a commodity business, the memory companies generally tend to be conservative.

It is hard to resolve this chicken-and-egg problem.

However, the challenge of no computing performance improvement without improvement in energy efficiency has necessitated the transition into a new era of 3D integration, which makes the time right for the adoption of disruptive technologies (which we would like to call revolutionary technologies).

Nevertheless, it remains difficult to motivate the memory companies. Therefore, a better approach is to start by stacking SRAM chips in 3D to offer memory capacity to processors which can rival that of DRAM. Since SRAM can be developed by the processor company, the adoption decision can be made by the processor company alone. Meanwhile, DRAM scaling appears to be nearing its end.

Chapter -5-

Connecting The Brain to The Internet

Internet of Brains


2020.04.26  Format for Print

A Magical Image Spotted in Cambridge

The year was 2019. Spring came late to Cambridge, Massachusetts. Even though it was already May, people were still wearing thick coats.

As dusk fell, the beauty of the Harvard campus became even more prominent. The crowd of students crossing the freshly green lawn was getting thin, while orange incandescent light from dormitories started to permeate the air. As darkness set in on the campus, I felt the weight of its long history like a curtain coming down on me.

It is a place where what humankind has learned is passed on to the next generation, and where new knowledge is born. Stimulated by the atmosphere, I felt an urge to study there. With deteriorating eyesight, I need to make an effort even just to read a book. Yet I felt that it would be fulfilling to stay in school and keep learning for my entire life.

But if that were allowed, school campuses would be overflowing with senior citizens. “If only I had been able to visit this place when I was a little younger ...” was the sentiment that overwhelmed me.

The next day was for meetings on AI chip research. I needed to go to MIT in the morning and Harvard in the afternoon. From The Charles Hotel in Harvard Square it takes only 15 minutes by the Red Line subway to go from the nearby Harvard Station to Kendall Station, where the MIT Media Lab is located. But instead, I decided to take a walk.

The Charles River was not visible. Still, I walked leisurely while the boat racing scene from the movie The Social Network ran through my head.

Unfortunately, I was not able to find anything interesting along the way as I had hoped. After walking for almost an hour and just when I was getting tired, I finally came to the intersection between Main Street and Vassar Street close to my destination.

At that moment, a magical illustration suddenly jumped out at me.

It was being projected on a large screen set up in the entrance hall of a contemporary building that was radiating blue light due to reflections off its glass windows. “McGovern Institute for Brain Research” was written on the building.

I entered the building while looking up at a monument in the shape of a twisted, giant tree and sank myself into a soft couch. There was a security gate in the back. Young researchers were hurrying in and out while holding a smartphone or a cup of coffee in one hand.

These were talented people gathered from around the world. You could see the spirit and self-confidence in their eyes, which is common to people on the forefront of the world’s most advanced research.

What was being shown on a 100” display was a slideshow that introduced the Institute’s research.

“This is it!”

The magical image that caught my attention was being projected there.

It resembled both an astronomical photograph and an abstract painting. It was a microcosm composed of numerous curled threads that radiated rainbowlike color in the dark. It looked like sperm in formation charging into an egg.

From its title of “New Image of The Brain,” I realized that it was an illustration of the brain’s neural network. It was a blueprint of the brain which you could freely alter by changing your viewpoint in 3D.

“Prof. Boyden’s lab has developed a technology for imaging the interior of a brain cell including its protein and RNA.”

Then came the next slide which depicted a scientist holding a preparation (a glass slide for microscope specimen) in his hand that was radiating a phosphorescent light. It was entitled “Expansion Microscopy.”

Photo credit: McGovern Institute for Brain Research at MIT in 2017

Expansion Microscopy and The Opposite Approach

“Expansion microscopy?”

“Does that mean being able to enlarge cells and tissue? Are we talking about Alice in Wonderland syndrome?”

Googling “expansion microscopy,” I found an article entitled “Blown-up brains reveal nanoscale details” published in the Jan. 2015 issue of Nature (vol. 517, issue 7534).

The lead sentence read “Material used in diaper absorbent can make brain tissue bigger and enable ordinary microscopes to resolve features down to 60 nanometres.”

Details of the technique were explained in the text. First, protein of brain tissue is tagged with fluorescent molecules. Next, acrylate is infused into the brain tissue to be combined with the fluorescent molecule tags. After polymerization, the acrylate polymer forms a mesh within the brain tissue.

After decomposing the protein in the brain tissue, water is added to the remaining acrylate polymer. As the polymer absorbs water, it swells like a diaper, which causes the separation between fluorescent tags attached to the mesh to grow in all directions. As a result, the fluorescent tags which were previously too close to each other to be distinguishable under an optical microscope are now individually visible.

In other words, it is like copying the positions of brain tissue’s protein to a paper diaper and then swelling the diaper with water to enable them to be observed using an optical microscope. The rainbowlike illustration in front of me was a vivid 3D computer rendering of the image. It was an amazing visualization.

In the slide show, Prof. Ed Boyden asked, “If you want to see the brain better, what would you do? You can shrink the scientist, or you can enlarge the brain tissue.”

Of course, Prof. Boyden chose the latter.

But me, I would shrink the scientist!

This is when my imagination started to run wild. Let’s create a small microscope by integrating 1 million sensors onto a 100 μm² chip. Each sensor is thus 100 nm² in size.

If we can insert this chip into the brain tissue to observe it from a close distance, shouldn’t we be able to distinguish features that are 60 nm apart? By using many such chips to capture images and collecting the data wirelessly, shouldn’t we be able to recreate the overall picture? Letting my imagination continue to run free, I forgot about both time and my fatigue.
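The sensor-size estimate can be checked with a few lines of arithmetic. This sketch reads both figures as areas (100 square micrometres for the chip, 100 square nanometres per sensor) and assumes square sensors tiled edge to edge, which is an idealization introduced here.

```python
import math

CHIP_AREA_UM2 = 100.0      # chip area from the text, in square micrometres
SENSOR_COUNT = 1_000_000   # sensors integrated on the chip
NM2_PER_UM2 = 1_000_000    # (1000 nm per micrometre) squared
TARGET_FEATURE_NM = 60.0   # feature spacing we would like to distinguish

# Area available to each sensor, then the side length of a square sensor.
sensor_area_nm2 = CHIP_AREA_UM2 * NM2_PER_UM2 / SENSOR_COUNT
sensor_pitch_nm = math.sqrt(sensor_area_nm2)

print(f"area per sensor: {sensor_area_nm2:.0f} nm^2")
print(f"sensor pitch:    {sensor_pitch_nm:.0f} nm")
print(f"sensors per {TARGET_FEATURE_NM:.0f} nm: {TARGET_FEATURE_NM / sensor_pitch_nm:.0f}")
```

With these numbers the sensor pitch comes out well below the 60 nm spacing, which is what makes the idea plausible at least on paper.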

At the time I was working on an ACCEL project sponsored by the Japan Science and Technology Agency (JST). The original focus of the research was to improve computer power efficiency. But with the AI boom, I had started to contemplate creating mobile AI “eBrains.”

If we can embed small chips into the brain, we can connect the brain to the internet. We can then realize an Internet of Brains (IoB) to succeed Internet of Things (IoT). Further down the road, maybe we can move from the brain to the cell and create an Internet of Cells (IoC).

Maybe not. Before that, we should create a human intranet where sensors and actuators worn by a person are connected to their brain. A computer that is integrated into the human brain and body should be able to expand our ability including our senses and immunity and support the social life of senior citizens.

This has become my dream.

If The Brain Is Connected to The Internet ...

The brain and the computer are closely connected.

The brain created society and gave birth to the mind. Humans acquired language and logic in order to recognize and communicate their intentions.

Furthermore, as a tool for expanding our cognitive capacity, we developed mathematics. Eventually, high level abstraction of mathematics allowed us to compute in our head instead of using our body (hands and fingers) which led to the birth of the computer.

Like how Dr. Ichiro Tsuda expressed the universality of consciousness by saying “the mind is all mathematics,” or as Mr. Masao Morita described in his book The Body That Created Mathematics, computer and AI were born as a result of abstraction.

If we can create eBrains which include both left and right halves just like the human brain, wouldn’t the left brain be able to abstract images and sound into words in its association area after they have been detected by the right brain?

If the brain is connected to the internet, and its ability to innovate grows as the number of people connected increases, will ideas reproduce so fast as to completely overwhelm the earth, like the way Matt Ridley described in his book The Rational Optimist?

And then will the aggregation of agents give birth to The Society of Mind as Marvin Minsky advocated? Consciousness and art were born after words. Will computers evolve in the same way as human beings?

(Or will Mr. Takeshi Yoro laugh it away by calling it “silly?”)

Chapter -6-

The Post Covid-19 Semiconductor

From A Necessity of Industry to The Brain Cell of Society


2020.04.27  Format for Print

The Tremendous Energy Consumption of A Remote Society

An American friend of mine built his home in the middle of a forest and works there remotely. Since he is an EDA developer, he can work from anywhere as long as he has a computer and internet access. So I thought. However, ...

Covid-19 has swung open the door to a remote society. Online meetings work better than we thought and are great for discussions involving 3 or more people.

Even international conferences with as many as 3000 participants have gone online.

Back in 2005, in my opening speech as program chair at the reception of an international symposium held in Kyoto, Japan, I said the following:

“Imagine this. In the future, we may hold an international conference on the internet. Everything is done online – research paper presentations, panel discussions, even hallway conversations. And everyone can participate from home. ‘What about the banquet?’ you ask. We get pizza from delivery and beer from the fridge ... That doesn’t sound too exciting, does it? Instead, tonight let’s enjoy Kyoto cuisine and Japanese sake while exchanging thoughts with old friends. Cheers!”

But now organizers of international conferences are probably worried that everything in Pandora’s box is out, since even drinking parties have been taken online.

The surge of big data and the advance in AI processing are fueling a digital transformation and enabling data-driven services, which in turn are leading to explosive growth of energy consumption in society.

It is forecast that by 2030 IT-related equipment alone will consume close to twice the total electrical power generated today, and that by 2050 its consumption will rise further, to 200 times today’s total.

One of the reasons is the exponential growth of data communication. While total annual IP traffic in 2016 was 4.7 ZB, it is expected to grow 4 times to 17 ZB by 2030 and 4000 times to 20,200 ZB by 2050.
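To see what "exponential growth" means year by year, the traffic figures above can be converted into growth multiples and equivalent annual rates. This is a minimal sketch using only the three data points quoted in the text; the helper names are made up for this illustration.

```python
# Annual IP traffic in zettabytes, as quoted in the text.
TRAFFIC_ZB = {2016: 4.7, 2030: 17.0, 2050: 20_200.0}


def growth_multiple(start_year: int, end_year: int) -> float:
    """How many times larger the traffic is in end_year than in start_year."""
    return TRAFFIC_ZB[end_year] / TRAFFIC_ZB[start_year]


def annual_growth_rate(start_year: int, end_year: int) -> float:
    """Constant yearly growth rate that would produce the same multiple."""
    years = end_year - start_year
    return growth_multiple(start_year, end_year) ** (1.0 / years) - 1.0


for end_year in (2030, 2050):
    multiple = growth_multiple(2016, end_year)
    rate = annual_growth_rate(2016, end_year)
    print(f"2016 -> {end_year}: x{multiple:,.1f}, equivalent to {rate:.1%} per year")
```

The rounded multiples match the "4 times" and "4000 times" in the text.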

Adding to that is the increase in sophistication of AI processing. It requires a tremendous amount of computation to extract meaning hidden in data to drive services that deliver value to society.

As a result, we need to be able to improve the energy efficiency of communication equipment and computers by orders-of-magnitude in order to sustain society’s growth.

It is semiconductor that is causing the rapid growth in energy consumption. It is also semiconductor which holds the key to solving the problem.

Semiconductor: From Being A Necessity of Industry to The Brain Cell of Society

In 2019 the world made a total of 1.9 trillion semiconductor chips.

The market breakdown by sector is as follows: manufacturing 15%, healthcare 15%, insurance 11%, banking and securities 10%, wholesale and retail 8%, computers 8%, government 7%, transportation 6%, public utilities 5%, real estate and business services 4%, agriculture 4%, communication 3%, others 4%.

You may be surprised to learn that the communication market is still small.

Meanwhile, it is 5G, the next generation communication technology, which is expected to drive the demand for the next generation (5nm) semiconductors.

Both 5G and what follows including Beyond 5G use high frequency bands. The higher the frequency, the more the signal travels in a straight line, but the shorter the distance it can travel. Consequently, higher frequency requires more base stations.

In addition, since low-latency, high-value services are desired, base stations are expected to perform sophisticated data processing. That is why 5G is anticipated to drive the demand for next generation semiconductors.

Going forward, services are expected to create a large semiconductor market in addition to IoT, digital medical and healthcare including remote medical services, and mobility. Together they will form the nervous system of society.

In other words, semiconductor is evolving from being a necessity of industry to the brain cell of society, making it more and more a global commons - a resource shared by the world.

The only way to solve society’s energy problem is to increase semiconductor’s energy efficiency. Compared to general-purpose chips, specialized chips can achieve a power efficiency that is two orders-of-magnitude better.

However, since the development cost of specialized chips is high, not everyone can afford to develop them.

It is d.lab’s goal to reduce the development cost of specialized chips to 1/10 in order to enable anyone with innovative system ideas to design their own specialized chips, as well as to reduce energy consumption to 1/10 by adopting the most advanced semiconductor technology.

To enable semiconductor to shift from being a necessity of industry to the brain cell of society, it is also necessary to transform the structure of the industry from the capital-intensive industry of the last half century to a knowledge-intensive industry for the next half century.

To Create A Digital Civilization ...

Yuval Noah Harari, the author of Sapiens: A Brief History of Humankind, recently wrote an article where he warns that technology can become a spy who carries out “under-the-skin surveillance” by monitoring our biological conditions.

In combating the spread of covid-19, surveillance is becoming more and more widespread in society. The impact of technology on society has become exceedingly large. What is that going to do to our civilization? Some even believe that we have come to a critical juncture.

Technology can implement any human intelligence. Therefore, while semiconductor can threaten our security and privacy, it can also be the solution.

However, offering sophisticated security and privacy protection will naturally increase semiconductor’s energy consumption. Therefore, once again it brings us back to semiconductor’s energy problem.

Further ahead, there is also the problem of the mind.

Digital excels at handling logic, while analog excels at handling emotions. And now we are starting to use digital to pursue happiness.

We cannot blindly promote the connection of our brain to the internet without first putting the necessary foundation in place, which includes sensors and actuators that convert our five senses to digital and vice versa, control technology that feeds back our senses, engineering of value exchange technology (such as blockchain), and a legal system that prevents technology from making society dangerous.

A long, long time ago, the brain created society and gave birth to the mind. Humans came to recognize their intentions and created languages to convey them, while also developing mathematics to expand their cognitive ability through logical thinking. Eventually mathematics evolved into a system of abstract symbols that exceeded our subjective intuition and overflowed from our brain to create the computer. The computer then gave birth to chips, which in turn enabled downsizing of the computer as a result of exponential growth through scaling. At some point the computer will become so tiny that it can be put back into our body.

Fig. Chip power efficiency improved by 3 orders-of-magnitude in 20 years and is expected to approach that of the brain by 2030.

Chapter -7-

Agile Development

Chip Development Model in The AI Age


2020.05.11  Format for Print

From Waterfall to Agile Development Model

To be agile is to be quick and adaptable.

The mainstream model for system and software development has been a waterfall model. It starts with writing specifications and development plans, followed by top-down design and implementation based on the plans. Because development proceeds sequentially where you are not supposed to go back to preceding steps, it fits the analogy of a waterfall where water flows only downwards and never returns from downstream to upstream.

Agile development is the opposite. It is a bottom-up approach where the design is broken into small units each of which is developed through iterations of implementation and test. The model first appeared in 2001. Since it generally results in shorter development time compared to the waterfall model, it is called agile development.

Another advantage of agile development is the ability to change or add to the specifications midway through the development. However, because of that, it is easy to deviate from the original direction of development, and it is difficult to see the big picture and manage the schedule, which are its shortcomings.

Given that changes will likely occur to the specifications and design midway through development, it suffices to start with just rough instead of detailed specifications. By establishing the resilience to flexibly adjust to changes as they occur during development, you gain the ability to better address customer needs.

Once the rough specifications and plans are decided, the system is broken into small units. While development proceeds through planning, design, implementation, and test, releases are made iteratively once every 1 to 4 weeks.

Chip design, however, is top-down.

Specifications written in words and diagrams are translated into a hardware description language such as Verilog, while clock cycle-based processing sequences are coded into RTL. This is followed by logic design, circuit design, and layout, which is finally made into geometric patterns in photomasks. In this manner, chip design is completed as a series of conversion steps with decreasing level of abstraction.

The design work is shared between the system maker and the chip design house. As the user of the chip, the system maker is responsible for the frontend design, which includes everything up to RTL design, while the chip design house is responsible for the backend, which starts from logic design.

Fig. System makers develop chips in an agile manner like how software is written.

To improve design productivity, computer automation was introduced into the design process in reverse order, starting from downstream where a lot more information is involved. It was introduced to mask design in the 1970s and layout design in the 1980s. Logic design was automated in the 1990s. High-level synthesis, which automates system design, entered the research stage around 1990 and saw partial, practical application around 2010.

However, the common way to improve system design productivity is the reuse of RTL. Design IPs that implement general-purpose functionalities such as processor cores and memory controllers are widely available. Furthermore, the RTL of specialized circuits is usually not created from scratch, but by reusing previously designed RTL.

Even with such practices, large-scale chips today, such as Apple’s A12 processor which integrates 6.9 billion transistors, require several hundred engineers spending several years to develop, costing hundreds of millions of dollars.

As integration density continues to rise exponentially, the conventional development approach is reaching its limits.

Add to that the emergence of AI. AI is evolving rapidly. Even technology from last year is now inferior. It is simply too risky to engage in chip development that takes years and costs hundreds of millions of dollars.

Agile Chip Development

We believe that agile development can be applied to system design and verification by chip users.

A system can be developed by dividing it into small units which are described using languages such as C/C++ or Python. The RTL of these units is automatically created using high-level synthesis tools, and the units are then assembled to generate the system from the bottom up.

By enabling agile development of chips in a way similar to how software is written, development time and costs for system makers can be greatly reduced, and their risks significantly lowered.

Using high-level synthesis tools, RTL of various combinations of circuit performance and layout area can be created in an instant. Therefore, these tools enable search for the optimal RTL by trading off between performance and area, which is then implemented in FPGA or verified using ASIC simulation tools to support release iterations within short timeframes.
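The search described above can be pictured as picking points from a performance-versus-area trade-off curve. The following is a minimal sketch of such a selection step, not the interface of any particular high-level synthesis tool: the candidate names, latency figures, and area figures are all hypothetical stand-ins for what a tool might report.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    name: str
    latency_cycles: int  # lower is better (higher performance)
    area_units: int      # lower is better (smaller layout)


def pareto_front(candidates: Sequence[Candidate]) -> list[Candidate]:
    """Keep only candidates that no other candidate beats on both axes."""
    front = []
    for c in candidates:
        dominated = any(
            o.latency_cycles <= c.latency_cycles
            and o.area_units <= c.area_units
            and (o.latency_cycles < c.latency_cycles or o.area_units < c.area_units)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.latency_cycles)


def smallest_meeting_latency(candidates: Sequence[Candidate],
                             max_latency: int) -> Optional[Candidate]:
    """Smallest-area candidate that still meets the latency target."""
    feasible = [c for c in pareto_front(candidates) if c.latency_cycles <= max_latency]
    return min(feasible, key=lambda c: c.area_units) if feasible else None


# Hypothetical results for one unit synthesized with different settings.
CANDIDATES = [
    Candidate("unroll_1", 64, 10),
    Candidate("unroll_2", 34, 18),
    Candidate("unroll_4", 20, 35),
    Candidate("unroll_4_wide", 22, 40),  # slower and larger than unroll_4
    Candidate("unroll_8", 12, 70),
]

print([c.name for c in pareto_front(CANDIDATES)])
print(smallest_meeting_latency(CANDIDATES, max_latency=25))
```

Because each candidate is cheap to generate, this kind of selection can be rerun in every release iteration as the latency target or area budget changes.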

In the conventional approach, after studying the specifications thoroughly, the designer creates block diagrams and meticulously estimates the performance of each block and the congestion of signal connections before starting the design. However, since it is difficult to quantify performance and area in the early stage of the design, the designer must rely on their intuition and experience. More importantly, the task quickly becomes overwhelming as system complexity increases.

Using an agile development model, the system is divided into small units, and the functional block of each unit is automatically created, verified, and released in iterations using the computer.

The bottom-up assembly of the released blocks can also be computer-automated. This is because by using high-level synthesis, it is possible to distribute the control mechanism into individual functional blocks, such that the overall control mechanism is established when the functional blocks are assembled together.

In other words, large-scale chips can be created by assembling individual functional blocks, much like parallel distributed programs in software.
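As a software analogy only, the sketch below models functional blocks as Python generators. Each block carries its own control (it decides when to consume an input and when to emit an output), and the system-level behavior emerges simply from chaining the blocks, with no central controller. The block names and the `assemble` helper are hypothetical; this is not the output of a high-level synthesis tool.

```python
from typing import Callable, Iterable, Iterator

# A block consumes a stream of samples and produces a stream of samples.
Block = Callable[[Iterable[int]], Iterator[int]]


def scale(factor: int) -> Block:
    """Block that multiplies every sample by a fixed factor."""
    def block(stream: Iterable[int]) -> Iterator[int]:
        for sample in stream:
            yield sample * factor
    return block


def sum_pairs() -> Block:
    """Block with local state: emits one output for every two inputs."""
    def block(stream: Iterable[int]) -> Iterator[int]:
        pending = None
        for sample in stream:
            if pending is None:
                pending = sample          # wait for the second input
            else:
                yield pending + sample    # emit, then reset local state
                pending = None
    return block


def assemble(*blocks: Block) -> Block:
    """Bottom-up assembly: connect blocks output-to-input in order."""
    def system(stream: Iterable[int]) -> Iterator[int]:
        for block in blocks:
            stream = block(stream)
        return iter(stream)
    return system


system = assemble(scale(2), sum_pairs(), scale(3))
result = list(system([1, 2, 3, 4]))
print(result)  # [18, 42]
```

Note that `sum_pairs` consumes two inputs per output without any other block knowing about it; this is the sense in which control is distributed into the blocks.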

Using C/C++ or Python instead of RTL description can reduce the number of lines of code to roughly 1/100. As a result, the effort and time designers spend on tasks such as review and simulation can be reduced by orders of magnitude.

Since structure of the circuit in high-level description is expressed in terms of parameters, it enables a wider spectrum of implementation options, as well as a good grasp beforehand of realizable implementation ranges of functionality, performance, and interface protocol.

In addition, by coupling the verification model to the design description such that changes to the latter are automatically reflected in the former, it not only makes it easy to confirm the extent of impact of the changes, but also enables effective assembly of the verification environment in conjunction with the design. In other words, agile development can straddle both design and verification.

Using this approach, functional blocks are connected using dedicated control circuits, making it possible to increase energy efficiency. In the conventional approach, where IPs are connected to the CPU bus to allow the CPU to perform centralized control, it is difficult to realize high performance in complex processing such as that required for 5G (communication), H.265 (video compression) or WPA2 (encryption).

Furthermore, in the conventional approach, where the RTL is designed with the intention of reuse in other projects, the circuit tends to be designed for higher performance than is needed. By contrast, when high-level synthesis is used, circuits can be automatically generated each time with performance and area optimized for the particular usage.

Divide and Conquer

The first thing I learned about CAD at UC Berkeley was “divide and conquer.” The idea is that by dividing a problem into small problems that are always of the same size, you can find the solution to problems of any complexity by assembling the solutions of the individual, small problems. A lot of computer algorithms are designed using this idea.

Breakdown of the problem, solving of the small problems, and assembly of the solutions are carried out in a recursive manner. The result is a dramatic reduction in computation time.

For instance, if we compare the computational complexity of sorting algorithms, quick sort, which employs “divide and conquer,” reduces the complexity to O(n log₂ n) on average, compared to O(n²) for bubble sort. When n is 1000, this corresponds to a reduction from 1,000,000 to 9,966, or roughly 1/100. Similarly, in search, binary search reduces search time from the O(n) of linear search to O(log₂ n).
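The recursive pattern of breaking down, solving, and assembling can be seen in a few lines of code. The sketch below uses merge sort rather than quick sort, swapped in because its divide, solve, and assemble steps are explicit and its O(n log₂ n) bound holds in every case; it also includes binary search and reproduces the step counts quoted above.

```python
import math


def merge_sort(values: list[int]) -> list[int]:
    """Divide the list in half, sort each half, then merge the results."""
    if len(values) <= 1:
        return list(values)              # a small problem is solved directly
    mid = len(values) // 2
    left = merge_sort(values[:mid])      # divide and solve recursively
    right = merge_sort(values[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):   # assemble the two solutions
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def binary_search(sorted_values: list[int], target: int) -> int:
    """Return the index of target in sorted_values, or -1 if absent."""
    lo, hi = 0, len(sorted_values) - 1
    while lo <= hi:
        mid = (lo + hi) // 2             # halve the search range each step
        if sorted_values[mid] == target:
            return mid
        if sorted_values[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


n = 1000
quadratic_steps = n ** 2                               # 1,000,000
divide_and_conquer_steps = round(n * math.log2(n))     # 9,966
print(quadratic_steps, divide_and_conquer_steps)
print(merge_sort([5, 2, 9, 1, 5]))
```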

The AI Age calls for rapid iterations of trial and error. AI is used to analyze large volumes of data to allow it to build a model. Then the model needs to be implemented rapidly to enable additional data collection and analysis to improve the model. And the process is repeated. It is vital to be able to skillfully execute such trial and error.

It is imperative to create a chip development model suited for the AI Age that meets the conflicting constraints of staying agile while enabling large-scale design.

I learned a lot from China both regarding agile development and data collection. That reminds me of a foreign student from China who often said to me, “Professor, you are meticulously overprepared.”

Chapter -8-

Silicon Compiler

Making Chips The Way Software Is Written


2020.06.03  Format for Print

Silicon Compiler 1.0

A compiler is a piece of software that converts source code to object code. Because source code is written in a high-level language that resembles a natural language, it cannot be understood by the computer as-is. Therefore, a compiler is used to translate it into a machine language in the form of object code, which is binary code executable by the computer.
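The source-to-object conversion can be observed directly in Python, with one caveat: the standard CPython interpreter compiles source code into bytecode for its own virtual machine rather than into native machine code, so the sketch below is an analogy for the translation step and not an example of producing a true object file.

```python
import dis

SOURCE = "result = (2 + 3) * 7\n"

# Translate human-readable source text into a compiled code object.
code_object = compile(SOURCE, "<example>", "exec")

# The compiled form is a sequence of bytes, unreadable without a disassembler.
print(type(code_object.co_code))

# Disassemble to show the individual instructions the interpreter will run.
dis.dis(code_object)

# Execute the compiled code and read back the variable it defined.
namespace = {}
exec(code_object, namespace)
print(namespace["result"])  # 35
```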

Similarly, software that converts a hardware specification into a silicon chip is known as a silicon compiler. Software that converts from Verilog, a hardware description language, to GDS-II, a data format describing the mask layout, is one such example.

In 1979, Dave Johannsen from Caltech published a paper entitled “Bristle Blocks: A Silicon Compiler.” Since it was the same year when Carver Mead and Lynn Conway wrote the textbook for VLSI design entitled “Introduction to VLSI Systems,” which by the way fascinated us so much as to draw us into the world of VLSI design, one can say that silicon compiler is a very natural idea for complex chip design.

Johannsen’s adviser was Mead (see footnote), who in 1982 foresaw the coming of an age when specialized chips are made using silicon compilers and foundries in his paper entitled “Silicon compilers and foundries will usher in user-designed VLSI.”

Johannsen cofounded Silicon Compilers Inc. with Edmund Cheng in 1981. By using the GENESIS tool from the company, one could design a chip in one-fifth the normal time by choosing from the tool’s menu. The tool was used by DEC to develop its MicroVAX minicomputer.

However, the company did not otherwise achieve much success and was eventually sold. Another company which developed a silicon compiler - Seattle Silicon Technology – was not successful either.

A speech by Johannsen in celebration of Mead’s 80th birthday, infused with Johannsen’s love for his mentor, can be watched online.
There was an anecdote about the need for care in layout color assignments. In America where Mead’s textbook was used, red was assigned to polysilicon gate. But at Toshiba where I worked, red was for aluminum interconnects, so it caused a lot of confusion.

Even today silicon compiler has not achieved wide adoption. Why is that the case?

If a software program has a bug, it can be patched and fixed later. By contrast, if a piece of hardware has a bug, it must be fixed immediately. Furthermore, while the performance of a software program is expected to evolve in conjunction with the hardware, hardware is supposed to meet its performance specification before it can be shipped. As a result, hardware is much more difficult to design than software, and its development carries higher risk.

While one-click compilation is the expectation in the software world, it is a dream in the hardware world. Even for the major design tool companies like Cadence and Synopsys, the compiler tools they develop are for the skilled designers only. Being able to make chips the way software is written has remained a wild dream.

Silicon Compiler 2.0

Recently interest in silicon compiler is rising again. However, the reason is different than before.

Today chip design is about optimizing PPA – power, performance, area. Once, area and hence chip cost had the highest priority. Eventually, performance in terms of operating speed became important, and now power has the highest priority. That is because chip power has reached its upper limit, such that only those who can increase power efficiency can extract proportionally higher chip performance. In other words, chip performance is determined by power efficiency.

Compared to general-purpose chips which can do everything, specialized chips can achieve orders-of-magnitude higher power efficiency by eliminating unnecessary circuits. However, since specialized chips have much smaller production volumes than general-purpose chips, each chip bears a much larger portion of the development cost.

Chip design technology has not been able to keep pace with Moore’s Law. As a result, development cost has been rising rapidly in recent years, reaching something on the order of 100 million dollars. As an example, if the development cost is $100 million and 10 million chips are manufactured at a unit cost of $10, development cost will amount to half of the total chip cost. Consequently, if the development cost can be reduced to one-tenth, then even if chip area, and with it unit manufacturing cost, increases by 1.5x, the total cost still falls by 20%.
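The arithmetic in this example can be checked with a minimal sketch. It assumes that unit manufacturing cost scales linearly with chip area, which the example implies but does not state; the function name is hypothetical.

```python
def cost_per_chip(development_cost, volume, unit_manufacturing_cost):
    """Total cost per chip: amortized development cost plus manufacturing cost."""
    return development_cost / volume + unit_manufacturing_cost


# Baseline: $100M development, 10M chips, $10 manufacturing cost per chip.
baseline = cost_per_chip(100e6, 10e6, 10.0)               # 10 + 10 = 20 dollars

# Compiler-based flow: development cost cut to one-tenth, area up 1.5x.
# Assumption: manufacturing cost grows in proportion to area.
compiled = cost_per_chip(100e6 / 10, 10e6, 10.0 * 1.5)    # 1 + 15 = 16 dollars

saving = 1 - compiled / baseline                          # about 0.20
print(baseline, compiled, round(saving, 2))
```

The saving grows as production volume shrinks, because development cost then dominates the per-chip total even more strongly.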

In the past development cost was small enough that chip area had the highest priority. Since development cost has been rising rapidly, its reduction has become important. Furthermore, with the fast pace of technological changes nowadays, beside lowering cost, shortening development time is also critical in order to reduce risk.

Combining the use of ASIC for orders-of-magnitude reduction in power with the use of a compiler to reduce development cost and time, albeit at the expense of somewhat worse performance and area, can be a profit-generating formula. And if that leads to more chips being designed, it becomes possible to reduce mask cost from $10 million to $0.1 million by using MPW (Multi-Project Wafer) prototype pooling services.

By adding high-level synthesis, chip description can be written in C, which will enable the number of chip designers to grow like software engineers. If the open source model can take root in the hardware world, its ecosystem can expand into a multitiered structure to enable mass collaboration. When that happens, it will become possible to make chips the way software is written.

At d.lab we are engaged in the research and development of a design platform to implement high-level synthesis of Verilog from C, followed by compilation from Verilog to GDS-II for ASIC development, after system design and verification are completed using 3D-FPGA.

Our goal is to democratize access to silicon technology. We want to enable system developers to rapidly create ASIC hardware. To achieve that, we aim to enable making chips the way software is written by using silicon compiler to improve development productivity by 10-fold.

Fig. Making specialized chips the way software is written by using silicon compiler.


The year was 1986. I was at Toshiba exploring the possibility of working with Silicon Compilers Inc. Through that work I came to know Tom Ho, who later became my best friend.

Tom graduated from UC Berkeley after immigrating from Macau to California. After serving as chief of the 80286 design at Intel, he joined Silicon Compilers Inc. at the invitation of Edmund. When we met, he was 31 and I was 27 years of age.

In a San Jose motel, we discussed circuits while drawing circuit diagrams in a notepad, and were so absorbed that we forgot about time. It was Tom who taught me that an inverter with its output shorted to its input makes the best SRAM sense amp. The ABC (Automated Bias Control) circuit which I published later in 1991 was an idea that germinated out of that discussion.

I asked Tom where he learned about circuits and his reply was from Carlo Séquin at UC Berkeley. And when I said I wanted to visit Berkeley, he took me on a 1.5-hour journey one-way to get there, while carrying the thick manual of GENESIS along with him.

I was a visiting scholar at UC Berkeley in 1989. My host was Séquin who developed RISC-1 together with David Patterson. It was at UC Berkeley where Donald Pederson developed the SPICE circuit simulator in the 1970s, and where Richard Newton, Alberto Sangiovanni-Vincentelli, Robert Brayton and others led the research on design technologies including automatic layout and logic synthesis in the 1980s. It was a flourishing time when major companies like Cadence and Synopsys were born one after another. However, from around 2000, the EDA market gradually reached saturation, while technological advance slowed.

But recently, I hear a lot about students at UC Berkeley repeatedly taping out about once a month by writing RISC-V in Chisel. I can sense the coming of an EDA renaissance.

Tom, don’t you want to do a silicon compiler again?

Chapter -9-

Synchronous and Asynchronous

Chip’s Rhythm


2020.06.10  Format for Print

Chip’s Synchronous Design

Half of a century ago, there was an intense debate about the pros and cons of synchronous vs asynchronous circuit design. The former uses a clock to synchronize circuit timing while the latter does not.

The following experiment was conducted at Caltech. The high achievers among a group of students were asked to complete an asynchronous design, while the rest were asked to complete a synchronous design of the same chip. The result: while many of the synchronous designs functioned correctly, the asynchronous designs did not. What do you think happened with the asynchronous designs?

There are two types of logic circuit - combinational logic circuit where the output is always the same given the same input, and sequential logic circuit where the output can vary depending on the state of the circuit even when the input is fixed. In computation we expect the answer to always be the same, so we use combinational logic. Meanwhile in control, the action depends on the state, so we use sequential logic.

Let’s consider a state transition as an example. When a state represented by 2 bits {S1, S2} transitions from {0,1} to {1,0}, it may instantaneously be in a transient state of {0,0} or {1,1}. That is because S1 and S2 are generated by different circuits. Even if the two circuits are designed to be the same, it is difficult to synchronize their timing because of manufacturing tolerances creating differences in the individual circuit elements.

That instantaneous wavering, known as a dynamic hazard, will not cause any problem in computation because the final answer is still correct. However, in control it can result in errors. That is because if data happens to arrive at the instant of wavering, an incorrect control action can result.
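The transient states in the {S1, S2} example can be enumerated with a small sketch. The delay values are arbitrary assumptions standing in for manufacturing variation, and the function name is hypothetical.

```python
def observed_states(start, end, delay_s1, delay_s2):
    """List the {S1, S2} states seen over time when each bit switches after its own delay."""
    times = sorted({0, delay_s1, delay_s2})
    states = []
    for t in times:
        s1 = end[0] if t >= delay_s1 else start[0]
        s2 = end[1] if t >= delay_s2 else start[1]
        if not states or states[-1] != (s1, s2):
            states.append((s1, s2))
    return states


# S1 switches slightly earlier than S2: the circuit passes through {1,1}.
print(observed_states((0, 1), (1, 0), delay_s1=1.0, delay_s2=1.2))
# [(0, 1), (1, 1), (1, 0)]

# S2 switches slightly earlier than S1: the circuit passes through {0,0}.
print(observed_states((0, 1), (1, 0), delay_s1=1.2, delay_s2=1.0))
# [(0, 1), (0, 0), (1, 0)]

# Only perfectly matched delays avoid the transient state.
print(observed_states((0, 1), (1, 0), delay_s1=1.0, delay_s2=1.0))
# [(0, 1), (1, 0)]
```

In every case the final state is correct; the risk lies only in control logic acting on the state during the brief transient.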

But if we make data that arrives early wait for data that arrives late, and then release them all together at the transition of a clock signal, like how a traffic signal makes all cars start at the same time when it turns green, we synchronize their timing to the clock cycles.

We can hold data by using a loop consisting of two inverters. For example, if we input L into the first inverter, it outputs H, which is converted to L by the second inverter to feed back into the first inverter.

If we insert a switch into the loop that is controlled by a clock, such that the loop is closed to hold data when clock is L, and opened to allow data to pass through when clock is H, we create what is known as a latch to latch onto data.

If we connect two latches together and feed the first latch a clock with the opposite phase, we get a flip-flop. When clock is L, the first latch lets new data in while the second latch holds current data. When clock switches to H, the first latch holds the new data while the second latch passes that data to the output. Therefore, at the instant clock transitions to H, different flip-flops are made to output data at the same time.

By the way, when clock switches from H to L, although the first latch lets new data in, the second latch has already latched onto and hence holds current data. As a result, the flip-flop holds its output at the current value while waiting for new data.
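The behavior of the latch and the two-latch flip-flop can be sketched as a purely logical model. The class names are hypothetical, H and L are represented as 1 and 0, and circuit delays are ignored.

```python
class Latch:
    """Level-sensitive latch: passes data while enable is 1, holds it while enable is 0."""

    def __init__(self):
        self.q = 0

    def update(self, d, enable):
        if enable:
            self.q = d      # loop opened: data passes through
        return self.q       # loop closed: the stored value is held


class FlipFlop:
    """Two latches in series; the first one receives the clock with opposite phase."""

    def __init__(self):
        self.first = Latch()
        self.second = Latch()

    def update(self, d, clock):
        self.first.update(d, enable=1 - clock)              # lets data in while clock is L
        return self.second.update(self.first.q, enable=clock)  # passes it on while clock is H


ff = FlipFlop()
# (clock, data) pairs; data changes while clock is H must not reach the output.
stimulus = [(0, 1), (1, 1), (1, 0), (0, 0), (1, 0)]
outputs = [ff.update(d, clock) for clock, d in stimulus]
print(outputs)  # [0, 1, 1, 1, 0]
```

The output changes only at the two steps where clock goes from L to H, even though the data input changed in between.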

With the use of flip-flops, timing can be verified on a per cycle basis, reducing the cost of verification. Flip-flops are used in general chip design.

On the other hand, when latches are used, data can pass through as long as clock is H, so it is possible to recover even if data was delayed in an earlier cycle. However, to verify timing, one must check not only the current but also earlier cycles, making verification more costly. Latches are used in processor design.

Fig. Flip-flop circuit: data is read out on rising clock edge after being held for a cycle.

Rethinking Asynchronous Design

Careful timing design is necessary to achieve high chip performance. It is necessary to determine and then guarantee the required timing margin by considering the effect of external factors including manufacturing tolerances as well as variations in power supply and temperature on signal transmission delay of the logic circuit, by estimating jitters in clock generation and skew in clock distribution, and by taking into account the targeted manufacturing yields.

The required margin grows as process is scaled and supply voltage lowered. Furthermore, as clock speed grows, the cost of timing design which is known as timing closure goes up as well.

However, in synchronous design, since the clock period is determined by the slowest circuit, known as the critical path, most of the other circuits finish early and their speed makes no real difference to chip performance. Meanwhile, clock distribution and flip-flops alone consume 25 to 50% of total power.
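A small sketch illustrates how the critical path sets the clock period. The path names, delays, and margin are invented numbers chosen only for illustration.

```python
# Hypothetical path delays in picoseconds (illustrative values only).
path_delays_ps = {"adder": 900, "decoder": 400, "mux": 300, "control": 500}
timing_margin_ps = 100   # assumed allowance for jitter, skew, and variations

# The slowest path is the critical path and fixes the clock period.
critical_path = max(path_delays_ps, key=path_delays_ps.get)
clock_period_ps = path_delays_ps[critical_path] + timing_margin_ps

# Every other path finishes early and waits for the next clock edge.
idle_ps = {name: clock_period_ps - delay for name, delay in path_delays_ps.items()}

print(critical_path, clock_period_ps)   # adder 1000
print(idle_ps)
```

With these numbers the mux path is idle for 700 ps of every 1,000 ps cycle, which is the kind of waste that asynchronous design aims to recover.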

As the cost of and waste in synchronous design became dominant, around the time when clock frequency crossed over 1GHz, research was started to reconsider asynchronous design. The paper by Ivan Sutherland entitled “Computers without Clocks - Asynchronous chips improve computer performance by letting each circuit run as fast as it can” was published in 2002. A portion of Sun Microsystems’ UltraSPARC IIIi adopted asynchronous circuits.

Sutherland is a genius who is considered the father of computer graphics. He is also skilled in chip design. He advocated the use of logical effort, a logic circuit delay model, in 1999. It is such a great model that I teach it in my class.

He also published a paper in 2003 on the use of electric coupling for chip interconnection. That was about the time when my team started our research on using magnetic coupling for chip interconnection. When I was MacKay Professor at UC Berkeley in 2007, I had the honor of joining him in faculty meetings.

Back to the main subject. Asynchronous circuit uses dual-rail logic where the two outputs remain equal when computation is in progress. When one output switches, a signal indicating that computation is complete is transmitted to the circuit in the next stage together with the result of the computation.
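Dual-rail completion detection can be sketched as follows. The rail encoding used here, (0, 0) for "still computing", (1, 0) for logic 1, and (0, 1) for logic 0, is one common convention and is an assumption; the function names are hypothetical.

```python
NULL = (0, 0)   # both rails equal: computation still in progress


def encode(bit):
    """Encode a logic value on two rails (assumed convention: (true_rail, false_rail))."""
    return (1, 0) if bit else (0, 1)


def is_complete(pair):
    """Completion signal: raised as soon as exactly one rail has switched."""
    return pair[0] != pair[1]


def decode(pair):
    if not is_complete(pair):
        raise ValueError("result not valid yet")
    return pair[0]


def dual_rail_and(a, b):
    """AND gate on dual-rail values; the output stays NULL until both inputs are valid."""
    if not (is_complete(a) and is_complete(b)):
        return NULL
    return encode(decode(a) & decode(b))


waiting = dual_rail_and(encode(1), NULL)       # second input has not arrived yet
done = dual_rail_and(encode(1), encode(1))     # both inputs valid
print(is_complete(waiting), is_complete(done), decode(done))  # False True 1
```

The next stage needs no clock: it simply waits until `is_complete` becomes true, so each stage runs as fast as its own data allows.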

While asynchronous design consumes more transistors and interconnects than its synchronous counterpart, there may come a time when you end up with a gain given the waste in synchronous design.

I thought the time would come in the 7nm generation. However, because FinFET which was introduced to improve transistor gate control has better performance than expected, asynchronous design has not been shown to be able to overtake synchronous design at 7nm. It looks like transistor structure innovation will continue, so it may be a while before asynchronous design gains adoption.

Nevertheless, asynchronous design is well suited for parallel data processing using wired logic as exemplified by neural network which has attracted a lot of attention through its adoption by AI. (But then we may make an incorrect judgment after changing our mind multiple times.)

Rhythm in Nature

One day in 1665, Christiaan Huygens, who proposed the Huygens–Fresnel principle based on the wave theory of light together with Augustin-Jean Fresnel, noticed accidentally that the pendulums of two clocks placed side-by-side on the same wall of the room were moving in synchronization. One pendulum always swung to the right when the other to the left. Even when he intentionally decoupled their motions, they would eventually return to being synchronized.

However, when he put the two clocks on separate walls removed from each other, synchronization did not occur. Huygens postulated that the synchronization was the result of a very weak interaction between the two clocks.

In fact, rhythm can be found everywhere in this world. And when rhythm meets rhythm, they synchronize.

For example, when people walk across a suspension bridge, their steps can reinforce each other to cause the bridge to swing widely even without coordination. This is what happened to the Millennium Bridge across the Thames River in London in 2000. Things like trends and traffic jams are also rooted in the phenomenon of synchronization.

Synchronization also occurs in insects and cells. In southeast Asia, many fireflies gather in a mangrove forest and emit blinking lights in synchronization.

In mammals, the suprachiasmatic nucleus in the hypothalamus of the brain has about 20,000 cells which synchronize to create a biological clock inside the body to generate the rhythm for things such as the sleep cycle. Inside the heart, there are about 10,000 pacemaker cells which tirelessly fire in synchronization to generate a heartbeat for a total of 3 billion times in a lifetime.

Even heartless, inanimate objects synchronize. Superconductivity is achieved when many electrons march in lockstep resulting in almost zero electrical resistivity. Meanwhile, the formation of laser into a high energy beam is the result of many atoms emitting photons with synchronized phase and frequency.

On the other hand, we can always see the rabbit on the moon in the night sky because the moon’s rotation is synchronized with its revolution around the earth such that it always faces the earth with the same side. Furthermore, the gravitational pull of planets in the solar system can synchronize and be aligned to send meteorites from the asteroid belt towards the earth, which caused the extinction of dinosaurs.

Synchronization also occurs in man-made networks and virtual space. Power generators connected to a high-voltage power grid can synchronize naturally. It is the consequence of speed alignment through energy transfer from generators with high rotational speed to those with low rotational speed. This can lead to chain reactions resulting in disruptions when an anomaly occurs. In addition, routers on the internet have been observed to synchronize like fireflies, causing sudden fluctuations in traffic.

The first engineering attempt to control synchronization was made by Robert Adler, whose analysis of the frequency entrainment (locking) phenomenon in oscillator circuits was published in 1946.

My research team was probably the first to attempt to synchronize more than two circuits through coupling. In 2006 we succeeded in using transmission lines to couple and synchronize the outputs of four oscillators integrated on the same chip. Then in 2010, we discovered the group synchronization phenomenon resulting from magnetic coupling of four chips stacked in 3D and used it to develop a technology for precise clock distribution to the chips.

Nonlinear analysis continues to be applied to understand such group synchronization phenomenon.
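As an illustration of how weak coupling pulls oscillators into step, the sketch below simulates four coupled phase oscillators using the Kuramoto model. This is a standard textbook model swapped in for illustration, not the analysis used in the research described above, and the frequencies, starting phases, and coupling strength are arbitrary assumptions.

```python
import cmath
import math


def simulate(natural_freqs, phases, coupling, dt=0.01, steps=20000):
    """Integrate the Kuramoto model with forward Euler and return the final phases."""
    n = len(phases)
    phases = list(phases)
    for _ in range(steps):
        rates = [
            natural_freqs[i]
            + coupling / n * sum(math.sin(phases[j] - phases[i]) for j in range(n))
            for i in range(n)
        ]
        phases = [p + r * dt for p, r in zip(phases, rates)]
    return phases


def coherence(phases):
    """Order parameter: near 0 for scattered phases, 1 for perfect synchronization."""
    return abs(sum(cmath.exp(1j * p) for p in phases)) / len(phases)


freqs = [1.00, 1.02, 0.98, 1.01]   # four slightly mismatched oscillators
start = [0.0, 0.5, 2.0, 3.0]       # scattered starting phases

before = coherence(start)
after = coherence(simulate(freqs, start, coupling=0.5))
print(round(before, 2), round(after, 2))
```

With the coupling switched on, the order parameter rises from well below 1 to very nearly 1, meaning that the four oscillators end up running at a common frequency with almost identical phases.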

Chapter -10-


Time is Money


2020.09.23  Format for Print

Cost-Performance vs Time-Performance

One often hears the expression “The cost-performance is good.” Cost-performance is the most important metric in the semiconductor industry.

However, recently time-performance has become an important metric as well. There are two reasons for that.

The first reason is that society is changing from being capital-intensive to knowledge-intensive.

In the recovery after the war, Japan strove to become an industrial nation, and furthermore an electronic nation through the development of semiconductor technology. Both industrial society (Society 3.0) and information society (Society 4.0) are capital-intensive societies. Bigger is better, and mass production of standardized products and mass consumption are encouraged. But it has become clear that the resultant increase in the burden to the environment is limiting growth.

Japan is a rapidly aging society with low birth rate. As a result, the new Society 5.0 we are aiming for is a human-centric society where everyone contributes their wisdom.

A society where wisdom gives birth to value is also one which makes use of the individuals. It is Japan’s new strategy to create a society with sustainable growth where everyone thrives.

Digital innovation is the driving force towards that goal.

Unexpectedly the spread of Covid-19 is accelerating digital innovation. Digital innovation starts with the creation of a platform, where speed is the essence.

In a capital-intensive society, materials are the resources and things deliver the value. Specifically, materials are used to make components, which are used to create products. Wisdom in the form of services, design, and market strategy and so forth is then added to deliver societal impact. In this scenario semiconductors are components. Components must be low cost.

On the other hand, in a knowledge-intensive society, data is the resource and wisdom delivers the value. Specifically, AI is used to analyze data collected using IoT and 5G, and the results are used to create services and solutions. The power of semiconductor is then added to deliver societal impact.

In other words, there is a reversal of roles in value creation, where semiconductors are shifting into a role of higher value. The semiconductor business must also transition from the component business of the past to a societal impact business. A new strategy is required.

Another reason for the heightened importance of time-performance is that semiconductors are transitioning from being a necessity of industries to part of the infrastructure of society.

In a capital-intensive society, the infrastructure consists of roads, harbors and railways which transport materials being used as resources. In a knowledge-intensive society, however, information networks which move data as the resource form the infrastructure. And information networks are realized using semiconductors.

Cost-performance is important for the semiconductor business with semiconductors being components. Because consumer products like TV, PC, and smartphone are replaced once every several years, devices with high cost-performance that come later to the market are bought by consumers to replace their older predecessors. Therefore, cost-performance is important.

By contrast, because the replacement cycle for industrial products such as communication equipment and robots is more than 10 years, even if higher cost-performance products are available later on, businesses are not motivated to purchase them. In the end, products that get wide adoption are those which are first to market.

As a result, time-performance is important for the semiconductor business in Society 5.0. “Time is money.” Time is determined by development efficiency, while performance is determined by power efficiency.

Semiconductors for Post-5G

With 5G, base stations use software virtualization to deliver diverse services and address various use cases. In other words, it is necessary to construct a flexible network which implements functional virtualization of general-purpose servers and slicing.

On the other hand, beyond 5G, because the signal cannot travel far, the area covered by a cell shrinks, and as a result base stations are being miniaturized. In other words, base station power, volume (size), and weight must be reduced so that many of them can be inexpensively deployed in urban areas. Specifically, telecommunication carriers are asking for “5W, 5L, and 5kg” as target.

The limited power consumption required of small base stations limits performance of their servers. To compensate for the inadequate performance, hardware accelerators with high power efficiency become necessary. The trend is to add FPGA and ASIC-based network cards to the servers so they can handle the heavy-duty, routine processing in hardware.

As a result, even though general-purpose servers are adopted starting with 5G (specialized hardware using ASICs was used up until 4G), what determines performance and cost is FPGA and ASIC.

The added costs, power, volume, and weight when FPGA or ASIC accelerators are added to general-purpose servers have been computed and compared in the following table. While the absolute values vary with the assumptions, the table can be used for relative comparison.

Fig. 5G Base Station Hardware Time-Performance Comparison
RaaS is undertaking the R&D of Agile 3D-FPGA and Agile 3D-ASIC.

If you compare realizable performance under the same power constraints between server, FPGA, and ASIC, the ratio is 1:2:8. In other words, ASIC is a very effective way of delivering performance. CPU and FPGA have poor power efficiency because they must include a significant amount of extra circuitry in order to support programmability. The need to support software backward compatibility adds the further burden of legacy circuitry accumulated over time.

However, the challenge with ASIC is its high cost due to limited production volume. From the 7nm process onwards, mask cost alone amounts to $10M, and EUV lithography will remain expensive until its equipment is fully depreciated. Nevertheless, if production volume reaches 100k units, its total cost including development cost can be reduced to one-tenth that of a server. It can thus increase the server’s profitability proportionally.

The recent worldwide trend of developing specialized chips (ASICs) instead of using general-purpose chips is driven by the goal of achieving lower power and cost. In other words, it is because of better cost-performance. By making ASICs, you can achieve both higher performance and lower cost.

There was a time when communication equipment makers also actively developed ASICs. In the 1990s, transistor count was only on the order of 100k and an ASIC could be developed in a few months. Today, by contrast, transistor count has climbed to a few billion and design alone can take more than a year.

In other words, the challenge of ASIC is that as device density increases, the time required for design and verification has increased to an unacceptable level. On top of that, Japan has the challenge of continuously losing its ASIC design capability. The outflow and loss of engineering talent resulting from the decline of Japan’s semiconductor industry has been painful.

Because communication is an infrastructure business, what is most important is business continuity. Telecommunication carriers who can secure vested interest in the form of frequency allocation and create a stable business have the ability to decide a specification and attract bidding by multiple suppliers. Meanwhile, as a result of M&A driven by intense international competition, only a few mega suppliers have been able to survive. Nevertheless, the recent trend of securing supply chain to ensure economic security is driving a re-examination of such an industry structure.

In the communication equipment business, it is common for the supplier that is first to market to win market share. As a result, supplier competition has led to shortening of the lead time from when specifications are finalized to when products are launched.

Time-performance is important for AI too. That is because AI technology is advancing so rapidly that technology from a few years back is already obsolete.

Using the Computer

I have heard the following from someone in the telecommunication carrier business. “In contrast to Chinese makers taking only 2 months to design an FPGA, it takes Japanese makers more than 6 months. While it may partly be due to differences in business culture, close examination reveals that Chinese makers achieve the shorter turnaround time by throwing a lot of human resources at the task.”

The strategy that Japan should adopt is one that drives the task with computer and leaves no human in the loop, instead of relying on vast human resources.

At RaaS, we are pursuing large time-performance by engaging in R&D to achieve 10-fold increase in development efficiency as well as 10-fold increase in energy efficiency.

To achieve 10-fold increase in development efficiency, we are pursuing R&D of an Agile Design Platform (Agile 3D-FPGA and Agile 3D-ASIC in the previous table) and adopting open architectures such as RISC-V through international collaboration. Our goal is to use the computer to fully automate design and verification and eliminate possibility of errors by taking human out of the loop.

In parallel, to achieve 10-fold increase in energy efficiency, we are pursuing R&D of 3D integration technology as well as utilizing advanced CMOS processes through our alliance with TSMC. By stacking multiple chips and integrating them into a single package, we will be able to shorten the distance of data movement by orders-of-magnitude, thereby significantly improving energy efficiency.

In fact, this strategy shares many objectives with DARPA’s Electronics Resurgence Initiative (ERI) in the US. The difference is the incorporation of Japan’s area of expertise of 3D integration. The Agile Design Platform will be the product of combining EDA and 3D integration.

Japan’s telecommunication carriers outsource their chip design to Qualcomm (US), Mediatek (Taiwan), Broadcom (US), and HiSilicon (China). Our goal is to make it possible for them, as chip users, to use the computer to design their own advanced chips without relying on overseas chip design houses.

Chapter -11-

AI Chips

Lessons from The Brain


2020.10.21  Format for Print

Birth of The Computer from Mathematics

In ancient times, humans counted by folding their fingers and measured distance by counting their steps. However, humans cannot perceive large numbers. The invention of computing tools during the time of the four ancient civilizations helped expand human’s cognitive capacity.

During the time of ancient Greece, mathematics evolved from being a tool into a way of thinking, and the interior world of mathematics became a subject of research. The invention of symbolic algebra in the 15th century during the Renaissance made it possible to inquire about the nth dimensional space which is hard to represent in the physical world. In such manner, mathematics achieved widespread application without being bounded by physical constraints.

Eventually in the 17th century calculus was invented which enabled study of the world of infinite. After close examination of the concepts of limits and continuity, abstract symbolism was born which exceeded what subjective intuition could achieve. Then as we entered the 20th century, efforts were started to make mathematics about how to perform mathematics.

As a result, mathematics moved out of our body into our brain, completely departed from ambiguity represented by the likes of physical intuition and subjective feelings, and finally flowed out of our brain, giving birth to the computer.

Early electronic computers were plagued by frequent breakdowns of the vacuum tube. The problem was solved in 1948 by the invention of the transistor, which manipulates electrons in a solid rather than in a vacuum.

In addition, wired logic used in early computers, where the wiring determined the computer’s functionality, suffered from two challenges. The first was the challenge of scale, where the size of the largest program that could be processed was limited by the scale of the available hardware. The second was the challenge of wiring, where the number of interconnections exploded as the scale of the system grew.

To overcome the first challenge, von Neumann invented the stored program architecture (von Neumann architecture), where the data to be processed together with the commands that control both data movement and processing are first stored in memory, then the processor retrieves and interprets the commands one after another and executes the corresponding computation. This represented a paradigm shift in architecture from using multiple processing units physically wired in a way to implement certain processing, to using one processing unit that executes a different command in each cycle in order to overcome the challenge of scale.
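The stored program idea can be illustrated with a toy machine in which commands and data share one memory. The four-command instruction set and the memory layout are invented for this sketch and do not correspond to any real processor.

```python
def run(memory):
    """Fetch, interpret, and execute commands stored in the same memory as the data."""
    pc = 0     # program counter: address of the next command
    acc = 0    # accumulator: the single working register
    while True:
        op, address = memory[pc]     # fetch the command
        pc += 1
        if op == "LOAD":             # interpret and execute it
            acc = memory[address]
        elif op == "ADD":
            acc += memory[address]
        elif op == "STORE":
            memory[address] = acc
        elif op == "HALT":
            return memory
        else:
            raise ValueError(f"unknown command: {op}")


memory = [
    ("LOAD", 4),    # address 0: read the first operand
    ("ADD", 5),     # address 1: add the second operand
    ("STORE", 6),   # address 2: write the result back to memory
    ("HALT", 0),    # address 3: stop
    2,              # address 4: data
    3,              # address 5: data
    0,              # address 6: result
]
result = run(memory)[6]
print(result)  # 5
```

Changing the behavior requires only different contents in memory; the single processing loop in `run` stays the same, which is how one processing unit overcomes the challenge of scale.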

Meanwhile, Jack Kilby invented the integrated circuit (IC) in 1958. The challenge of wiring was overcome in a brilliant way by using photolithography to integrate many devices and wires onto a single chip. Eventually silicon was found to be the best material for fabricating ICs.

In this manner, integration and parallelization of simplified and miniaturized computing resources onto silicon chips resulted in a dramatic improvement of computer performance. In turn, computers with higher performance enabled design of even larger scale integrated circuits.

As a result, the combination of von Neumann architecture, IC, and silicon enabled the co-evolution of the computer and chip to deliver exponential advance.

Energy is consumed while doing work. The amount of work that an IC can do, in other words the IC’s performance, is limited by the power supplied and the heat that can be removed. Therefore, chip performance can be raised by improving energy or power efficiency, where power is the rate of energy flow.

Chip power efficiency has improved by three orders of magnitude in the last 20 years, reaching about 1/100 of that of the brain. At the same time, the level of chip integration has reached about 1/100 of the number of neurons in the brain. If these trends continue, the chip could catch up with the brain in about 10 years.
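The extrapolation can be checked with a few lines of arithmetic. This is a rough sketch that assumes improvement continues at a constant exponential rate, which is a simplification; the function name and parameters are illustrative only.

```python
import math


def years_to_close_gap(past_gain, past_years, remaining_gap):
    """Years needed to gain `remaining_gap` at the rate that produced
    `past_gain` over `past_years`, assuming constant exponential growth."""
    return past_years * math.log(remaining_gap) / math.log(past_gain)


# 1000x gained in 20 years; a further 100x is needed to match the brain.
years = years_to_close_gap(past_gain=1000, past_years=20, remaining_gap=100)
print(round(years, 1))
```

Under this constant-rate assumption the gap closes in roughly 13 years, the same order as the decade cited above; a faster rate of improvement would shorten it.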

However, because the von Neumann architecture requires a large amount of data and commands to flow back and forth between the processor and memory, the interface between them has become a bottleneck (the von Neumann bottleneck). Furthermore, as device dimensions dropped below 100 nm at the beginning of the 21st century, quantum effects began to surface, and leakage current in silicon chips could no longer be suppressed. The growth of the computer and chip born half a century ago is approaching its limits.

But before reaching its limits, the computer acquired the ability to learn on its own: machine learning. And AI chips, which mimic the neural networks of the brain, were born.

AI Chips Learning from The Brain

Although the underlying technologies required for the design of neural networks were developed during the 20th century, the space represented by neural networks was so vast that it was difficult to train deep neural networks with more than 4 layers.

However, as we entered the 21st century, deep autoencoders were successfully created and computers became powerful enough for training. Deep learning then began to deliver far better results than conventional processing, which led to its rapid adoption.

Research on both network structure and architecture has also advanced. The CNN (Convolutional Neural Network), in which only nearby signals are coupled, was successfully developed for image recognition. Research was also conducted into networks for recognizing time series data such as voice and natural language, including the RNN (Recurrent Neural Network) and LSTM (Long Short-Term Memory). More recently, the transformer architecture has gained prominence; instead of the recurrent construction of an RNN, it uses a self-attention mechanism to focus on the important parts of the input.
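To make the idea of self-attention concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is deliberately simplified: the queries, keys, and values are assumed to be the input vectors themselves, whereas real transformers apply learned projection matrices and use multiple attention heads. The function names are illustrative.

```python
import math


def softmax(xs):
    """Turn a list of scores into positive weights that sum to 1."""
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with identity projections.

    x: list of token vectors, each a list of floats of the same length d.
    Each output vector is a weighted average of all input vectors, with
    larger weights on the inputs most similar to the current token.
    """
    d = len(x[0])
    out = []
    for q in x:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # Weighted average of the input vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(tokens):
    print([round(value, 3) for value in row])
```

Every token attends to every other token in a single step, so no recurrent pass over the sequence is needed.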

All these research efforts take their hint from the human brain. Especially important among them is the pruning of neural networks.

Although we are born with only about 50 trillion synapses in our brain, a year after birth the number grows to 1000 trillion. Thereafter, however, the number of synapses decreases as we learn. Synapses that carry signals are strengthened and remain, while unused synapses that receive no signal are pruned and disappear. By around age 10, the number of synapses has been reduced by half, and it remains roughly the same afterward.

In other words, while the neural network that forms in our brain during our early childhood is close to a fully coupled network, as we learn, unnecessary wiring is removed leaving only necessary wiring. This is how we develop a lean and efficient functional neural network in our brain.
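An artificial counterpart can be sketched in a few lines. The example below uses magnitude pruning, a common and simple technique in which the connections with the smallest weights are removed. It is a stand-in chosen for illustration: the brain prunes by signal activity rather than by weight, and the function name and sample weights are hypothetical.

```python
def prune_by_magnitude(weights, keep_fraction):
    """Zero out the smallest-magnitude weights of a layer.

    weights: list of rows, each a list of floats (one row per neuron).
    keep_fraction: fraction of connections to keep, between 0 and 1.
    Ties at the threshold are kept, so slightly more may survive.
    """
    magnitudes = sorted((abs(w) for row in weights for w in row), reverse=True)
    n_keep = max(1, int(len(magnitudes) * keep_fraction))
    threshold = magnitudes[n_keep - 1]
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


# A fully coupled layer with two neurons and three inputs each.
layer = [[0.9, -0.05, 0.4],
         [0.01, -0.7, 0.1]]

pruned = prune_by_magnitude(layer, keep_fraction=0.5)
print(pruned)
```

Half of the connections are removed while the strongest ones remain, which mirrors how the fully coupled network of early childhood becomes lean.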

Children have a large brain to facilitate learning. But an adult’s brain is pruned in order to perform inference efficiently. Born small, raised to grow large, and made to learn in society is the survival strategy of mammals equipped with a well-developed brain.

Human and Silicon Brain

Summarizing our discussion so far regarding the human and silicon brain, in transitioning to a pre-programmed state, the von Neumann computer, born out of mathematics, is able to perform robust information processing. It works like the thalamus, amygdala, and cerebellum of the human brain whose functions are determined through inheritance.

On the other hand, wired logic-based neural networks, taking lessons from the human brain, continue to learn as an open system while being pruned, and perform time-irreversible, flexible information processing with high energy efficiency. They work like the cerebral cortex which learns and develops in society.

Silicon brains can thus be described by reference to the human brain (as shown in the figure below). Does that mean silicon brains will assume the same structure as the human brain?

Fig. Silicon Brain
Processor serves the role of the thalamus, amygdala, and cerebellum, while neural networks serve that of the cerebral cortex.
(S: Sensor, A: Actuator, P: Processor, M: Memory, NN: Neural Network).

“That is an amazing dynamic range!” exclaimed Mr. Kazuyuki Aihara in 1981, who at the time was a senior of mine in my research lab (and is now an honorary professor at the University of Tokyo). He was referring to the large variation in resistivity of the nerve axon he found by simulating and analyzing using the Hodgkin-Huxley model, which is a set of nonlinear differential equations that describe the initiation and propagation of action potentials in nerve axons.

It is not easy to artificially create something with similar characteristics. The human and silicon brains may end up having different structures based on different principles, just as birds and airplanes do.

Neural networks have a wired logic architecture where functionality is determined by the wiring. As a result, I have great expectations for the FPGA (Field Programmable Gate Array), which has programmable wiring.

Chapter -12-

Semiconductor Industry Strategy

Think Two Steps Ahead and Strike


2021.11.01

Game Changer

In June of 2021, Japan’s Ministry of Economy, Trade and Industry (METI) published its strategy for the country’s semiconductor industry. One of the documents released was entitled “Japan’s Decline.” It pointed out that, while Japanese companies collectively held a 50% share of the world market in 1988, their share suffered a free fall after that, dropping to only 10% today. That caught the nation’s attention.

In the last 30 years, while the global semiconductor market continued to grow robustly at an annual rate of more than 5%, Japan’s market remained flat. If the trend continues, Japan’s share may drop to a negligible level. Meanwhile, riding on the wave of digital transformation, the world market looks to accelerate its growth to an 8% annual rate, doubling to US$1 trillion by 2030.
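The doubling claim can be checked with compound-growth arithmetic. This is a rough sketch: the baseline of about US$500 billion and the 2021 starting year are assumptions inferred from the figures in the text, not published forecasts.

```python
def project(value, annual_rate, years):
    """Compound `value` at `annual_rate` for `years` years."""
    return value * (1 + annual_rate) ** years


start = 500  # world market in US$ billion (assumed 2021 baseline)
years = 9    # 2021 to 2030

at_8_percent = project(start, 0.08, years)
at_5_percent = project(start, 0.05, years)

print(round(at_8_percent), round(at_5_percent))
```

At 8% a year the market roughly doubles to about US$1 trillion over nine years, whereas at 5% it would reach only about US$775 billion.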

Is there any way to turn Japan’s semiconductor industry around?

The key to success in the semiconductor industry is, simply put, aggressive investment in process scaling.

Unfortunately, it will not be easy for Japan to make up for the time lost in the last 30 years using conventional approaches alone. We need to predict the next battleground and invest accordingly, ahead of the competition. To follow the practice of sen-sen-no-sen-wo-utsu in the Japanese martial art of kendo, which literally means to sense your opponent’s imminent attack and strike first, we need to think two steps ahead and strike.

To make sense of the current complex environment, it is necessary to understand the three transformations shaping it.

The first is the changing of the guard in the industry. Specifically, the battleground for the logic chip business is shifting from general-purpose chips developed by chipmakers such as Intel to specialized chips developed by chip users such as GAFA.

Let’s take a look at the investments made by 25 influential venture capital companies in America in the 3 years from 2017. Nine times as much was invested in the development of specialized and AI chips as in the memory business.

The age of specialized chips has arrived. (Chapter 1)

Although mass production of standardized general-purpose chips has been the mainstay of the semiconductor industry, there was an earlier period, between 1985 and 2000, when specialized chips, custom-made in small volumes, flourished. It was driven by the production cost savings from merging the scattered logic circuits used to glue general-purpose chips together into one specialized chip.

The challenge of specialized chips is their high development cost. To bring that cost down, universities in the US developed design automation technologies one after another.

Unfortunately, in the 15 years that followed, on-chip device integration density grew by three orders of magnitude because of Moore’s Law, eventually driving design cost up again to prohibitively high levels, bringing the era to an end.
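The scale of that growth follows from simple arithmetic. The sketch below assumes steady exponential growth; the 18-month doubling period used for comparison is a commonly quoted rule of thumb, not a figure from this text.

```python
import math


def doubling_period(growth_factor, years):
    """Years per doubling implied by `growth_factor` growth over `years`."""
    return years / math.log2(growth_factor)


# Three orders of magnitude (1000x) in 15 years.
period = doubling_period(1000, 15)
print(round(period, 2))

# Conversely, doubling every 1.5 years for 15 years gives 2**10.
growth = 2 ** (15 / 1.5)
print(growth)
```

A thousandfold increase in 15 years corresponds to density doubling about every year and a half, so design effort had to keep pace with ten successive doublings.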

But now a game-changing transformation is again taking place, driven by an energy crisis. The use of AI to analyze an explosively growing amount of data consumes a swelling amount of energy. As a result, specialized chips, which consume orders-of-magnitude less energy than general-purpose chips through the elimination of wasteful circuits, are once again in high demand.

While diverse functionalities continue to be implemented in software running on general-purpose chips, AI processing is accelerated through hardware implementation using specialized chips. In other words, to grow while going green requires combining the best of both worlds.

Paradigm Shift

The second transformation is a wave in the market.

A new wave arises in the semiconductor market about once every quarter century. The time is now for the next wave.

From 1970 to 1995, it was the wave of home appliances; from 1985 to 2010, it was the wave of personal computers. We are now on the wave of smartphones that started in 2000 and will likely last until 2025. Japan was able to catch the first wave but missed the second and third. It is important that we are ready for the upcoming fourth wave.

Home appliances deliver convenience in the physical space using analog technology. On the other hand, personal computers create cyberspace using digital technology, while smartphones make cyberspace portable using wireless network technology.

The upcoming fourth wave aims to grow the economy and solve societal problems through extensive fusion of cyber and physical space by using sensors, AI, and motors. In other words, it ushers in the human-centric Society 5.0, the next evolution of society, which utilizes digital twins.

One example is robotics that includes mobile robots such as motor vehicles and drones.

According to robotics researcher and futurist Hans Moravec, the intelligence level of robots today is only that of a mouse. But he predicts that it will advance to that of a monkey by 2030, and approach that of a human by 2040. That will allow intelligent robots to expand their roles from mobility, logistics, and services to medical care, nursing, and entertainment.

These are markets where Japan can lead, given that it is the first country to face many of the emerging problems. These are also areas where Japan can take advantage of its strength in adapting to the physical world.

To be sure, the fourth wave will not stop there. It will go as far as our imagination. As soon as another idea arises, we need to be able to quickly turn it into reality through implementation in chips. It requires agile chip development.

The third transformation is a paradigm shift in technology.

Computers in the 1950s adopted the wired-logic architecture and were programmed by rearranging the interconnections between arithmetic units.

But this architecture faces two challenges. The first is the challenge of scale where the largest realizable program is limited by the scale of the hardware which is fixed at the time of development. The second is the challenge of wiring where the number of physical interconnections in a large-scale system is too large to manage.

To overcome the first challenge, mathematician von Neumann invented the stored program architecture. In this architecture, which is also known as the von Neumann architecture, the data to be processed and the commands that direct the movement and processing of the data are first stored in memory. The processor then retrieves the commands in order, interprets them one at a time, and performs the corresponding computation. It was a revolutionary solution to the challenge of scale: instead of varying the physical wiring between multiple arithmetic units, a single processor executes varying commands, one per clock cycle, to perform different functions.

On the other hand, after approaching the problem from various angles, the solution to the challenge of wiring that emerged was the integrated circuit (IC) invented in 1958 by electronics engineer Jack Kilby. It was an ingenious solution where photolithography is used to integrate many devices on one chip and connect them together all at once.

These two core approaches have been the driving force of the industry for more than half a century. However, two paradigm shifts are now moving the industry away from them.

First, there is a shift from the von Neumann architecture to neural networks. (Chapter 11)

Instead of repetitively and sequentially processing data that are moved back and forth between the processor and memory, neural networks process data simultaneously in parallel as they flow through the network. The result is a major improvement in energy efficiency.

The use of the von Neumann computer architecture resulted in mass consumption of processors and memory chips. But future growth is anticipated for the specialized chips which implement neural networks for AI processing. The core hardware for the industry will change from processors and memory chips to the physical wiring of neural networks. This is just like the evolution in living organisms from the brain stem and cerebellum to the cerebrum.

At birth, our brain has only about 50 trillion synaptic connections. But the number continues to grow and increases 20-fold by the time we enter elementary school. After that, as we learn, unused connections are gradually removed, resulting in a highly efficient brain network without waste at maturity. In other words, our brain is incomplete at birth, grows through playing, and achieves high efficiency through learning.

The formation of neural networks follows a similar path. There is currently a lot of active research on pruning of neural networks through machine learning.

The second paradigm shift moves the industry from process scaling to 3D integration. (Chapter 4)

Process scaling is finally approaching its limits. 3D integration can reduce the energy consumed in moving data by orders-of-magnitude. It is like taking the data that you previously had to go to the National Diet Library to retrieve and placing them within your arm’s reach.

As a result of these paradigm shifts, we must once again overcome the two fundamental challenges we faced in the 1950s. This presents an opportunity for disruptive technologies to shine as we approach the end of Moore’s Law.

Green Growth Strategy

From our analysis so far, we can see that the energy consumption problem is the root cause of various transformations in the industry. In order to increase energy efficiency, the industry is going through a changing of the guard from general-purpose to specialized chips, computer architecture is shifting from von Neumann to neural network, and the focus of integration technology is transforming from scaling to 3D integration.

Meanwhile, society is evolving from the capital-intensive industrial society to a knowledge-intensive intelligent society. Value is no longer found in low-cost chips that integrate a massive number of transistors. Instead, value is being shifted to the ability to process large amounts of data in a highly energy-efficient manner and the superior services that such ability creates.

However, going forward, the move towards carbon neutrality will create restrictions that weigh heavily on the industry. We must cut energy consumption aggressively. We will certainly have to transform our growth strategy from the current greedy strategy to a green strategy.

Fig. Transition from Greedy to Green Growth Strategy

The “three arrows” of the green growth strategy are the creation of 3D integration technology to eliminate the bottleneck in data processing, the construction of a platform for agile development of specialized chips, and the preservation of a domestic ecosystem.

Without improvement in energy efficiency there will be no growth; without improvement in development efficiency there will be no specialized chips. In other words, the highest priority challenge going forward is the pursuit of time-performance. Since time is money, the conventional cost-performance is also factored into time-performance.

Former British Prime Minister Winston Churchill, in addressing the challenges faced by his country, said, “One ought never to turn one's back on a threatened danger and try to run away from it. If you do that, you will double the danger. But if you meet it promptly and without flinching, you will reduce the danger by half.”

Meanwhile, Intel co-founder Robert Noyce once said, “Optimism is an essential ingredient for innovation. How else can the individual welcome change over security, adventure over staying in safe places?”

With resolve and optimism, let’s turn Japan’s semiconductor industry around, starting today.