Age of Great Chinese Dragon: Supercomputer Centers and High Performance Computing

The author describes Chinese supercomputer centers and networks. China currently has five National Supercomputing Centers, located in Tianjin, Shenzhen, Shanghai, Jinan, and Changsha. These cities were selected as pilot sites for an experiment in developing the cloud computing services market. The paper highlights the evolutionary and innovative directions in the development of exaflops supercomputers in the Pacific region. The evolutionary approach is the simplest and yields results quickly, but a supercomputer built this way will be effective only on a narrow class of problems and will have low energy efficiency. The innovative approach involves basic research and the development of new technologies, which is much more complicated and takes more time. Because of the stringent requirements on energy efficiency and performance, innovative technologies for exaflops supercomputers have much in common with technologies for highly efficient on-board and embedded systems. These technologies are called exascale technologies; they should make it possible to build single-board on-board supercomputers at the teraflops level and single-rack supercomputers at the petaflops level. The main problems in creating exaflops (exascale) systems are: increasing overall system performance by three orders of magnitude while the effect of Moore's law on the performance of an individual processor core weakens; and minimizing the energy and performance losses associated with data access, data transfer at all levels of the supercomputer hierarchy, and data storage.

Within the framework of the national 863 Program implemented by the Ministry of Science and Technology, which began in March 1986 (hence the name), the PRC leadership in 2006 adopted the five-year project "China National Initiative of High Productivity Computer and Grid Service Environment" with a budget of about $400 million. The goal of this project, referred to as 863/IT, is the development of special, or custom-made, strategic IT and, on that basis, the creation of promising stationary and on-board supercomputers designed to ensure national security and to solve the most important scientific and technical problems.
At the initial stages of the project, the use of the latest foreign technologies was allowed, but the main idea was to achieve full technological independence in the field of strategic IT [2].
One of the first results of the 863/IT project was the creation of the Milky Way-1 (Tianhe-1) supercluster at the petaflops performance level and two supercomputers of the same type with a performance of about 100 TFLOPS each. The Tianhe-1 supercomputer, created at the National University of Defense Technology, contains 6144 Intel Xeon E5450/5540 microprocessors and 5120 AMD ATI Radeon HD 4870 GPUs and has 98 TB of memory. Along with the sensational appearance of the Chinese supercomputers Tianhe-1 and Nebulae and, at the end of 2010, of the world's fastest supercomputer, Tianhe-1A, information appeared on China's initiatives to develop its own microelectronic component base. Some of these components have already been used in Tianhe-1A: the FT-1000 microprocessor and the Arch-NIC and NRC communication network VLSIs. More powerful supercomputers with a performance of 10-100 PFLOPS are planned to use new VLSIs of domestic design, for example: multi-core superscalar microprocessors of the Godson 3 and Godson 4 families with powerful short-vector processing units; and new-generation multi-threaded microprocessors FT-1500 and FT-2000 with 128 or more hardware-supported threads [3].
The FT-1000/1500 are microprocessors with a moderate number of threads. Little is known about the microprocessors currently under development: the FT-2000 and special-purpose microprocessors, namely massively multi-threaded microprocessors with hardware support for globally addressable memory, and hybrid streaming and multi-threaded microprocessors for special purposes.
Work is actively under way on ultra-high-performance on-board microprocessors. The following developments are known: FT-64, a stream-class microprocessor that is effective in signal and image processing; and MASA, a multi-core, multi-threaded streaming microprocessor. There are known plans to use massively multi-threaded microprocessors in in-flight systems. Work is also under way on an on-board radiation-resistant supercomputer similar to Boeing's MAESTRO (developed in the USA for space systems).

Roadmap and Geographic Topology
China currently has five National Supercomputing Centers, located in Tianjin, Shenzhen, Shanghai, Jinan, and Changsha. These cities were selected as pilot sites for an experiment in developing the cloud computing services market.
In China's "Medium- and Long-Term State Program for the Development of Science and Technology", the development of high-performance computers is included in the list of priority areas for domestic science and technology [4]. The Ministry of Science and Technology of China set a clear task: to master the key technologies for developing and manufacturing supercomputers capable of performing over 1 quadrillion operations per second (1 petaflops). The first and second state supercomputer centers in China were established in the northern coastal city of Tianjin and the southern city of Shenzhen.

Center in Tianjin
The Tianjin Supercomputing Center (National Supercomputing Center in Tianjin) is an open (commercial) site of the National University of Defense Technology (NUDT) of China.
The presence of China's first Tianhe-1A supercomputer allows the Tianjin Center to expand its cloud computing services in fields such as oil exploration and development, biopharmaceuticals, multidesign, and financial risk analysis.

Center in Shenzhen
The southern city of Shenzhen hosts another center, the National Supercomputing Center in Shenzhen. It is located ten kilometers from Hong Kong and serves as a technology center in the free economic zone; no classified work is carried out there.
Foreign companies finance the center and supply equipment. Its common technology platform, the largest of its kind, was created by the University of Hong Kong with the support of MOST, and the center takes the lead in connecting the Chinese Grid with the transatlantic telecommunications network in the USA and Japan.

Center in Jinan
In Jinan, the administrative center of the coastal province of Shandong (East China), the third center, the National Supercomputing Center in Jinan, was officially opened in autumn 2011. It was the first to install a computing system with peak performance of more than one petaflops (the SunWay supercomputer) equipped exclusively with domestic processors and software.
This means that China has become the third country in the world, after the USA and Japan, capable of creating a supercomputer with peak performance of more than one petaflops using its own central processors. It has been reported that the Shandong Provincial Academy of Sciences began building the Jinan Supercomputer Center in March 2011. The new center has since been built and commissioned [5].

Center in Shanghai
The Shanghai Supercomputer Center has a supercomputer called the Dawning-5000A Magic Cube, which can perform up to 180 trillion floating-point operations per second (180 teraflops) on the Linpack benchmark. The peak performance of the "Magic Cube" can reach 230 teraflops. Its applications are weather forecasting, architectural and engineering calculations, modeling for aviation needs, and earthquake prediction.
The supercomputer covers an area of 75 square meters and consumes about 700 kilowatts of power.
Funding for the creation of the Dawning 5000A came from the budget of the Ministry of Science and Technology of the PRC, as well as from the city budget of Shanghai. The cost of the computer was 200 million yuan or $29 million.

Center in Changsha
In the city of Changsha, construction of the fifth center, the Hunan Supercomputer Center (National Supercomputer Center in Hunan), has been completed. The complex is located on the grounds of a local university and is designed to host Tianhe-1A computers, which are among the most powerful systems in the world. Tianhe-1A topped the Top500 list in November 2010 as the world's fastest supercomputer. The computing power of this system is 2.566 petaflops. Tianhe-1A uses 7168 Nvidia Tesla M2050 GPUs and 14336 Intel Xeon server processors, which together consume 4.04 MW of power. According to Chinese sources, the computer in Changsha will be used for weather forecasting, pharmaceutical development, and research. Xinhua News Agency also reports that the computer will be used to render animation for Chinese cartoons.
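The performance and power figures quoted for the Dawning 5000A and Tianhe-1A allow a rough comparison of energy efficiency. The sketch below is only an illustration: it uses the numbers given in the text (Linpack performance and electrical power), and the function name and the extrapolation to one exaflops are assumptions made here, not part of the cited sources.

```python
def flops_per_watt(flops: float, watts: float) -> float:
    """Return floating-point operations per second per watt."""
    if watts <= 0:
        raise ValueError("power must be positive")
    return flops / watts

# Figures quoted in the text: (performance in FLOPS, power in watts).
SYSTEMS = {
    "Dawning 5000A": (180e12, 700e3),
    "Tianhe-1A": (2.566e15, 4.04e6),
}

for name, (flops, watts) in SYSTEMS.items():
    print(f"{name}: {flops_per_watt(flops, watts) / 1e6:.0f} MFLOPS/W")

# Hypothetical extrapolation: power needed for 1 EFLOPS at Tianhe-1A efficiency.
exa_watts = 1e18 / flops_per_watt(*SYSTEMS["Tianhe-1A"])
print(f"1 EFLOPS at Tianhe-1A efficiency: {exa_watts / 1e9:.2f} GW")
```

At the efficiency of Tianhe-1A, an exaflops machine would need on the order of a gigawatt, which illustrates why energy efficiency is treated as a central exascale problem.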

Other Data Centers
The Institute of Precise Sciences and Engineering Processes of the Chinese Academy of Sciences (Institute of Process Engineering, Chinese Academy of Sciences) and the Main Laboratory of the Ministry of Science and Technology of China (State Key Lab of Computational Physics) are located in Beijing, in the administrative zone of Bihai. These organizations are engaged in predictive modeling, technical calculations, aerohydrodynamics, solid state physics, microbiology, nuclear explosion simulations, air defense and missile defense, meteorology, and oil and gas exploration. The Academy of Sciences has also created the Computer Network Information Center (CNIC), which acts as the state Internet information center. On the same day, a CNIC working committee was created.

CERNET2 and Security Goals
In 2005 China began a project to create the second-generation Internet (I2), which consists of several dozen nodal points, GigaPoPs (gigabit-capacity points of presence), connected by a common backbone network. Today CERNET2, China's education and research network (China Education and Research Network), connects 25 universities in 20 cities, making it the largest network of its class in the world.
The rapid development of the Internet, especially its expansion to numerous mobile devices and consumer electronics, will soon exhaust the IPv4 address space of about 4.3 billion addresses. IPv6 will solve this problem for centuries to come. IPv6 uses a 128-bit address space, in which the theoretically possible number of hosts is on the order of 10 to the 38th power. This is more than enough to allocate a separate IP address to each physical object on Earth. New addresses are needed most of all in Asian countries, where the Internet is developing faster than in its homeland, the USA. 74% of IP addresses belong to the United States, while China, with its 80 million users, holds about as many as a single university campus in California.
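The address-space figures above follow directly from the address widths: 32 bits for IPv4 and 128 bits for IPv6. A minimal check in Python, using only the standard `ipaddress` module to cross-check the arithmetic (the variable names are illustrative):

```python
import ipaddress

IPV4_BITS = 32
IPV6_BITS = 128

ipv4_total = 2 ** IPV4_BITS
ipv6_total = 2 ** IPV6_BITS

print(f"IPv4 addresses: {ipv4_total:,}")      # a little under 4.3 billion
print(f"IPv6 addresses: {ipv6_total:.3e}")    # on the order of 10**38

# Cross-check against the standard library's view of the full address ranges.
assert ipaddress.ip_network("0.0.0.0/0").num_addresses == ipv4_total
assert ipaddress.ip_network("::/0").num_addresses == ipv6_total
```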
CERNET2 network speeds can reach 40GB/s, which is 4-20 times the average for existing university networks in the world. CERNET2 works only over IPv6. Network equipment was manufactured in China by Huawei Technologies and Tsinghua Bit-Way. The founders of the network plan to cover more than 100 universities in the near future.
The distributed denial-of-service (DDoS) attack has quickly established itself as one of the most common and devastating attacks in cyberspace. Despite their simplicity, DDoS attacks often have extremely damaging consequences. Recently, hackers who staged DDoS attacks were caught in China. They used simple attacks to disrupt communications services and destroy competitors' businesses. In response to growing concerns about DDoS, IT administrators around the world have recognized professional anti-DDoS solutions as an indispensable way to protect sensitive IT environments, businesses, and governments from attacks.
Huawei Symantec's anti-DDoS solution consists of three main parts (see below), including a detection center, a cleaning center, and a control center. These integrated components provide a high-quality policy and management integration, providing customers with a variety of professional solutions to combat DDoS attacks, tailored to the individual needs of the client.
The national-level provider CNGI-CERNET2 uses the following components:
a) Detection center: the detection center acts as the "antenna" of the solution. It receives detection policies from the control center, detects DDoS traffic, and sends the detection results back to the control center. Currently, the Huawei Symantec USG9310/9320 or USG5300ADI is used as the detection center.
b) Service and control center: the control center, the "brain" of the solution, provides users with Internet2 access, controls communication channels (speed, bandwidth, etc.), and monitors the use of telecommunication equipment resources.
The National Center has deployed 20 virtual private networks (the network unites 20 cities, with one regional virtual network in each city) based on the Huawei Symantec SVN3000 hardware and software complex, a Secure Sockets Layer (SSL)/IP Security (IPSec) virtual private network (VPN) secure access gateway with high capacity and high reliability. Through its SSL VPN feature, the SVN3000 provides secure, convenient, and easy-to-manage remote access for users such as staff on business trips, SOHO staff, partners, and customers. The SVN3000 also supports IPSec for cost-effective communication with remote branch offices. The SVN3000 is based on Huawei Symantec's secure hardware platform and distributed real-time operating system. It deeply integrates VPN technologies and features such as SSL VPN, IPSec VPN, firewall, and authentication, authorization, and accounting (AAA). Combining their strengths, the SVN3000 provides an access security solution for building IP VPN networks. It also supports precise control of authorized access to internal network resources. It provides both international encryption and decryption algorithms and the national encryption standard (the Dragon-Xing protocol, developed by Professor Den Lu of China's NUDT).
The SVN3000 provides comprehensive remote access to internal network applications. It gives remote terminals immediate access to web-based applications. It also supports access to applications and services such as secure web servers, file sharing, Notes, Exchange, File Transfer Protocol (FTP), Oracle, Telnet, Secure Shell (SSH), Remote Desktop Protocol (RDP), and virtual network computing (VNC). To accommodate the rapid development of new service applications in the future, the SVN3000 supports a VPN tunnel mode that provides access to all services.
In addition to general authentication modes, the SVN3000 supports authentication and authorization based on user names and on external authorization platforms, such as the Remote Authentication Dial-In User Service (RADIUS), the Lightweight Directory Access Protocol (LDAP), SecurID, X.509, and USBKey + digital certificate. This variety of authentication and authorization modes helps administrators centrally configure and manage user access, which considerably reduces support and maintenance costs. In addition, the SVN3000 provides system logs, administrator logs, and user access logs. It supports viewing logs by category and exporting them in real time, which helps the administrator perform external analysis and audit of the logs. For SSL VPN services, the SVN3000 provides a WebUI management interface in Chinese and English. Using the web-based management interface, the administrator configures resources and SSL VPN users and can monitor and control them in real time. As a professional access security gateway, the SVN3000 also supports command-line management and the Simple Network Management Protocol (SNMP).
The SVN3000 supports virtual gateway technology, with which several SSL VPN systems are implemented on a single device. Each of these virtual systems has its own administrator, users, and configuration, so the virtual systems are isolated from one another. Virtual gateway technology makes the device highly scalable. It is the main technology with which enterprises and operators can further improve security and provide security services while protecting their investment. One SVN3000 provides up to 128 virtual SSL VPN gateways.
Using Huawei Symantec's proprietary hardware platform, dedicated real-time operating system, and Versatile Routing Platform (VRP), the SVN3000 Access Security Gateway can provide better system performance and security than traditional VPN systems based on a general-purpose system. The SVN3000 provides a standard dual-power configuration and supports hot standby.

ASIC Miners and High Performance Computing
Since February of this year, rumors have circulated in the cryptocurrency community about Bitmain's development of the Antminer F3 ASIC miner for the Ethereum cryptocurrency, which uses the Dagger-Hashimoto hash function (V. Buterin's Ethash algorithm). Previously it was believed that only a graphics processor (GPU) could handle this algorithm, because the algorithm was designed to exclude ASICs, which are oriented toward high-speed computation but not toward working with large, high-bandwidth memory. This strategy worked: GPUs from NVIDIA and AMD sold well, and because GPUs are so widespread the range of users was quite wide. The number of GPUs in the Ethereum network is estimated at 7.5 million.
Leading domestic application experts also stated that only a GPU can handle the Ethash algorithm (and other hash functions of this memory-hard class), and this was used as another argument in favor of purchasing a license for a foreign GPU in order to reproduce it in Russia. Nevertheless, as domestic experts on processor architectures expected, these ideas about the exclusivity of foreign GPUs turned out to be erroneous. This is demonstrated by the appearance of Antminer F3 (its first industrial implementation appeared in early April 2018 as the Antminer E3 miner; see below) and is commented on in this note.
There is practically no reason to doubt miners of this type, even if their first industrial implementations do not reach the high characteristics that were predicted. The Dagger-Hashimoto algorithm first builds a large directed acyclic graph, which is a graph of recursive function calls and currently occupies up to 4 GB. Then, using SHA-256 hash functions (there is also a variant with SHA-3) in a "lazy" evaluation of function calls from this graph, it searches for values of certain unknowns that must satisfy a given condition. This computation requires a large amount of memory, which obviously cannot be the on-chip memory of the microprocessor and must be external memory, such as DDR memory or the GDDR5 memory of a GPU. This memory must also have high throughput: since each access takes dozens of processor cycles, accesses must be issued in parallel to achieve high performance. In addition, and this is extremely important, high throughput must be achieved under intensive irregular access with poor spatial and temporal locality of memory addresses.
GPU-based cards offer more than large, high-bandwidth GDDR5 memory. The key property is a different one: GPUs have a massively multi-threaded architecture. They support asynchronous threads (warps) and synchronous threads on CUDA cores, which together can generate the powerful stream of memory accesses needed to exploit the high bandwidth of GDDR5 memory. Thus, the main property of the GPU architecture that matters when the call graph of the Ethash algorithm is accessed irregularly is the asynchronous multithreading that the GPU provides in hardware. The Chinese developers at Bitmain used this when creating the Antminer F3 ASIC miner. The details are discussed below.
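The memory-hard idea described above can be sketched with a toy example. This is not the Ethash or Dagger-Hashimoto algorithm: the table size, the number of rounds, the mixing rule, and all function names are assumptions chosen for illustration, and `hashlib.sha3_256` stands in for the hash functions mentioned in the text. The sketch only shows why each step needs a data-dependent, irregular read from a large table.

```python
import hashlib
from typing import List, Optional

def build_dataset(seed: bytes, n_items: int) -> List[bytes]:
    """Fill a table by chaining hashes; a real dataset would be gigabytes in size."""
    items = [hashlib.sha3_256(seed).digest()]
    for _ in range(n_items - 1):
        items.append(hashlib.sha3_256(items[-1]).digest())
    return items

def toy_hash(dataset: List[bytes], header: bytes, nonce: int, rounds: int = 64) -> bytes:
    """Mix header and nonce with table entries selected by the running mix value."""
    mix = hashlib.sha3_256(header + nonce.to_bytes(8, "little")).digest()
    for _ in range(rounds):
        # The index depends on the current mix, so the access pattern is
        # irregular and cannot be predicted or prefetched in advance.
        index = int.from_bytes(mix[:4], "little") % len(dataset)
        mix = hashlib.sha3_256(mix + dataset[index]).digest()
    return mix

def mine(dataset: List[bytes], header: bytes, target: int,
         max_nonce: int = 100_000) -> Optional[int]:
    """Search for a nonce whose toy hash, read as an integer, is below the target."""
    for nonce in range(max_nonce):
        if int.from_bytes(toy_hash(dataset, header, nonce), "big") < target:
            return nonce
    return None

dataset = build_dataset(b"toy seed", 1024)
nonce = mine(dataset, b"block header", target=1 << 248)
print("found nonce:", nonce)
```

Because every round waits on a table lookup whose address is known only after the previous round, throughput depends on how many such lookups the hardware can keep in flight, not on raw arithmetic speed.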
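The link between multithreading and memory bandwidth can be estimated with Little's law: the number of memory requests that must be in flight equals bandwidth times latency divided by request size. The sketch below is a back-of-the-envelope illustration; the bandwidth, latency, and request size are hypothetical values chosen here and are not taken from the text or from any specific GPU.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return -(-a // b)

def outstanding_requests(bandwidth_bytes_per_s: int, latency_ns: int,
                         request_bytes: int) -> int:
    """Little's law: requests in flight needed to sustain the given bandwidth."""
    bytes_in_flight_times_1e9 = bandwidth_bytes_per_s * latency_ns
    return ceil_div(bytes_in_flight_times_1e9, request_bytes * 10**9)

# Hypothetical memory system: 200 GB/s, 400 ns access latency, 64-byte requests.
needed = outstanding_requests(200 * 10**9, 400, 64)

# If each hardware thread keeps one access in flight, this is also the
# number of threads required to saturate the memory.
threads_needed = ceil_div(needed, 1)
print(f"requests in flight: {needed}, threads needed: {threads_needed}")
```

Under these assumed numbers more than a thousand accesses must be outstanding at once, which a processor with a few single-threaded cores cannot provide but a massively multi-threaded one can.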
The Antminer F3 miner was expected to appear in the second or third quarter of this year. It was predicted that Antminer F3 would be equipped with three motherboards, each carrying six ASIC processor chips, with 32 DDR3 memory chips of 1 Gbit each (4 GB in total) for each of these processors. Thus, the total amount of memory in the miner would be 72 GB, addressed through a single address space, which is required when working with graphs.
The performance of the miner was predicted at 200-220 MH/s (according to other sources, up to 1500 MH/s), with a power consumption of at least 500 W. The predicted price of the miner was from $2500 to $3000. The indicated performance of 200-220 MH/s corresponds to a mining farm of 6-8 GPU video cards. In terms of cost, Antminer F3 compares favorably with such a farm: even at retail prices, such a farm would cost about 4.5-5 thousand dollars.
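The predicted memory total can be checked with simple arithmetic. The sketch below assumes that each of the 32 memory devices per ASIC holds 1 Gbit; this reading is an assumption of the sketch, chosen because it reproduces the 72 GB total, whereas 1 GB devices in the same layout would give 576 GB.

```python
BOARDS = 3            # motherboards per miner
ASICS_PER_BOARD = 6   # ASIC processors per motherboard
CHIPS_PER_ASIC = 32   # DDR3 memory devices per ASIC

def total_memory_gb(chip_capacity_gbit: int) -> float:
    """Total miner memory in gigabytes for a given per-device capacity in gigabits."""
    total_gbit = BOARDS * ASICS_PER_BOARD * CHIPS_PER_ASIC * chip_capacity_gbit
    return total_gbit / 8

print(total_memory_gb(1))  # 1 Gbit devices
print(total_memory_gb(8))  # 1 GB (8 Gbit) devices
```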
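The cost comparison can be made explicit. The sketch below uses only the predicted prices and hash rates quoted above; the function names are illustrative, and the result is a rough range rather than a measured figure.

```python
ASIC_PRICE_USD = (2500, 3000)      # predicted Antminer F3 price range
GPU_FARM_PRICE_USD = (4500, 5000)  # retail price range of a 6-8 GPU farm

def usd_per_mhs(price_usd: float, hashrate_mhs: float) -> float:
    """Purchase cost per MH/s of hash rate."""
    return price_usd / hashrate_mhs

def price_advantage(asic_price, farm_price):
    """(min, max) ratio of farm price to ASIC price at equal hash rate."""
    return min(farm_price) / max(asic_price), max(farm_price) / min(asic_price)

low, high = price_advantage(ASIC_PRICE_USD, GPU_FARM_PRICE_USD)
print(f"GPU farm costs {low:.1f}-{high:.1f} times more for the same hash rate")
print(f"ASIC at 200 MH/s: {usd_per_mhs(2500, 200):.1f} USD per MH/s")
```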
We assume that the power consumption of Antminer F3 is at least 500 watts. The GPU farm described above, properly configured, will consume 600-700 watts. Thus, on the combination of key characteristics (performance and power consumption), the new "killer of GPU miners", as Antminer F3 is billed, would surpass the usual GPU farm by a factor of 2.5-3, which, according to critics, is hardly a breakthrough.
From the fact that there are 32 memory chips per Antminer F3 ASIC processor, it was concluded that this was done not only to obtain a large memory capacity but also to achieve a high degree of memory interleaving, so that high throughput is obtained by performing memory operations in parallel. The next logical conclusion is that only a vector or a multi-threaded processor can work with such memory, since the program must be able to issue a large number of memory accesses for there to be anything to parallelize. A vector processor is less suitable for graphs than a multi-threaded one because accesses to the address space are unpredictable for graphs. Thus, the ASIC processor of the Antminer F3 miner must surely be multi-threaded, with no fewer than 64-128 hardware-supported threads.
At the beginning of this decade, according to information from the expert community, it was found that in China, starting from 2008-2009, NUDT (the National University of Defense Technology, controlled by military intelligence) was developing the CT-2 military supercomputer ("Thunder Clap"). This project was based on architectural studies from the Russian project "Angara", whose base microprocessor was the J7 (its advanced version with 256 threads is the J10). The CT-2 supercomputer was successfully implemented and, from 2017, was to be in combat operation at the Military Intelligence Center in the area of Sin-Yuan.
No information is available on other developments of multi-threaded microprocessors in China.
This led to a hypothesis: is the Antminer F3 ASIC processor an upgraded version of the CT-2 supercomputer microprocessor? Communication in the chats of Chinese experts on this issue indicated that this is indeed so. It turned out that technical documentation and designs for the CT-2 multi-threaded microprocessor were transferred from NUDT to Bitmain so that the Antminer F3 ASIC processor could be created on their basis. Since this microprocessor was a variant of the Russian J7, whose architecture the authors of this note developed and know well, it is probable that in the Bitmain ASIC processor the virtual address translation subsystem was simplified and special accelerator blocks were added to compute the SHA-256 or SHA-3 functions that are invoked in the Ethash algorithm.
It was also found that Han Tao, who was previously mentioned in Note No. 1 as one of the main specialists of the VLSI Laboratory at Fudan University in Shanghai, dealt with applied cryptography at Bitmain. This laboratory works closely with the University of Zhang-Zhou on the VLSI crypto-VLIW project, which is described in Note No. 1 by the authors. Han Tao has long collaborated with Mikri Zhang, Bitmain's chief systems architect and founder. Work on information security for the Antminer F3 miner (protection against attacks) was led by Qing Bo Wu, developer of the Kylin OS (a Chinese version of Linux) and leader of the main group of Chinese hackers, "Red Dragon". Han Tao, Qing Bo Wu, and Mikri Zhang previously worked together at the Institute of Computer Technology of the Chinese Academy of Sciences (ICT).
This institute was the lead organization in the Chinese analogue of the American DARPA HPCS program for creating supercomputers with promising architectures. The ACSN Angara project was conceived as the Russian analogue of this program, but it was not supported at the state level.
The last point to note about the Antminer F3 (E3) miner is that instead of the promised 1500 MH/s, it has so far delivered 180 MH/s. Most likely, these are "growing pains" of the implementation. Perhaps there are not yet six processors on the board, or the processors do not yet have fast enough accelerators. Perhaps the interprocessor network inside the miner, or something else, has not yet been fully worked out. What matters is something else: a fundamental step has been taken, namely the introduction of an independently developed specialized multi-threaded microprocessor that is not a copy of existing GPUs.

Result
Preliminary conclusions regarding VLSI computing miners are as follows.
It is most likely that for these VLSIs there was a common prototype, created and studied as a result of the joint work of several state scientific institutions in China.
It was possible to find specialists in Chinese chats related to work on computing VLSI for miners and to obtain general information about such VLSI.
General information about the VLSIs of interest is as follows:
a) Each is a hybrid multi-tile microprocessor;
b) Every tile contains a common part, consisting of a control RISC processor and memory, and a special part drawn from a set of specialized complex-functional devices;
c) Tiles are loaded with tasks dynamically.
VLSI computing boards are organized as a tree structure, and tasks are distributed over the board dynamically. In these developments, especially in recent years, the following architectural techniques of "programmed" specialization are used:
a) A set (on the order of ten) of heterogeneous complex-functional devices specialized for cryptographic algorithms (hereinafter PE, processor element);
b) Large decision fields (CA, calculation area) for parallel-pipelined processing of information, in the form of program-controlled and/or reconfigurable rows of heterogeneous PEs connected to each other by switches with a large number of inputs and outputs for regular data transmission between adjacent rows of PEs;
c) Auxiliary irregular data transmission paths between arbitrary rows of PEs through fast memory shared by these rows;
d) Auxiliary processor cores, such as RISC processors, for auxiliary functions that cannot be performed on the rows of specialized PEs.
Decision fields for parallel-pipelined processing of information are characteristic not only of cryptography but also of other application areas, for example signal and image processing.

Discussion
The increase in peak performance to the exaflops level comes from an O(10)-O(100) increase in the number of computing nodes and an O(10)-O(100) increase in the peak performance of the node itself. The spread in these figures is explained by the different approaches developers will take to building a computing node:
a) Microprocessors with "light", low-performance cores, in which case the supercomputer will have more nodes;
b) Microprocessors with "heavy", high-performance cores, in which case there will be fewer nodes;
c) Joint use of microprocessors with light and heavy cores, or even hybrid microprocessors with cores of both types.
Note that with these approaches, an O(10)-O(100) increase in node performance is achieved through an O(100)-O(1000) increase in node parallelism, and on the scale of the entire system through an O(1000)-O(10000) increase; this is the strongest growth among all the characteristics. The concurrency of a node is the number of processor cores multiplied by the number of hardware-supported threads per core. In the petaflops supercomputer of 2010, a node had two 6-core microprocessors with one thread per core. In trans-petaflops systems (dozens of petaflops), node parallelism is about a hundred, and in nodes with graphics accelerators a few hundred. In an exaflops supercomputer, node parallelism is expected to be from a thousand to ten thousand, and it is assumed that there will be effective means for interaction and synchronization among such a multitude of resources.
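The definition of node concurrency given above can be written out directly. In the sketch below, the 2010 node (two 6-core processors, one thread per core) comes from the text, while the exascale node configuration is purely hypothetical and was chosen only so that its concurrency falls in the expected range of one thousand to ten thousand.

```python
def node_concurrency(sockets: int, cores_per_socket: int, threads_per_core: int) -> int:
    """Node concurrency: total cores multiplied by hardware threads per core."""
    return sockets * cores_per_socket * threads_per_core

# Node of a 2010 petaflops supercomputer, as described in the text.
peta_node = node_concurrency(sockets=2, cores_per_socket=6, threads_per_core=1)

# Hypothetical exascale node (assumed values, for illustration only).
exa_node = node_concurrency(sockets=4, cores_per_socket=64, threads_per_core=16)

print(f"2010 node concurrency: {peta_node}")
print(f"hypothetical exascale node concurrency: {exa_node}")
print(f"growth factor: {exa_node / peta_node:.0f}x")
```

With these assumed values the growth in node concurrency is a few hundred times, which lies within the O(100)-O(1000) range stated above.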
Such an increase in node concurrency is associated with the solution of two problems: a) the weakening effect of Moore's law on the performance growth of an individual core; b) the "memory wall", caused by the long latencies of memory and network operations compared with the execution times of arithmetic-logical operations in processor cores (this drives an increase in the number of hardware-supported threads per core, which makes the cores tolerant of memory and network latencies). Both problems began to be addressed by increasing parallelism in the promising petaflops-level supercomputers created in the last decade. In projects for creating exaflops supercomputers, this will continue. Incidentally, the previously mentioned possibility of a "head-on" solution for building the first exaflops supercomputers through graphics accelerators and other types of accelerators rests precisely on the large number of cores and threads in them, as well as on their means of minimizing memory accesses, although these are implemented in a rather specific way. Building improved, innovative exaflops supercomputers will require new ways of using the huge parallelism of cores and threads, and of organizing their interaction and synchronization, through a radical change in the architecture and software of these supercomputers. The main features of these changes were aptly formulated, for example, in [6][7][8][9]: "Future exascale systems will be very different from traditional architectures with distributed memory and message passing, because they will:
a) Support a global address space (GAS) rather than distributed memory;
b) Be systems of multi-threaded rather than multi-processor architecture;
c) Be built on lightweight synchronization mechanisms for fast transfer of control rather than on global barriers, and have dynamic resource management rather than relying only on static distribution performed at compile time;
d) Rely on a powerful runtime system rather than on slow OS services;
e) Process data as soon as it is ready, moving code to the data, in contrast to current message-passing models, which move data to the place where it will be processed at some point in the future;
f) Rely on a microarchitecture built more on data-flow ideas than on processor cores that make intensive use of speculative execution, which saves energy;
g) Use Embedded Memory Processors (EMPs) to reduce latency and increase access rates for data-intensive processes, even with poor temporal locality, rather than rely on slow and energy-inefficient data migration through the cache hierarchy;
h) Use lightweight checkpoints held directly in memory, rather than coarse restarts from disk, for reliable recovery after failures, and use uniform methods for programming the various types of cores in heterogeneous structures;
i) Provide concurrency in the processor that programs must also be able to use".

Conclusion
Overcoming the exaflops level in a supercomputer requires, by and large, completely new technologies whose distinguishing features are high performance achieved through a sharp increase in parallelism, a lower cost of data transmission and storage, and, at the same time, high energy efficiency. These technologies can be applied to a wider range of problems: creating systems of different performance levels and different weight and size characteristics, which are now in demand, namely single-board supercomputers at the teraflops level, single-rack supercomputers at the petaflops level, and multi-rack supercomputers at the exaflops level. Because of this wider application, the technologies created for exascale supercomputers are themselves called exascale technologies; the name refers to exascale efficiency that is also demonstrated at other levels of performance. This should be kept in mind when evaluating development projects for exaflops supercomputers, given their possible application at other performance levels.
Exaflops projects are being carried out in a particular historical context, through the following processes of technology adoption and development.
First, the stage of overcoming the petaflops barrier in real (sustained) performance has been completed, and the supercomputing technologies and systems developed in the process are being deployed [8]. In the United States, this is associated mainly with Cray and IBM. During deployment these achievements are being reinforced with the component, design, and architecture and software technologies of recent years. It is against this background of deployment with gradual improvement that the "evolutionary drift" toward the first supercomputers with peak exaflops performance is taking place. Such supercomputers are expected two to three years before the end of this decade.
Secondly, fundamental work on efficient exaflops supercomputers for solving a wide range of problems has entered its active phase. These supercomputers are expected in the first third of the next decade and will use the ultimate capabilities of silicon technologies, whose development, following Moore's law, will be roughly complete by that time [10,11].
Soon afterwards, these supercomputers will approach the physically determined performance limit of traditional supercomputers, which are essentially irreversible (the Landauer limit). This limit, called the "Sterling point", is currently estimated at 32-128 exaflops.
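The Landauer limit referred to above sets a minimum energy of k_B * T * ln 2 for every irreversibly erased bit. A minimal sketch of that bound follows. It assumes room temperature (300 K) and a purely hypothetical number of bits erased per operation, and it does not attempt to reproduce the 32-128 exaflops estimate, whose underlying assumptions are not given in the text.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # Boltzmann constant, exact SI value


def landauer_energy_joules(temperature_k: float) -> float:
    """Minimum energy dissipated when one bit is irreversibly erased."""
    return BOLTZMANN_J_PER_K * temperature_k * math.log(2)


def landauer_power_watts(ops_per_second: float,
                         bits_erased_per_op: float,
                         temperature_k: float = 300.0) -> float:
    """Lower bound on dissipated power for an irreversible machine.

    bits_erased_per_op is a hypothetical parameter used for illustration;
    the source does not state a value.
    """
    energy_per_op = bits_erased_per_op * landauer_energy_joules(temperature_k)
    return ops_per_second * energy_per_op


if __name__ == "__main__":
    print(f"{landauer_energy_joules(300.0):.3e} J per erased bit at 300 K")
    # Hypothetical workload: 1e18 operations/s, 1e4 bits erased per operation.
    print(f"{landauer_power_watts(1e18, 1e4):.2f} W lower bound")
```

The bound applies only to irreversible operations, which is why reversible computing is of interest for machines beyond this limit.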
Thirdly, many research groups have already launched large-scale work on a new element base that would be ready for practical use after the end of Moore's law, i.e. one that in 10-15 years will, to one degree or another, replace and supplement silicon technology. Several such technologies are currently considered possible candidates [12,13]. Developers of traditional supercomputers are known to be critical of these technologies; however, after studying the materials and talking with specialists, the authors of this work formed the impression that a significant breakthrough in these developments took place in 2011-2012. There is even evidence from Japanese scientists that the goal has been set of creating a specialized reversible supercomputer based on quantum dots with yottaflops performance (10^24 operations per second) in the middle of the next decade, and with zettaflops performance in 2020, within the framework of the Echelon-3 project and a prospective missile defense system for the west coast of the United States and its allies in the Asia-Pacific region.
Exaflops supercomputers must have improved memory-subsystem and communication-network characteristics. The key indicators are memory size, memory bandwidth, memory operation latency, network bandwidth (point-to-point and bisection), and network operation latency. Two directions of development for exaflops supercomputers have been identified: evolutionary and innovative. The evolutionary approach is the simplest and yields results quickly, but a supercomputer built this way will be effective only on a narrow class of problems and will have low energy efficiency. The innovative approach involves basic research and the development of innovative technologies [14,15].