
An Energy-Efficient Deep Learning Processor with Heterogeneous Multi-Core Architecture for Convolutional Neural Networks and Recurrent Neural Networks
Dongjoo Shin, Jinmook Lee, Jinsu Lee, Juhyoung Lee, and Hoi-Jun Yoo

School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST)
291 Daehak-ro, Yuseong-gu, Daejeon, Republic of Korea
E-mail: imdjdj@kaist.ac.kr

Abstract: An energy-efficient deep learning processor is proposed for convolutional neural networks (CNNs) and recurrent neural networks (RNNs) in mobile platforms. The 16mm2 chip is fabricated in 65nm technology with 3 key features: 1) a reconfigurable heterogeneous architecture to support both CNNs and RNNs, 2) an LUT-based reconfigurable multiplier optimized for dynamic fixed-point with on-line adaptation, and 3) quantization-table-based matrix multiplication to reduce off-chip memory access and remove duplicated multiplications. As a result, this work shows 20x and 4.5x higher energy efficiency than [2] and [3], respectively. The DNPU also shows 6.5x higher energy efficiency than [5].
(Keywords: deep learning, convolutional neural network, recurrent neural network, heterogeneous, LUT)

I. Introduction
Deep learning is being researched and deployed ever more widely because of its overwhelming performance. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are the key networks of deep learning, and each has different strengths: CNNs excel at extracting visual features, while RNNs are well suited to processing sequential data. CNNs are used for vision tasks such as image classification and face recognition, and RNNs are used for language tasks such as translation and speech recognition. Moreover, by combining CNNs and RNNs, more complex intelligence such as action recognition and image captioning can be realized [1]. However, the computational requirements of CNNs are quite different from those of RNNs. Convolution layers (CLs) in CNNs require a massive amount of computation with a relatively small number of parameters. On the other hand, fully-connected layers (FCLs) in CNNs and RNN layers (RLs) require a relatively small amount of computation with a huge number of parameters. Therefore, when FCLs and RLs are accelerated on hardware dedicated to CLs, they suffer from high memory transaction costs, low PE utilization, and a mismatch of computational patterns. Conversely, when CLs are accelerated on FCL- and RL-dedicated hardware, they cannot exploit data reusability or achieve the required throughput. Several works have considered the acceleration of CLs, such as [2-4], or of FCLs and RLs, such as [5]. However, there has been no work that can support CLs, FCLs, and RLs together. Therefore, we present a deep learning processor that supports both CNNs and RNNs with high energy efficiency for battery-powered mobile platforms.

II. Processor Architecture


The overall architecture of the proposed deep learning processor is shown in Fig. 1. It is composed of the CL processor (CP), the FCL-RL processor (FRP), and a top RISC controller. The CP consists of 4 convolution clusters, 1 aggregation core, and a custom instruction-set-based controller. The FRP performs matrix multiplication with a quantization (Q)-table-based multiplier, and it has an activation-function LUT to support the sigmoid and hyperbolic tangent functions. The CP and the FRP have heterogeneous architectures matched to their respective workload characteristics. Fig. 2 shows the detailed architecture of the CP, which contains 16 convolution cores and 1 aggregation core. Unlike previous works [2-3], the CP does not use a global memory as a cache; instead, it adopts a distributed memory organization. It can therefore devote more area to processing logic and achieve higher energy efficiency thanks to the shorter distance from memory to processing logic. Each convolution core has its own image memory (16KB), weight memory (1KB), and 48 processing elements. The CP contains 3 types of networks: a data NoC for image and weight transactions, a partial-sum network for accumulation, and an instruction network for simultaneous core control. As shown in Fig. 3, in an FCL, input neurons are fully connected to the output-layer neurons, and this computation maps directly onto matrix multiplication. An RL looks quite different, but its equations can also be mapped onto the same form of matrix multiplication. Therefore, FCLs and RLs can share the same matrix-multiplication hardware.
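To make the mapping explicit (standard formulations of the two layer types, not reproduced from Fig. 3), both reduce to the same matrix-vector form:

```latex
% FCL: one matrix-vector product per layer
\mathbf{y} = f(W\mathbf{x} + \mathbf{b})

% RL (vanilla RNN step): stacking the input and the previous hidden
% state turns the recurrence into the same matrix-vector form
\mathbf{h}_t = f(W_x \mathbf{x}_t + W_h \mathbf{h}_{t-1} + \mathbf{b})
            = f\!\left( [\, W_x \;\; W_h \,]
              \begin{bmatrix} \mathbf{x}_t \\ \mathbf{h}_{t-1} \end{bmatrix}
              + \mathbf{b} \right)
```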
Fig. 4 shows the data distributions and data ranges of each layer of a CNN, in this case VGG-16. The distributions and ranges change dynamically from layer to layer. Floating-point numbers are effective for handling such a wide range of values, but the hardware cost of floating-point operations is much higher than that of fixed-point operations. To take the advantages of both floating-point operation (wide data range) and fixed-point operation (low-cost hardware), layer-by-layer dynamic fixed-point with on-line adaptation is proposed. Fig. 5 shows the proposed layer-by-layer dynamic fixed-point architecture with on-line adaptation, together with its optimized LUT-based reconfigurable multiplier. A previous work [3] exploited dynamic fixed-point to reduce the word length. In this work, further reductions are achieved through on-line adaptation via overflow monitoring. If the current output MSB cannot represent the final accumulation result from the 30b accumulation paths, an overflow counter is incremented; the same is done for the 2nd MSB. Each overflow count is compared against a threshold (0.2% in this case), and a new fraction length is then calculated to meet the desired overflow ratio. With this scheme, 66.3% top-1 accuracy is achieved with 4b word lengths, whereas 32b floating point shows 69.9%.
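As an illustration only (the paper gives no pseudocode; the two-counter structure follows the MSB/2nd-MSB description above, but the exact update rule is an assumption), the adaptation can be sketched as follows:

```python
# Sketch of on-line fraction-length adaptation via overflow monitoring.
# The 0.2% threshold is from the paper; the +/-1 or -2 bit update rule
# is an assumed policy for illustration.

def adapt_fraction_length(accum_results, word_len=16, frac_len=8,
                          threshold=0.002):
    """Pick a new fraction length so that at most `threshold` of the
    30b accumulation results overflow the `word_len`-bit output."""
    n = len(accum_results)
    msb_overflows = 0    # result needs one more integer bit than available
    msb2_overflows = 0   # result needs two more integer bits
    for acc in accum_results:
        int_bits = word_len - 1 - frac_len   # integer bits excluding sign
        if abs(acc) >= 2 ** int_bits:
            msb_overflows += 1
        if abs(acc) >= 2 ** (int_bits + 1):
            msb2_overflows += 1
    if msb_overflows / n > threshold:
        # Too many overflows: shrink the fraction part by 1b or 2b,
        # depending on whether the 2nd-MSB counter also tripped.
        frac_len -= 2 if msb2_overflows / n > threshold else 1
    elif msb_overflows == 0:
        # No overflows at all: spend one more bit on the fraction part.
        frac_len += 1
    return frac_len
```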
In the convolution operations, multiplication with the same weight is repeated 100 to 100,000 times. Using this property, an LUT-based multiplier is constructed with an 8-entry physical LUT and an 8-entry logical LUT. Moreover, four 4b multiplication results and two 8b multiplication results are obtained at no additional cost, and these can be used for the respective word lengths.
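A minimal software analogue of this weight-stationary LUT multiplication is sketched below. The exact physical/logical LUT encoding of Fig. 5 is not spelled out in the text, so the split used here (entries 0-7 stored, 8-15 derived by a shift-and-add) is an assumption:

```python
# Weight-stationary LUT multiplication: the weight stays fixed across
# many consecutive multiplications, so products with every 4b operand
# nibble are served from a small table instead of a full multiplier.

def build_lut(weight):
    # 8 physical entries: weight * 0 .. weight * 7, precomputed once
    # and amortized over the 100-100,000 reuses of the same weight.
    return [weight * k for k in range(8)]

def lut_lookup(lut, nibble, weight):
    # 8 "logical" entries: weight * 8 .. weight * 15 are derived as
    # (weight << 3) + weight * (nibble - 8). Assumed encoding.
    if nibble < 8:
        return lut[nibble]
    return (weight << 3) + lut[nibble - 8]

def lut_multiply_16b(weight, x):
    """Multiply a fixed weight by a 16b operand via four 4b lookups."""
    lut = build_lut(weight)
    result = 0
    for i in range(4):                       # four 4b slices of x
        nibble = (x >> (4 * i)) & 0xF
        result += lut_lookup(lut, nibble, weight) << (4 * i)
    return result

assert lut_multiply_16b(123, 45678) == 123 * 45678
```

The same table naturally yields the shorter-word-length products: each 4b slice lookup is itself a 4b multiplication result, and pairs of slices compose 8b results.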
In FCLs and RLs, weights can be quantized to lower bit lengths. Simulations on ImageNet classification and Flickr8K image captioning show a negligible error increase, or even improvement in some cases, with 4-bit weight quantization. Using this weight quantization, multiplication can be replaced by table look-up. The detailed architecture of the Q-table-based FRP is shown in Fig. 6. Each entry of the Q-table contains the pre-computed multiplication result between a 16b fixed-point input and a 16b fixed-point weight. After the Q-table is constructed once, only the quantized indexes are loaded and decoded to select an entry. In the FRP, there are 127 entries in total, and they can function as one 7b Q-table or as eight 4b Q-tables. With the Q-tables, off-chip accesses are reduced by 75%, and 99% of the 16b fixed-point multiplications are avoided.
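The following sketch illustrates the idea in software. The quantization method (uniform) and table layout are assumptions for illustration; the chip stores pre-computed 16b fixed-point products and streams only the weight indexes:

```python
import numpy as np

# Q-table-based matrix-vector multiplication: weights are quantized to
# 2**QBITS centroids, each input element is multiplied by every centroid
# exactly once, and the remaining work is table look-ups and adds.

QBITS = 4  # 4b weight quantization -> 16 centroids

def quantize_weights(W):
    levels = np.linspace(W.min(), W.max(), 2 ** QBITS)   # centroids
    idx = np.abs(W[..., None] - levels).argmin(axis=-1)  # nearest level
    return idx.astype(np.uint8), levels

def qtable_matvec(idx, levels, x):
    """Approximate W @ x using len(x) * 16 multiplications instead of
    rows * len(x): only the indexes (4b each) select table entries."""
    qtable = np.outer(x, levels)     # pre-computed products, (len(x), 16)
    return np.array([qtable[np.arange(len(x)), row].sum() for row in idx])

W = np.random.randn(256, 128)
x = np.random.randn(128)
idx, levels = quantize_weights(W)
y = qtable_matvec(idx, levels, x)    # approximates W @ x
```

Loading 4b indexes instead of 16b weights is also what cuts the off-chip traffic by 75%.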

III. Implementation Results


The proposed deep learning processor [6], shown in Fig. 7, is fabricated in 65nm CMOS technology and occupies a 16mm2 die area. The proposed work is the first CNN-RNN SoC and achieves the highest energy efficiency (8.1TOPS/W). As shown in Fig. 8, it operates from a 0.765-1.1V supply with a 50-200MHz clock frequency. The power consumption at 0.765V and 1.1V is 34.6mW and 279mW, respectively. For a given frame rate, energy efficiency and bit-width (accuracy) can be traded off against each other: for the CP, the word length can be changed from 4b to 16b, and the quantization bit-width can be configured from 4b to 7b. Fig. 9 shows a performance comparison with the 3 previous deep learning SoCs. This work is the only one that supports both CNNs and RNNs. Compared to [2] and [3], this work shows 20x and 4.5x higher energy efficiency, respectively. It also shows 6.5x higher energy efficiency than [5].

IV. Conclusion
A highly reconfigurable deep learning processor with a heterogeneous architecture and dedicated LUT-based multipliers is proposed for CNNs and RNNs. The processor, implemented in 65nm technology, achieves 8.1TOPS/W energy efficiency. As shown in Fig. 10, it is successfully demonstrated with image captioning and action recognition.

References
[1] Vinyals, O. et al., "Show and Tell: A Neural Image Caption Generator," CVPR, pp. 3156-3164, 2015.
[2] Chen, Y. et al., "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," ISSCC Dig. Tech. Papers, pp. 262-263, 2016.
[3] Moons, B. et al., "A 0.3-2.6TOPS/W Precision-Scalable Processor for Real-Time Large-Scale ConvNets," Symp. on VLSI Circuits, 2016.
[4] Sim, J. et al., "A 1.42TOPS/W Deep Convolutional Neural Network Recognition Processor for Intelligent IoT Systems," ISSCC Dig. Tech. Papers, pp. 264-265, 2016.
[5] Han, S. et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network," ISCA, 2016.
[6] Shin, D. et al., "DNPU: An 8.1TOPS/W Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural Networks," ISSCC Dig. Tech. Papers, pp. 240-241, 2017.

