
NVIDIA Tegra Mobile Super Processors

Submitted By
Anoop.P S7 CSE 9206 MEC

The Mobile Challenge - High Performance and Low Power


The processing capabilities of computers and the internet connection bandwidth available to users have increased by several orders of magnitude over the last two decades. Consequently, there has been explosive growth in the amount and variety of content available on the internet. Web pages now include complex Java elements, high definition video content, and Flash-based games. There are over a hundred million websites on the internet, and over six billion videos are viewed on YouTube every month. Farmville, a Flash-based game, has alone attracted over 78 million new players on Facebook.

Users today want to access and enjoy this rich content not only on their PCs but also on their mobile devices, such as tablets and smartphones. They expect a mobile device to deliver long battery life and also to provide the same full high definition web experience that they get from a PC. In addition to web browsing, users expect their mobile devices to handle a variety of use cases such as gaming, audio playback, navigation, and photo/video capture.

Current mobile devices are unable to offer both the performance and the battery life required for users to enjoy a high definition web experience. Most mobile device architectures are designed around a general-purpose CPU. The CPU, primarily designed to handle general-purpose operations, is very inefficient at handling specialized audiovisual tasks such as decoding HD video, 3D gaming, and Flash video playback. The CPU not only underperforms but also consumes excessive power for today's mobile use cases. Relying on a power-hungry CPU either compromises battery life to offer good performance or delivers substandard experiences. This is starkly evident in the number of complaints in mobile gadget magazines, websites, and forums about current mobile devices providing only a few hours of battery life. In fact, numerous articles have been written to help consumers extend battery life, and these articles basically recommend that users turn off many of the smart features of their smartphones. We call this the power-performance paradox.

The NVIDIA Tegra processor resolves this paradox with a new purpose-optimized processor architecture that delivers a responsive experience for full-resolution Web pages, full HD 1080p video playback, and days of battery life for typical use.

NVIDIA Tegra Multi-Processor Architecture


To overcome the power-performance paradox and the limitations associated with single-processor architectures, a revolutionary ground-up approach was taken in the design of the NVIDIA Tegra processor. Extensive architectural simulations and modeling of actual mobile and Web use cases showed that a heterogeneous multiprocessor architecture was required to meet both the performance and battery life needs of current and future mobile use cases. A heterogeneous multiprocessor by definition consists of several processors that are different in structure and function, with each processor optimized for a specific purpose.

The NVIDIA Tegra architecture is a heterogeneous multi-processor architecture that consists of eight independent processors for graphics, video encode and decode, image processing, audio processing, power management, and general-purpose functions. These processors are power managed independently, with local hardware control and system-level control built into each processor. A system-level power monitor allows the Tegra processor to turn on only those processors required for a specific use case, while keeping all other processors turned off.

Tegra's purpose-optimized processors are analogous to a carpenter's tool kit that has various tools such as a hammer, saw, screwdriver, and drill. A carpenter does not use a hammer to drive screws or a drill to cut wood. Each tool is optimized for a specific task or function. Likewise, the Tegra processors are all specialized for particular tasks:

Graphics Processor (GPU): delivers outstanding mobile 3D game playability and is also used for visually engaging, highly responsive 3D touch user interfaces.

Video Decode Processor: runs video macro-block oriented algorithms including inverse discrete cosine transforms (IDCT), variable length decode (VLD), color space conversion (CSC), and bit stream processing to deliver smooth, full frame rate 1080p HD video playback and streaming HD Flash video playback without compromising battery life.

Video Encode Processor: runs video encode algorithms to deliver full 1080p HD video streams for video recording and conferencing capabilities.

Image Signal Processor (ISP): handles light balance, edge enhancement, and noise reduction algorithms to deliver real-time photo enhancement capabilities.

Audio Processor: handles analog signal audio processing to deliver over 140 hours of continuous 128 kbps MP3 audio playback on a single battery charge.

Dual-Core ARM Cortex A9 CPU: handles general-purpose computing and delivers faster web browsing and snappier response times on Java-enabled websites.

ARM7 Processor: handles system management functions and several proprietary battery life extending features on NVIDIA Tegra.

Each processor adds instructions, caches, clocks, and circuits optimized for each specific task, with performance monitors to track system activity. Each processor is designed to provide the performance needed for demanding and heavy workload tasks such as browsing Java-enabled Web pages, HD video playback, responsive 3D user interfaces, and gaming. The proprietary power management features in the NVIDIA Tegra processor track system workload levels and continuously manage the frequency and voltage of each processor to provide the demanded performance at the lowest possible power.

Using a set of dedicated processors, the system can easily handle multi-tasking loads that include a combination of 3D, video, communications, and audio tasks. When loads do not require multi-tasking, only the processor required for the task is used while all other processors in the system are turned off, thus achieving the lowest power consumption possible.

Use Case Examples


Following are three example use cases to illustrate how the NVIDIA Tegra power management system optimizes processor use:

Listening to Music: During audio playback, the dedicated audio processor and the low-power ARM7 processor are turned on and all other processors are turned off. Turning off all the other processors allows the ARM7 processor to fetch a playlist and efficiently decode the audio using only 30 mW of power on average. If the device is in camera mode, the audio processor can be turned off and the ISP is turned on to grab camera sensor data at the click of a button. The ISP can then clean up the image and hand it off to the core processor to save to storage.

Watching Videos: When watching movies, the video decode processor handles all of the H.264 decode of 1080p content and can stream the video to the built-in HDMI port for viewing of HD content on a big screen. In this case, the display processor and the actual display screen can be turned off to save significant power. For example, the Tegra-powered Zune HD player from Microsoft can provide eight hours of HD playback with a cell phone-sized battery in this mode.

Playing Games: During game play, only the CPUs and the highly optimized NVIDIA GPU are turned on, and the video processors and ISP are turned off. This ensures that the high-performance processors consume power only when demanded and consume zero power when not in use.
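The three use cases above amount to a lookup from use case to the set of processors left powered on. The following is a minimal sketch of that idea, not NVIDIA's implementation: the processor names, the `power_plan` helper, and the pairing of the ARM7 with video decode are assumptions made for illustration.

```python
# Hypothetical processor names; "cortex_a9_cpu" stands for the dual-core CPU.
ALL_PROCESSORS = frozenset({
    "cortex_a9_cpu", "arm7", "gpu",
    "video_decode", "video_encode", "isp", "audio",
})

# Processors each use case keeps powered, paraphrasing the use cases above.
# Pairing the ARM7 with video decode is an assumption of this sketch.
ACTIVE_SETS = {
    "music": frozenset({"audio", "arm7"}),
    "video": frozenset({"video_decode", "arm7"}),
    "gaming": frozenset({"cortex_a9_cpu", "gpu"}),
}


def power_plan(use_case):
    """Return (powered_on, powered_off) processor sets for a use case."""
    powered_on = ACTIVE_SETS[use_case]
    return powered_on, ALL_PROCESSORS - powered_on


if __name__ == "__main__":
    for name in ACTIVE_SETS:
        on, off = power_plan(name)
        print(name, "on:", sorted(on), "off:", sorted(off))
```

Keeping the mapping as data rather than branching logic makes it easy to add further use cases, such as camera capture with the ISP enabled.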

Eight Purpose Optimized Processors


The diagram below shows the NVIDIA Tegra purpose-optimized processors. Each of these processors is power managed at a global level through software and locally through hardware mechanisms.

Dual-Core ARM Cortex A9 CPU: NVIDIA Tegra features the world's first dual-core CPU for mobile applications, with support for Symmetric Multi-Processing (SMP). SMP enables tasks to be parallelized between the two ARM Cortex A9 processors, and delivers faster load times of Web pages, quicker UI responsiveness, and faster rendering of complex Web pages. NVIDIA has implemented aggressive dynamic voltage and frequency scaling to minimize the power used by the dual-core Cortex A9 CPU. A typical Web browsing activity consists of a short initial period of Web page load followed by a longer period for reading the content of the page. When a new Web page is opened, the CPU springs to life at full speed to quickly load the page and process Java elements in the page. The Global Power Management System monitors the workload presented to the CPU and tunes the CPU voltage and frequency to match the needs of the workload. When the CPU is not in use (for example, after a Web page has loaded and the user is reading the content of the page), power to the CPU is completely turned off with full power gating. Even when Web pages have Flash visual elements, the CPU can be completely turned off because the Flash visual features are offloaded to the GPU and video processor in the NVIDIA Tegra architecture.

ARM7 Processor: To avoid the use of the higher power Cortex A9 CPU for tasks that have lower performance requirements, an ARM7 processor is included in the Tegra architecture. This processor is primarily used in concert with the audio and video processors to deliver ultra low-power audio and low-power video playback capabilities.

Ultra Low-Power Graphics Processor (GPU): The NVIDIA Tegra processor includes a market-leading NVIDIA GPU with programmable vertex shading and pixel shading processors. The hyper-efficient 2D/3D processor renders user interfaces, pictures, fonts, and other graphical elements instantly using only battery-sipping amounts of power. The 3D processor can render Quake 3 at 1024x600 resolution at over 60 frames per second using only a few hundred milliwatts. The GPU in NVIDIA Tegra is also capable of handling the demanding workloads presented by modern game engines. The Unreal game engine, which is used by several hundred PC games, runs effortlessly on the NVIDIA Tegra processor. This capability enables game developers to port PC games to NVIDIA Tegra based mobile platforms with minimal effort. The GPU can also render full-screen Flash animations using only 150 mW. This is critical for long battery life when browsing the Web or playing Flash-based games. Unlike ARM CPUs, the GPU in the Tegra processor has dedicated hardware to handle video workloads. ARM CPUs are generic processors, and the CPU hardware is not optimized for specific tasks. Thus, they struggle to handle Flash content and render Flash video and graphics elements at very low frame rates while consuming relatively large amounts of power. The GPU is specifically optimized to decode Flash video and graphics elements, and is able to decode Flash content at full frame rates and render Flash visual elements at much lower power levels.

HD Video Decode Processor: The video decode processor is used to decode video streams from files that are either played from the local disk or streamed off the network. Over 80% of today's top Web sites include video, most of which is Flash-based video content. The dedicated video processor handles all three Flash video formats: H.264, Sorenson, and VP6-E. As a result, Flash video runs on NVIDIA Tegra at full frame rates and consumes very low power. Other mobile solutions that use general-purpose CPUs for Flash video deliver stuttering images and drain battery life. The decode processor is based on several generations of NVIDIA hardware decode expertise and is able to deliver flawless 1080p video playback while consuming less than 400 mW of power.

HD Video Encode Processor: The video encode processor can be used to encode 1080p content coming in from an HD video image sensor at 30 fps. The encoded video stream may then be saved locally or transmitted live over a broadband network. The video encode and decode processors can be used together to deliver HD video conferencing.

Audio Processor: The dedicated audio processor is used to deliver ultra low-power audio playback that can last for days. The audio processing pipeline is designed to consume less than 30 mW of power while playing back a 128 kbps MP3 file.

Image Signal Processor (ISP): Many mobile devices include high resolution sensors for taking pictures and lower resolution cameras for Web video. The ISP in the NVIDIA Tegra processor is capable of taking raw camera sensor input at up to 12 megapixel resolution and 30 frames per second. The image processor can then apply real-time image enhancement algorithms like automatic white balance, edge enhancement, and noise reduction, and can even adjust for poor lighting conditions. The output of the ISP is then ideal to save as a picture or as a stream that can be compressed for live video conferencing or sent with HD quality across a broadband network.

The eight independent processors of the NVIDIA Tegra multi-processor architecture deliver unprecedented performance and multimedia capabilities to mobile devices. NVIDIA Tegra implements a Global Power Management System that utilizes hardware feedback and software algorithms to deliver industry-leading battery life for today's mobile devices.

NVIDIA Tegra Global Power Management System


Distributed across the NVIDIA Tegra multi-processor architecture are numerous hardware monitors that collect data on activity levels, frequency, temperature, and incoming request patterns. The Global Power Management System uses the data fed back from these monitors to determine the optimal operating frequency and voltage for the active processors. This type of feedback control ensures that the processors are always running at the lowest possible power level. Additionally, a feed-forward control mechanism is used to instantaneously boost operating frequency and voltage for specific requests that indicate upcoming high performance requirements. This ensures that users do not experience any lag or performance degradation for activities such as 3D UI navigation and gaming.
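The combination of feedback and feed-forward control described above can be sketched as a small governor function. This is an illustrative sketch only: the operating points, thresholds, and the `next_operating_point` name are hypothetical and are not taken from any Tegra documentation.

```python
# Hypothetical (frequency in MHz, voltage in V) operating points, lowest first.
OPERATING_POINTS = [(100, 0.80), (300, 0.90), (600, 1.00), (1000, 1.10)]


def next_operating_point(current_index, utilization, boost_hint=False,
                         up_threshold=0.85, down_threshold=0.40):
    """Pick the index of the next operating point for one processor.

    utilization is the fraction of time the processor was busy (0.0 to 1.0),
    as reported by a hardware activity monitor. boost_hint models a request
    that signals upcoming high performance needs (feed-forward).
    """
    top = len(OPERATING_POINTS) - 1
    if boost_hint:
        # Feed-forward: jump straight to the highest point, no ramping.
        return top
    if utilization > up_threshold:
        # Feedback: the processor is saturated, step up one point.
        return min(current_index + 1, top)
    if utilization < down_threshold:
        # Feedback: the processor is mostly idle, step down one point.
        return max(current_index - 1, 0)
    return current_index


if __name__ == "__main__":
    index = 1
    for load, hint in [(0.2, False), (0.2, False), (0.1, True), (0.6, False)]:
        index = next_operating_point(index, load, hint)
        print(OPERATING_POINTS[index])
```

The feedback path steps gradually to settle at the lowest sufficient power level, while the feed-forward path skips the ramp so that interactive requests see no lag.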

Ground-Breaking Performance and Battery Life


The table below illustrates battery life the user can expect from an NVIDIA Tegra-based tablet (with 2000 mAh battery and 400 mW display).
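Battery life figures of this kind follow from dividing battery energy by average load. The sketch below shows only that arithmetic; it does not reproduce the table's values. The 3.7 V nominal cell voltage and the `battery_hours` helper are assumptions, and the loads count only the components named in the text (the under-400 mW video decode figure, the 30 mW audio figure, and the 400 mW display), so real devices would see shorter run times.

```python
def battery_hours(capacity_mah, nominal_voltage, load_mw):
    """Estimate run time in hours from battery capacity and average load."""
    energy_mwh = capacity_mah * nominal_voltage  # mAh x V = mWh
    return energy_mwh / load_mw


if __name__ == "__main__":
    CAPACITY_MAH = 2000      # battery size quoted in the text
    NOMINAL_VOLTAGE = 3.7    # assumed lithium-ion nominal voltage
    DISPLAY_MW = 400         # display power quoted in the text

    # 1080p playback: decode upper bound (400 mW) plus the display.
    print(battery_hours(CAPACITY_MAH, NOMINAL_VOLTAGE, 400 + DISPLAY_MW))
    # Audio playback with the display off, audio pipeline only.
    print(battery_hours(CAPACITY_MAH, NOMINAL_VOLTAGE, 30))
```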

Optimized for Key Mobile Use Cases


Mobile use case studies show that most mobile devices are typically in an active standby state eighty percent of the time, and process intensive mobile applications twenty percent of the time. Think of Active Standby as the state in which the device is sitting in your pocket or on a desk: the user is not actively interacting with it, and the processor is running either background tasks or low-performance applications that do not require user interaction. On the other hand, when you're using your device for browsing the web, checking Email, gaming, running multimedia applications, playing back media, and so on, the device is in a performance-intensive mode that may require one or more CPU cores running at higher frequency ranges.

Keep in mind that when the device is in the active standby state, many tasks are still happening in the background: Email syncs, social media syncs, live wallpapers, active widgets, and more. Such tasks can be processed by just a single CPU core running at much lower frequencies. Users generally do not care how fast the background tasks are processed, only that they happen and do not consume much battery life. Battery life can therefore be improved significantly by minimizing the active standby power consumption of mobile processors.

Silicon Process and its Impact on Power and Frequency


Power consumption of a silicon device is equal to the sum of leakage power and dynamic power. Leakage power is mainly determined by the silicon process technology, and dynamic power is determined by silicon process technology and by operating voltages and frequencies. Dynamic power of a silicon device is proportional to the operating frequency and, more importantly, proportional to the square of the operating voltage.

Total Power = Leakage Power + Dynamic Power
Dynamic Power ∝ Frequency x Voltage^2

When a silicon device is operating at or near peak frequency, the total power consumption of the device is dominated by the dynamic power consumption, and when the device is idle or operating at near idle conditions, the leakage power is a significant portion of the total power consumption.

Transistors on fast process technologies consume high leakage power and have very fast switching times at normal voltage levels. Therefore, CPU cores built on fast process technologies (CPU A in Figure 2) consume high leakage power under idle or active standby conditions, but are capable of running at higher frequency ranges without requiring significant increases in operating voltage.

Transistors on low power process technologies have low leakage power but have slower switching times at normal voltage levels, and to enable them to switch faster (for high frequency operations) they need higher than normal voltage ranges. CPU cores built on low power process technologies (CPU B in Figure 2) consume very low leakage power but require higher than normal voltage levels to operate at very high frequencies. Therefore, they consume excessive amounts of dynamic power and can cause significant power and thermal issues.

The following simplified statements effectively convey the concept:

Fast Process = Optimized for high frequency operation, but higher leakage
Low Power Process = Operates at lower frequency with lower leakage
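The trade-off between the two process types can be made concrete with the power relations above. In the sketch below, every number (leakage, effective capacitance, voltages) is an illustrative assumption chosen only to show the crossover; none of them is measured data for any real core.

```python
def dynamic_power(c_eff, frequency_hz, voltage):
    """Dynamic power in watts: proportional to frequency x voltage^2.

    c_eff is the proportionality constant (effective switched capacitance).
    """
    return c_eff * frequency_hz * voltage ** 2


def total_power(leakage_w, c_eff, frequency_hz, voltage):
    """Total power = leakage power + dynamic power."""
    return leakage_w + dynamic_power(c_eff, frequency_hz, voltage)


# Illustrative assumptions only (not measured values).
C_EFF = 1e-9           # farads
FAST_LEAKAGE_W = 0.10  # fast process: high leakage
LP_LEAKAGE_W = 0.01    # low power process: low leakage

# Near idle, both cores run at 100 MHz and 0.8 V: leakage dominates.
fast_idle = total_power(FAST_LEAKAGE_W, C_EFF, 100e6, 0.8)
lp_idle = total_power(LP_LEAKAGE_W, C_EFF, 100e6, 0.8)

# At 1 GHz the low power process is assumed to need 1.3 V versus 1.0 V.
fast_peak = total_power(FAST_LEAKAGE_W, C_EFF, 1e9, 1.0)
lp_peak = total_power(LP_LEAKAGE_W, C_EFF, 1e9, 1.3)

if __name__ == "__main__":
    print("idle:", fast_idle, lp_idle)
    print("peak:", fast_peak, lp_peak)
```

Under these assumptions the low power core wins near idle and the fast core wins at peak frequency, which is the shape of the two curves the text attributes to CPU B and CPU A.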

Figure 1 Background tasks that are typically running when device is in active standby state

To meet the rapidly increasing demands of performance-intensive mobile use cases and extended battery life, it is becoming increasingly difficult to minimize both the active standby and dynamic power consumption of CPU cores. By combining both silicon processes (defined above), along with architectural optimization, a single SoC (System-on-a-Chip) can be optimized for both high performance and low power consumption.

Variable Symmetric Multiprocessing


NVIDIA's Project Kal-El is the world's first mobile SoC device to implement a patented Variable Symmetric Multiprocessing (vSMP) technology that not only minimizes active standby state power consumption, but also delivers on-demand maximum quad core performance. In addition to four main Cortex A9 high-performance CPU cores, Kal-El has a fifth low power, low leakage Cortex A9 CPU core called the Companion CPU core that is optimized to minimize active standby state power consumption and handle less demanding processing tasks. Project Kal-El also includes other patented vSMP technologies that intelligently manage workload distribution between the main cores and the Companion core based on application and operating system requirements. This management is handled by NVIDIA's Dynamic Voltage and Frequency Scaling (DVFS) and CPU Hot-Plug management software and does not require any other special modifications to the operating system.

The Companion core is designed on a low power process technology, but has an internal architecture identical to that of the main Cortex A9 CPU cores. Since it is built on a low power process, in the low performance ranges (and frequencies) it consumes lower power than the main CPU cores that are built on a fast process technology. Power-performance measurements on Kal-El show that the Companion core delivers higher performance per watt than the main cores at operating frequencies below 500 MHz, and therefore the maximum operating frequency of the Companion core is capped at 500 MHz. Table 1 compares and contrasts the Companion core to the four main cores on Kal-El.

Figure 2 Mobile CPU Power-Performance curves (power versus performance: CPU A, on a fast process, has higher leakage power in active standby; CPU B, on a low power process, has lower leakage power but consumes more power at higher performance ranges)

The Companion core is used primarily when the mobile device is in active standby and performing background tasks such as Email syncs, Twitter updates, Facebook updates etc. It is also used for applications that do not require significant CPU processing power, such as streaming audio, offline audio, and both online or offline video playback. Note that both audio and video playback, in addition to video encoding, are largely processed by hardware-based encoders and decoders. Unlike the Companion core, the main CPU cores need to operate at very high frequencies to deliver high performance. Therefore they are built on a fast process technology which allows them to scale up to very high operating frequencies at lower operating voltage ranges. Thus the main cores are able to deliver high performance without significant increases in dynamic power consumption.

Figure 3 Low Power Companion CPU on Kal-El

                             Power optimized Companion CPU Core    Performance optimized main CPU Cores
Architecture                 Cortex A9                             Cortex A9
Process Technology           Low Power (LP)                        General/Fast (G)
Operating Frequency Range    0 MHz to 500 MHz                      0 MHz to Max GHz

Table 1 Companion and Main CPU Core features

Using the combination of performance-optimized main cores and a power-optimized Companion core, Variable Symmetric Multiprocessing technology not only delivers ultra-low power consumption in active standby states, but also on-demand peak quad core performance for performance-hungry mobile applications such as gaming, Web browsing, Flash media, and video conferencing. vSMP technology successfully combines the power-performance benefits of the power-optimized CPU B and the performance-optimized CPU A shown in Figure 2, and delivers a power-performance curve that looks like the one shown in Figure 4.

Operating System Transparent Implementation


The Android 3.x (Honeycomb) operating system has built-in support for multi-processing and is capable of leveraging the performance of multiple CPU cores. However, the operating system assumes that all available CPU cores are of equal performance capability and schedules tasks to available cores based on this assumption. Therefore, in order to make the management of the Companion core and main cores totally transparent to the operating system, Kal-El implements both hardware-based and low level software-based management of the Companion core and the main quad CPU cores. Patented hardware and software CPU management logic continuously monitors CPU workload to automatically and dynamically enable and disable the Companion core and the main CPU cores. The decision to turn on and off the Companion and main cores is purely based on current CPU workload levels and the resulting CPU operating frequency recommendations made by the CPU frequency control subsystem embedded in the operating system kernel. The technology does not require any application or OS modifications.

Workload-Based Dynamic Enabling and Disabling of CPU Cores


When the Companion core is turned off and the mobile processor is using the main cores for processing, the CPU Governor and CPU management logic continue to monitor CPU workload and utilization of each of the main cores, dynamically enabling or disabling one to four of the main cores. For example, applications such as Email, basic games, or text messaging typically need the processing power of just one of the four main cores. For more demanding applications such as Flash-intensive Web browsing or heavy multitasking, the CPU manager may turn on two CPU cores. But to meet the peak performance demands of applications such as console class gaming and media editing or creation, all four CPU cores would be turned on to deliver the peak performance demanded by the application(s).

Figure 4 Power-Performance curve of Companion core plus quad main cores running on vSMP technology (axes: power versus performance)

Figure 5 CPU core management based on workload (Companion Core = ON, Main Cores = OFF: background tasks, audio, video, Email syncs, social media syncs, etc. Companion Core = OFF, Main Cores = ON: single core performance for Email, 2D games, basic browsing, maps, etc.; dual core performance for Flash-enabled browsing, multitasking, video chat, etc.; quad core performance for console class gaming, faster browsing, media processing)
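The core management behavior described in this section can be sketched as a small decision function. This is a simplified illustration, not the patented logic: the `select_cores` name and the use of a runnable-thread count as a stand-in for per-core utilization are assumptions. Only the 500 MHz Companion cap and the rule that the Companion and main cores are never on together come from the text.

```python
COMPANION_MAX_MHZ = 500  # Companion core frequency cap stated in the text
MAIN_CORE_COUNT = 4


def select_cores(requested_mhz, runnable_threads):
    """Return (companion_on, main_cores_on) for the current workload.

    requested_mhz models the frequency recommended by the OS frequency
    governor; runnable_threads is a hypothetical proxy for how much
    parallel work is pending.
    """
    if requested_mhz <= COMPANION_MAX_MHZ and runnable_threads <= 1:
        # Light workload: run on the Companion core, main cores gated off.
        return True, 0
    # Heavier workload: Companion off, enable one to four main cores.
    main_cores = max(1, min(MAIN_CORE_COUNT, runnable_threads))
    return False, main_cores


if __name__ == "__main__":
    for mhz, threads in [(200, 1), (800, 1), (1000, 2), (1000, 6)]:
        print(mhz, threads, select_cores(mhz, threads))
```

Because the function never returns the Companion core together with a non-zero main core count, the operating system only ever sees a set of identical cores.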

Architectural Advantages of vSMP Architecture


Variable SMP technology has several architectural advantages compared to other solutions, such as asynchronous clocking.

Cache Coherency: Since vSMP technology does not allow both the Companion core and the main cores to be enabled at the same time, there are no penalties involved in synchronizing caches between cores running at different frequencies. The Companion and main cores share the same L2 cache, and the cache is programmed to return data in the same number of nanoseconds for both Companion and main cores (essentially, it takes more main core cycles versus fewer Companion core cycles because the main cores run at a higher frequency).

OS Efficiency: The Android OS assumes that all available CPU cores are identical with similar performance capability and schedules workloads to these cores accordingly. When multiple CPU cores are each run at different asynchronous frequencies, the cores end up with differing performance capabilities. This could lead to OS scheduling inefficiencies. In contrast, vSMP technology always maintains all active cores at a similar synchronous operating frequency for optimized OS scheduling. Even when vSMP switches from the Companion core to one or more of the main CPU cores, the CPU management logic ensures a seamless transition that is not perceptible to end users and does not result in any OS scheduling penalties.

Power Optimized: Each core in an asynchronous clocking based CPU architecture is typically on a different power plane (also known as a voltage rail or voltage plane) so that the voltage of each core can be adjusted based on its operating frequency. This could result in increased signal and power line noise across the voltage planes and negatively impact performance. Since each voltage plane may require its own set of voltage regulators, these architectures may not be easily scalable as the number of CPU cores is increased. The additional voltage regulators increase BOM (Bill of Materials) cost and power consumption. If the same voltage rail is used for all cores, then each core will run at the voltage required by the fastest core, thus losing the advantage of the voltage squared effect for power reduction.

Since vSMP technology does not suffer from the penalties of cache synchronization and scheduling across cores running at asynchronous frequencies, it can deliver higher performance than architectures that use asynchronous clocking technologies.
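The Cache Coherency point, that a fixed L2 latency in nanoseconds costs more cycles on a faster core, is simple arithmetic. The sketch below assumes a hypothetical 20 ns L2 latency and a 1 GHz main core clock; neither figure comes from the source, and `latency_cycles` is an illustrative helper.

```python
import math


def latency_cycles(latency_ns, core_mhz):
    """Convert a fixed latency in nanoseconds to whole core clock cycles."""
    # cycles = time x frequency; ns x MHz gives thousandths of a cycle.
    return math.ceil(latency_ns * core_mhz / 1000)


if __name__ == "__main__":
    L2_LATENCY_NS = 20  # hypothetical value for illustration
    print("Companion core at 500 MHz:", latency_cycles(L2_LATENCY_NS, 500))
    print("Main core at 1000 MHz:", latency_cycles(L2_LATENCY_NS, 1000))
```

The wall-clock latency is the same in both cases; only the cycle count seen by each core differs.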

Architectural Challenges and Solutions


The vSMP architecture introduces several challenges, and several unique solutions have been created to address them.

Switch time: vSMP must ensure that switching between the Companion CPU core and the main CPU cores does not result in slower application load times or perceptible lags in user experience. To address this, NVIDIA implemented advanced circuits and logic to enable high-speed switching. Internal simulations show that the total switching time, including the time to switch cores within [...]

Core Thrashing: vSMP must prevent frequent back-and-forth switching between the Companion and main cores when the workload varies around the threshold value used to trigger a switch between the cores, which would result in poor performance and negate any power savings benefits. To address this issue, sufficient intelligence and programmable hysteresis control is built into the CPU management algorithms that continually monitor and adapt to the workload, preventing any thrashing between the cores.

Power Benefits of Variable Symmetric Multiprocessing


vSMP technology delivers significant power savings by minimizing both leakage power consumption in the active standby state, using the Companion core, and dynamic power consumption at peak operating frequencies, using the four main cores. Depending on the use case, vSMP technology dynamically enables and disables CPU cores to deliver the desired performance at the lowest possible power. The chart below illustrates that Project Kal-El consumes less power for all use cases. The chart compares Tegra 2 and Project Kal-El, both manufactured on TSMC 40nm process technology.

Chart: Power Savings on Kal-El due to vSMP (data labels: 28%, 14%, 61%, and 34% lower)

Power Benefits of Quad Core compared to Dual Core


In addition to vSMP technology, it's important to remember that more cores are better for power management than fewer cores. As an example, quad core CPUs deliver lower power at all performance points compared to dual core CPUs. The reason is that four cores can run at a lower frequency, and hence a lower voltage, when processing the same amount of work as a dual core CPU. Since power is proportional to the square of the voltage, the overall CPU power can be significantly lower while still completing the same amount of work.

Table 2 shows the measured power consumption and performance levels of Project Kal-El versus competing dual core processors running the Coremark benchmark, a popular mobile benchmark that measures single- or multi-core CPU performance. Note in the table below that Project Kal-El's CPU consumes 2-3x less power than competing solutions when it is constrained to the same level of performance, i.e., each processor completing roughly 5k of Coremark work. And even when Kal-El is run at a higher frequency, completing more than 2x the amount of Coremark work, it still consumes less power than the dual core solutions.

Mobile Processor                                  Measured Power (mW)    Coremark Performance
Project Kal-El (each core running at 480 MHz)     579                    5589
OMAP4 (each core running at 1 GHz)                1501                   5673
QC8660 (each core running at 1.2 GHz)             1453                   5690
Project Kal-El (each core running at 1 GHz)       1261                   11667

Table 2 Measured Power and Performance on Project Kal-El and Competing Processors
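The 2-3x claim can be checked directly from the Table 2 figures. The sketch below assumes the row pairing implied by the text (Kal-El at 480 MHz and the two dual core parts near 5k Coremark, Kal-El at 1 GHz near 11.7k); `TABLE_2` and `coremark_per_watt` are names introduced here for illustration.

```python
# (processor, measured power in mW, Coremark score), paired as in Table 2.
TABLE_2 = [
    ("Project Kal-El @ 480 MHz", 579, 5589),
    ("OMAP4 @ 1 GHz", 1501, 5673),
    ("QC8660 @ 1.2 GHz", 1453, 5690),
    ("Project Kal-El @ 1 GHz", 1261, 11667),
]


def coremark_per_watt(power_mw, score):
    """Performance per watt: Coremark score divided by power in watts."""
    return score / (power_mw / 1000)


if __name__ == "__main__":
    baseline_mw = TABLE_2[0][1]
    for name, power_mw, score in TABLE_2:
        ratio = power_mw / baseline_mw
        print(f"{name}: {coremark_per_watt(power_mw, score):.0f} Coremark/W, "
              f"{ratio:.2f}x the power of Kal-El at 480 MHz")
```

At roughly equal Coremark scores, the two dual core parts draw about two and a half times the power of Kal-El at 480 MHz, consistent with the 2-3x statement.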

Figure 6 Power savings on Project Kal-El due to vSMP technology (see Note 1)

Note 1: Power measured as a sum of application processor power and DRAM power after normalizing for other system variables. LP0 is the lowest power state of the respective Tegra devices.

Note that Kal-El consumes lower power than competing dual core processors even when it has all four CPU cores running at 1 GHz. Because Kal-El uses fast process technology for its performance-optimized CPU cores, these four cores are able to operate at higher frequencies using lower operating voltage ranges than competing processors. Since dynamic power consumption is proportional to the square of the operating voltage, Kal-El achieves significant power savings even when operating at higher frequencies.

As the performance requirements of mobile applications increase, SoC vendors are adopting multi-core processor architectures not only to deliver the increased performance, but also to keep power consumption within mobile budgets. The Variable Symmetric Multiprocessing (vSMP) technology employed in Project Kal-El delivers a new level of power savings that not only minimizes power consumption during active standby states, but also delivers quad core performance benefits while keeping dynamic power consumption within the thermal budgets required for mobile devices. The use of the Companion CPU core for background tasks and of the main cores for performance-intensive tasks enables Project Kal-El to deliver significantly lower power than competing mobile processors at all performance levels. Quad core CPUs and variable SMP technology will enable mobile devices to further push the performance envelope, and allow application and game developers to deliver new mobile experiences, all while extending battery life for the most popular use cases.
