
TMS320C67x vs. ADSP-21160: Which Floating-Point DSP Offers Highest Performance?

Criteria for Processor Selection

By Ian Main, TechOnline (08/22/1998 0:00 AM EDT)

The Texas Instruments TMS320C67x and the Analog Devices ADSP-21160 SHARC processors are the two highest performance floating-point DSPs on the market today. Which of these two processors provides the highest system performance? System engineers must select the device that provides the most effective solution to meet the requirements of their DSP application. While the obvious step is to compare the raw processing power of the two processors, this comparison will give little indication of expected system performance, especially in highly demanding multiprocessing applications.

The selection of the better DSP platform from a systems perspective requires an analysis of many aspects of the application. Firstly, the I/O data rates and channel density must be reviewed to determine the bandwidth in and out of the system. The next step involves the mapping of DSP algorithms to DSP devices. This may be complex and requires an understanding of I/O data paths, memory management, inter-processor communication capability and synchronization mechanisms. While the resolution of these issues determines the best technical solution, other factors also require consideration. For example, time-to-market is influenced by the availability of third-party library support and the characteristics of the development tools accompanying each processor.

This paper discusses the factors that the system engineer should consider in selecting one of these processing platforms over the other. The discussion includes an analysis of specific applications to illustrate the system parameters that should be compared in the decision process. The paper also deals with the support for each platform, highlighting tools that assist the developer in achieving the highest performance in each case.

Raw Performance: Chip vs. Chip

A comparison of the two components logically begins with an analysis of the features of each device. Rather than a comprehensive feature list, this section summarizes the features that differentiate the performance of each; full specifications are available in the datasheets provided by each vendor. As detailed specifications were not available for the 'C67x and the '21160 at the time of writing, some parameters (such as power consumption) are not addressed here.

| Feature | Analog Devices '21160 SHARC (100 MHz) | Texas Instruments 'C6701 (167 MHz) |
|---|---|---|
| Performance | | |
| Peak (MFLOPS) | 600 | 1 GFLOP |
| Sustained (MFLOPS) | 400 | 500-700 (Spectrum's estimate) |
| MMACs | 200 | 334 |
| Bandwidth: External Memory | 534 MB/s | 667 MB/s |
| I/O Bandwidth: Total | 1.134 GB/s (+ Serial Ports) | 667 MB/s (+ Host Port) (+ Serial Ports) |
| Interrupt Latency | 4 x 10 ns cycles | 11 x 6 ns cycles |
| Core Features | | |
| Number of Data Registers | 32 (+32 alternate) | 32 |
| Extended Floating-Point Support | 40-bit extended precision | 64-bit double precision |
| Peripheral Features | | |
| Internal Memory Size | 4 Mbit (2 x 2 Mbit) | 1 Mbit (2 x 512 kbit) |
| Program Memory Structure | Configurable, 48-bit instructions | 16k x 32 (2K x 256-bit instructions) |
| Data Memory Structure | Configurable: 16-, 32-, 48- or 64-bit | 16k x 32, 8/16/32/40/64-bit data |
| Cache | 32 instructions (if selected) | Entire internal program memory (if selected) |
| DMA Channels | 14 | 4 + 1 (HPI) |
| I/O Capability | | |
| Primary External Data Interface | 64 bits @ 66 MHz | 32 bits @ 167 MHz |
| Serial Ports | 2 @ 66 Mbit/s | 2 @ 83.5 Mbit/s |
| Other Ports | 6 x Link Ports @ 100 MB/s | Slave Host Port @ clk/4 (41.5 MB/s) |
| Inherent Multiprocessing Support | Cluster and Link Ports | None |

From the above table, it can be seen that the raw processing power of the 'C6701 exceeds that of the '21160 by approximately 30%. In general, this gives the 'C6701 an advantage in single processor low and medium bandwidth configurations. However, the '21160, with substantially more total I/O bandwidth (1.134 GB/s against 667 MB/s in the table) and four times the internal memory capacity, is a more appropriate solution in high-bandwidth and multiprocessing applications. Of course, the 'C6701 also has significant I/O bandwidth and, with the assistance of external hardware, it may also be used effectively in multiprocessing architectures. The merits of each are investigated in the multiprocessing section.

The following three sections compare the '21160 and 'C67x families, beginning with a review of the implications of different memory support for each. This is followed by an analysis of algorithm distribution and data flow in multiprocessing systems, and ends with a study of the system I/O capability of each.
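The derived entries in the table can be cross-checked with a few lines of arithmetic. The following is a minimal sketch in Python that uses only figures taken from the table; the variable names are illustrative, and serial-port and host-port contributions are left out, as they are in the table's totals.

```python
# Figures copied from the feature comparison table.
LINK_PORTS = 6
LINK_PORT_MB_S = 100        # per Link Port
ADSP_EXT_MEM_MB_S = 534     # '21160 external memory bandwidth

# Total '21160 I/O bandwidth = external port + all Link Ports
# (serial ports excluded, as in the table).
adsp_total_io_mb_s = ADSP_EXT_MEM_MB_S + LINK_PORTS * LINK_PORT_MB_S

# Interrupt latency in nanoseconds = number of cycles x cycle time.
adsp_irq_latency_ns = 4 * 10
ti_irq_latency_ns = 11 * 6

# Internal memory capacity ratio: 4 Mbit against 1 Mbit.
internal_memory_ratio = 4 // 1

print(f"'21160 total I/O bandwidth: {adsp_total_io_mb_s} MB/s")
print(f"Interrupt latency: '21160 {adsp_irq_latency_ns} ns, "
      f"'C6701 {ti_irq_latency_ns} ns")
print(f"Internal memory ratio ('21160 : 'C6701): {internal_memory_ratio}:1")
```

The 1.134 GB/s total is simply the external port plus six Link Ports, which is why the '21160 figure depends on data actually being available on the Link Ports.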

Local Memory Support

It is clear that the '21160 gains the upper hand when it comes to internal memory capacity. However, it is rare that an entire application and its associated data can be accommodated in internal memory for either of these devices. It is therefore worth investigating the external memory options available in each case and considering the performance.

High-Performance Memory Support

There are many instances where the algorithm developer needs high performance external memory, but in some circumstances, it is critical to the application. With the advent of synchronous burst memory support and ever-increasing on-chip memory, application code is usually executed from internal memory for highest performance. However, critical variables (such as filter tap coefficients) must frequently be stored externally due to a limitation of internal resources. Both the '21160 and the 'C67x support high-performance external memory.

The '21160 processor may be interfaced to asynchronous SRAM (ASRAM) or synchronous SRAM (SSRAM), either 32 or 64 bits wide. It supports synchronous and sequential burst transfers for the efficient transfer of large blocks of data. The '21160's DMA controller automatically packs external data (16-, 32-, 48-, or 64-bit) into the appropriate internal word width, either 64 bits or 48 bits wide.

The 'C67x directly supports 32-bit SSRAM, SDRAM, and ASRAM as its high-performance resource. This memory will likely be available at 167 MHz by the time the DSP is shipping, allowing for single cycle access. The pipeline delay of SSRAM should be taken into account in throughput considerations; this adds three cycles for each first access. The consequence here is that critical sections of code must be run from internal DSP memory, as it will require more than 8 clock cycles to load a single 256-bit fetch packet (eight 32-bit instructions) from any external memory.
In summary, the 167 MHz synchronous interface of the 'C67x gives it an external memory access advantage of 668 MBytes/s vs. 528 MBytes/s, an advantage of some 25%. However, this can only be realized for multiple consecutive external accesses where the pipeline delay becomes negligible, and it is somewhat negated by the fact that the '21160 has significantly larger internal memory. In cases where consecutive instructions must be accessed from external memory, the processing performance of both devices drops. The theoretical performance of the 'C67x can be reduced from 1 GFLOP to 167 MFLOPS. The '21160's peak performance drops from 600 MFLOPS to 396 MFLOPS due to the clock differential between external and internal buses.

High-Density Memory Support

In data driven applications (e.g. imaging and radar), the DSP requires high-density memory for temporary storage of data. Usually memory access is sequential due to the correlated nature of the data. With the addition of some external logic, the '21160 may be interfaced to low cost bulk DRAM with one or two 15 ns wait states. The 'C67x, on the other hand, supports a glueless connection to SDRAM. As with SSRAM, there is a pipeline latency of three cycles, but sequential accesses take two 6 ns clock cycles. Paging and refresh delays also need to be considered, as these will result in non-deterministic delays of ten cycles or more. In spite of this, SDRAM clearly has an access advantage over DRAM when making sequential accesses to large sets of data.
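The external memory figures quoted in this section follow from bus width and clock rate. The sketch below, in Python, reproduces them under stated assumptions: the function and variable names are illustrative, the 'C67x degraded figure assumes one instruction completes per external bus cycle, and the SDRAM burst estimate ignores paging and refresh delays.

```python
def bandwidth_mb_s(bus_bits, clock_mhz):
    """Peak bus bandwidth in MB/s: bytes per transfer x million transfers per second."""
    return bus_bits // 8 * clock_mhz

ti_ext_mb_s = bandwidth_mb_s(32, 167)     # 'C67x: 32-bit bus at 167 MHz
adsp_ext_mb_s = bandwidth_mb_s(64, 66)    # '21160: 64-bit bus at 66 MHz
advantage_pct = (ti_ext_mb_s / adsp_ext_mb_s - 1) * 100

# '21160 running from external memory: peak rate scaled by the
# external (66 MHz) to internal (100 MHz) clock ratio.
adsp_degraded_mflops = 600 * 66 // 100

# 'C67x running from external memory: a 256-bit fetch packet needs eight
# transfers on a 32-bit bus, so at best one instruction arrives per cycle.
cycles_per_fetch_packet = 256 // 32
ti_degraded_mflops = 167 * 8 // cycles_per_fetch_packet

def sdram_burst_ns(n_words, cycle_ns=6, latency_cycles=3, cycles_per_word=2):
    """Time for a sequential SDRAM burst, ignoring paging and refresh."""
    return (latency_cycles + cycles_per_word * n_words) * cycle_ns

print(ti_ext_mb_s, adsp_ext_mb_s, round(advantage_pct, 1))
print(adsp_degraded_mflops, ti_degraded_mflops)
print(sdram_burst_ns(64))
```

For long bursts the three-cycle SDRAM latency becomes a small fraction of the total time, which is the sense in which the pipeline delay becomes negligible for consecutive accesses.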

What about Multiprocessing?

The "sub-system" (device and local memory) comparison presented above does not address system performance concerns associated with multiprocessor implementations using either of the devices. If multiprocessing is necessary to meet either the real-time demands of the application or high I/O rates, DSP system performance becomes more relevant than device features. System performance considers algorithm and data distribution in addition to inter-processor communication capability. This section explores the performance that can be expected using multiprocessor configurations of both devices.

Data Storage and Distribution

Here we consider the movement of data between algorithms in a multiprocessing system. Whether using a '21160 or 'C67x platform, it is a good practice to decouple the flow of data from the actual processing algorithms. This can be done using DMA co-processors to manage data flow between sub-systems by transferring large blocks between intermediate buffers. This is particularly important for the 'C67x, where optimized inner loops running on the DSP cannot be interrupted to service I/O or manage data. By decoupling data structures, these software pipelines will be allowed to run to completion, ensuring peak performance. Of course, if extreme low latency is a requirement, 'C67x loops must be unrolled at the expense of code size. Even then, the memory pipeline of the processor results in a latency when switching tasks (an 11 instruction latency to flush the pipeline and vector to the new address). In contrast, inner loops on the '21160 processor are interruptible, making it easier to balance low-latency I/O performance with optimum CPU performance.

When it comes to distributing data around a multiprocessor system, the '21160 supports this directly through both Link Ports and broadcast capabilities of the multiprocessor cluster architecture. The 'C67x relies on the DSP board architecture to provide a flexible communication system with external DMA facilities to move data between DSP sub-systems.

Algorithm Distribution

If the software developer is used to mapping algorithms directly to standard nodal topologies as a method of distributing the algorithm (e.g. mesh or hyper-cube), the '21160 probably remains the processor of choice as it supports these physical topologies through Link Port connections. However, if a 'C67x platform is selected with a DSP RTOS that supports a virtual network between tasks, the standard topologies can still be implemented in abstraction from the hardware layer.

Loading Code from External Memory

If algorithms are run from internal memory, it is easy to predict the data I/O throughput for both processors. If algorithms must be loaded from external memory, a more careful analysis may be required.

The '21160 supports synchronous operation, burst transfers and asynchronous external memory. Exact throughput and access latency depend on the interface used. In general, code is transferred to internal memory under DMA control, unpacked and run from internal memory rather than being executed directly from external memory. The 48-bit wide instructions may be stored in packed format in 64-bit wide memory. This means that four instructions are loaded every three 66 MHz clock cycles (a load rate of 88 million instructions per second).

If algorithms are run from SBSRAM on the 'C67x, code is burst into internal memory at 667 MB/s (assuming a 167 MHz memory bus), with a three cycle initial latency to fill the pipeline of the external memory. As each fetch packet comprises eight 32-bit wide instructions, execute packets are loaded at between 21 and 167 million packets per second, depending on the number of arithmetic units being targeted. If these code accesses are interleaved with SDRAM data accesses, for example, prediction becomes complex due to paging and refresh cycle latencies, and performance is poor. As is the case with the '21160, code is generally not executed from external memory due to performance degradation.
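The code-loading rates quoted above can be checked with the short Python sketch below. The variable names are illustrative. The range of execute-packet rates rests on an assumption not spelled out in the text: that a fetch packet holds between one execute packet (all eight instructions issue in parallel) and eight (each instruction issues alone).

```python
from fractions import Fraction

# '21160: 48-bit instructions packed into 64-bit wide external memory.
# Three 64-bit words (192 bits) hold exactly four 48-bit instructions.
EXT_CLOCK_MHZ = 66
instr_per_cycle = Fraction(3 * 64, 48) / 3      # 4 instructions per 3 cycles
adsp_load_rate_mips = instr_per_cycle * EXT_CLOCK_MHZ

# 'C67x: a fetch packet is eight 32-bit instructions, i.e. 32 bytes.
BURST_MB_S = 667
fetch_packet_bytes = 8 * 32 // 8
fetch_packets_m_s = BURST_MB_S / fetch_packet_bytes

# Assumed bounds: one to eight execute packets per fetch packet.
min_execute_packets_m_s = fetch_packets_m_s * 1
max_execute_packets_m_s = fetch_packets_m_s * 8

print(f"'21160 load rate: {adsp_load_rate_mips} million instructions/s")
print(f"'C67x fetch packets: {fetch_packets_m_s:.1f} million/s")
print(f"'C67x execute packets: {min_execute_packets_m_s:.0f} to "
      f"{max_execute_packets_m_s:.0f} million/s")
```

These are burst rates only; the three-cycle initial latency and any interleaved data accesses reduce the achieved figures.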
For large algorithms, it is more efficient to run the processor with the cache enabled, allowing execution from internal memory.

Inter-Processor Messaging

The efficient passing of semaphores and low-latency messages is integral to any multiprocessing system. The '21160 supports these through a multiprocessor memory space within a cluster for broadcasting of messages, and through Link Port connections between clusters and DSP boards for point-to-point connections. The 'C67x relies on external resources provided on the DSP board. For example, Spectrum includes DPRAM and QPRAM in the dual and quad 'C67x implementations of the 'C6x architecture. This memory connects directly to the external bus of all the processors and provides a low-latency path between subsystems. Of course, interrupts provide the lowest latency mechanism for inter-processor signaling and synchronization. Whether a '21160 or 'C67x is selected, ensure that the DSP carrier board supports inter-processor interrupts.

Pumping Data In... and Out

By definition, DSP applications are required to move digitized waveforms in and out of the system. Due to the diverse nature of real-world signals, this data varies in bandwidth, resolution and number of channels, and it is impossible to generalize the I/O processing requirements. Let us consider a few "typical" scenarios:

Single Processor as Target

It is safe to say that if a single DSP is the target of all input data, system considerations are similar whether you select a 'C67x or a '21160 processor as the DSP. Assuming that the application can run from internal memory, a 'C67x may have a performance edge over the '21160 in managing a single high bandwidth stream (668 MBytes/s vs. 528 MBytes/s external port throughput). However, it could be argued that the larger internal memory capacity of the '21160 will negate this advantage due to its capacity to handle larger data blocks. Assuming the data can be made available on Link Ports, the '21160 is more effective at managing multiple medium-to-high bandwidth channels using its DMA resources. In applications where I/O data transfers from the external port to local DSP memory are interleaved with processor data accesses (local memory to internal registers), there is a trade-off between data block size and real-time response no matter which DSP is selected.

Multiprocessor Systems as Targets for High-Bandwidth Data

If the processing requirement of the application is excessive (due to high bandwidth), no matter which technology is selected, a multiprocessor solution is required. This section investigates these high-end applications.

Due to its inherent multiprocessor support, the '21160 is well suited to these applications. The network can easily be scaled to suit the I/O processing requirements. The availability of off-board Link Port connections makes scaling just as easy across multiple DSP boards as it is across DSPs on a single board. Additional features such as the capability to broadcast data throughout a cluster make distribution of the input data easy.

The 'C67x, unlike its floating-point predecessor, the TMS320C40, has no native multiprocessing support. It has been left to the DSP board vendors to innovate effective methods of achieving inter-processor communications.
Spectrum's 'C67x architecture is an example of this, using a specialized ASIC to bridge each DSP to a common PCI backbone. This allows for a distributed memory architecture with each DSP having the ability to pump data to the local memory of any other DSP on the same board. However, it is more difficult to distribute the data across multiple boards. It is considered poor practice to use the system bus (VME, PCI, CompactPCI, VXI etc.) for high bandwidth data, and consequently a number of I/O buses (e.g. FPDP and Raceway) support multiple slave DSP boards networked to an I/O master. The I/O-bus to DSP carrier board connection is often implemented using open standard interconnects, e.g. PMC modules. If tighter coupling between I/O and DSPs is required, this may be achieved by connecting the I/O directly to the external memory bus via a local mezzanine, e.g. a Processor Expansion Module (PEM).

Whether a 'C67x or a '21160 is selected, there are numerous DSP network topologies available to support the I/O data flow requirements of most applications. For example, Spectrum's 'C67x architectures support PMC-based I/O streamed to all of the local processing nodes, and on the SHARC platforms, I/O can be easily distributed using Link Ports. This allows one to stream incoming data between two or more processors to distribute the processing load. In both cases, the throughput is limited by the performance of the local PCI bus rather than any DSP capabilities.

Multiprocessor Systems as Targets for Low-Bandwidth Data

There are two instances where applications require multi-DSP configurations with low data rates. Firstly, there are applications with computationally intensive algorithms where processing even a low-bandwidth data stream exceeds the capability of a single processor. Secondly, in applications with multiple I/O channels, it is often convenient to distribute the I/O processing across a network of DSPs.

In the first instance, the 'C67x may offer a better solution as it will reduce the number of DSPs required in the system due to the higher CPU core performance. In the second instance, with limited channel count, either a 'C67x or '21160 may be appropriate. Once the channel count demands a multi-board solution, the '21160 may be preferable due to its inherent support for inter-board communications through Link Ports.

Finally, both the '21160 and the 'C67x support two TDM serial ports. Most DSP systems vendors make these available to the user for direct connection to their I/O circuits.
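For the compute-bound case described in this section, a first-order processor count can be estimated from the sustained MFLOPS figures in the feature table. The Python sketch below is only a rough sizing aid under loud assumptions: the 2000 MFLOPS workload is hypothetical, the algorithm is assumed to scale perfectly across processors with no communication overhead, and the 'C6701 sustained range is the estimate quoted in the table rather than a measured figure.

```python
import math

def dsps_needed(required_mflops, sustained_mflops):
    """Smallest whole number of DSPs whose combined sustained rate meets the load."""
    return math.ceil(required_mflops / sustained_mflops)

# Sustained performance figures from the feature table (MFLOPS).
SUSTAINED_MFLOPS = {
    "'21160": 400,
    "'C6701 (low estimate)": 500,
    "'C6701 (high estimate)": 700,
}

required_mflops = 2000  # hypothetical application load, for illustration only

counts = {name: dsps_needed(required_mflops, rate)
          for name, rate in SUSTAINED_MFLOPS.items()}

for name, count in counts.items():
    print(f"{name}: {count} DSPs")
```

In a real system, inter-processor communication and memory effects would raise these counts, so the result is best read as a lower bound for each device.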

Which Technology Will Get My Application to Market First?

No matter which DSP you select, the majority of your design cycle will be spent developing software. If the support tools are good, your application will likely be a success even if you did not select the optimum DSP. If the tools are inappropriate, the best DSP will lose its advantage. It is worth considering how the tools support your specific application. In addition, the assistance provided by third parties should significantly reduce your time-to-market.

Multitasking Support

It is easy to conceive mapping multiple tasks to multiple DSPs in a '21160 network, especially if we consider a single task per processor. In simple pipelines or array processing applications, the '21160 may be the processor of choice due to its support for separate tasks or algorithms at each node of a multi-dimensional array.

However, many 'C67x (and '21160) applications may require multiple tasks multiplexed onto each DSP. In such cases, a DSP-based RTOS provides the developer with a scheduling kernel to simplify development. Some of these RTOS kernels (e.g. 3L's Diamond) have very low overhead and provide other features (e.g. inter-task communications independent of the underlying hardware).

The 'C67x processor will likely be the target of multi-instance applications (e.g. modems). Once again, development can be simplified through the selection of an appropriate RTOS to manage context switches and multiple data streams. Diamond and others (e.g. Eonic's Virtuoso) are likely to be supported on both platforms.

Third-Party Library Support

The availability of optimized function libraries (e.g. Imaging, Math and Signal Processing) allows developers to concentrate on their own applications rather than time-consuming hand coding of commonly used building blocks. It usually takes a year before third party library support for any DSP processor is available, and the 'C67x will probably be no exception.
Texas Instruments maintain an up-to-date web site with free code examples; this is a useful resource for new 'C67x developers. The '21160, being code compatible with the ADSP-2106x SHARC product range, already has library support from companies like Wideband Computers. Libraries that are optimized to take advantage of the SIMD instruction set of the '21160 will probably follow shortly.

Development Environments

Both Texas Instruments and Analog Devices supply a solid suite of DSP development tools. If there is any difference, it is in the way that TI's 'C67x tools focus on code optimization while the strength of ADI's Visual-DSP lies in its multiprocessor support. For example, the natural development methodology using TI tools is as follows:

1. Develop the application in C.
2. Write inner loops in linear assembly language.
3. Use the assembly optimizer to take full advantage of the VLIW architecture.

The assembly optimizer is the key to assisting the developer in gaining maximum code performance.

When developing code for the '21160, hand optimization will be required to take full advantage of the SIMD instruction set. The management of multiple tasks on different processors within a cluster is complex. However, Visual-DSP simplifies C code development in this multiprocessing environment through a sophisticated linker that supports shared memory and multiprocessor linking. Additionally, flexible overlay support allows the development of code that can be moved between overlays and non-overlay memory without rework.

Classifying DSPs By Application

Both the 'C67x and '21160 are inherently targeted at similar DSP market segments purely because they are both floating-point processors. These applications include those with wide dynamic ranges or poor signal-to-noise ratios. Examples include remote sensing and medical imaging, precision control and some communications applications.

It would be easy if we could classify the 'C67x or the SHARC according to application (e.g. DSP X works for Sonar and DSP Y is best for Medical Imaging). Unfortunately this is seldom possible. Let us take sonar as an example. Within sonar, we may get a simple replica correlation application running on a DSP connected to a single hydrophone and an alarm. A towed-array sonar system, on the other hand, may have a few hundred sonar pods feeding into a meshed array of DSPs running multiple beamforming algorithms. In the first instance, the best solution may be one or two 'C67x processors, while multiple '21160s may be more appropriate for the sonar array-processing application.

Floating-point DSPs are also selected as development platforms for fixed-point applications. This is due to the ease of coding during the proof-of-concept phase. In this light, both processors may be used as the springboard for any fixed-point application; as with floating-point, it is impossible to generalize here.

In conclusion, it is more appropriate to select the DSP platform according to the multiprocessing, I/O and support requirements discussed above than to attempt to classify the applications.

Selecting the Best Processor

Both Texas Instruments and Analog Devices are already planning future products: ADI has plans for next generation SHARC processors, and both have aggressive plans to increase the speed of the current range of processors. Whether or not a specific one of these technologies captures the floating-point DSP market in the long term remains to be seen. In all likelihood, they will continue to compete for the foreseeable future.

This paper has investigated various aspects of single and multiprocessor implementations using both the 'C67x and '21160 processors. While the 'C67x appears to have a performance edge in single processor implementations, the '21160 may gain the upper hand in multiprocessing applications. The most suitable platform depends on data flow, memory requirements, array topology and algorithm characteristics.

Epilogue

The more you contemplate floating-point DSPs, research floating-point DSPs, and talk to other people about floating-point DSPs, the tougher your decision becomes. To help you work your way through this mental quagmire, Spectrum Signal Processing has developed the on-line DSP System Evaluation Tool. This tool, based on the TI and ADI toolsets, questions you about your application's software architecture, code generation tools, system performance requirements and time-to-market requirements. The tool then allows you to provide a real code segment. The web server pipes this code through the respective DSP compilation and simulation tools and returns the number of cycles the code took to execute on each processor. Finally, based on this information and your other system-level answers, the program determines which processor is theoretically best suited to your application.
