HYPERTHREADING
DEFINITION:
THREAD: A thread is a program fragment that the multitasking operating system assigns for execution to one of the processors of a multiprocessor hardware system. Threads are sequences of related instructions, or tasks running independently, that together make up a program.
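To make the definition concrete, here is a minimal sketch in Python (the language is our assumption; the paper names none) of a program made up of several threads, each running its own sequence of instructions independently:

```python
import threading

results = []
lock = threading.Lock()

def worker(name):
    # Each thread executes this sequence of related instructions
    # independently of the others.
    with lock:
        results.append(name)

# Four threads together make up the program.
threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [0, 1, 2, 3]
```

On a Hyper-Threaded or multiprocessor machine the operating system may run these threads on different logical processors; on a single processor they are time-sliced.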
ADVANTAGES:
If a thread suffers many cache misses, the other thread(s) can continue, taking advantage of the otherwise unused computing resources; this can lead to faster overall execution, since those resources would have been idle if only a single thread were executing.
If a thread cannot use all the computing resources of the CPU (because its instructions depend on each other's results), running another thread keeps those resources from sitting idle.
If several threads work on the same set of data, they can share its cache, leading to better cache usage and easier synchronization on its values.
DISADVANTAGES:
Multiple threads can interfere with each other when sharing hardware resources such as caches or translation lookaside buffers (TLBs).
Execution times of a single thread are not improved and can even be degraded, including when only one thread is executing. This is due to slower clock frequencies and/or the additional pipeline stages needed to accommodate the thread-switching hardware.
Hardware support for multithreading is more visible to software, and thus requires more changes to both application programs and operating systems than multiprocessing does.
11070220
2
Hyper-Threading is a feature of certain Pentium 4 chips that makes one physical CPU appear as two logical CPUs. It uses additional registers to overlap two instruction streams in order to achieve an approximate 30% gain in performance. Multithreaded applications take advantage of the Hyper-Threaded hardware as they would on any dual-processor system; however, the performance gain cannot equal that of a true dual-processor system.
AN OVERVIEW:
The main difference between the execution environment provided by the Xeon HT processor and that provided by two traditional single-threaded processors is that HT shares certain processor resources: there is only one execution engine, one on-board cache set, and one system bus interface. This means that the logical processors on an HT processor must compete for use of these shared resources. As a result, an HT processor will not provide the same performance capability as two similarly equipped single-threaded processors.
The two logical processors on an HT processor are treated equally with respect to access to the shared resources. In this paper, we refer to the logical processors on an HT processor, in order of use, as the first and second logical processors.
Windows XP and Windows .NET Server include generic identification and support for IA-32 processors that implement HT, using the Intel-defined CPUID instruction identification mechanism. However, this support is not guaranteed for processors that have not been tested with these operating systems.
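The operating system's view of logical processors can be inspected from user code; the Python sketch below (our choice of language) reads the logical processor count, which on an HT system is typically twice the number of physical cores:

```python
import os

# os.cpu_count() reports *logical* processors, i.e. what the OS sees
# after HT doubles each physical core.
logical = os.cpu_count()
print(f"OS reports {logical} logical processor(s)")
# Distinguishing physical cores from HT logical processors requires
# platform-specific mechanisms (e.g. the CPUID instruction on IA-32),
# which the Python standard library does not expose.
```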
SMT processors may support more than two logical processors in the future. However, the
discussions and examples in this white paper assume the use of two logical processors, as
used in the Xeon family of processors.
_______________________________________________
BASIC WORKING OF HYPERTHREADING:
1. When executing different threads, the processor must "know" which instructions belong to which thread, so there is a mechanism that tags instructions accordingly.
2. Given the small number of general-purpose registers in the x86 architecture (8 in all), each thread cannot simply be given its own dedicated architectural set. This limitation is evaded by register renaming: there are far more physical registers than logical ones. The Pentium III has 40; the Pentium 4 evidently has more (according to unconfirmed information, 128).
3. It is also known that when several threads need the same resources, or one of the threads is waiting for data, the "pause" instruction should be used to avoid a performance drop. Naturally, this requires recompilation of the programs.
4. Sometimes executing several threads can worsen performance. For example, because the L2 cache is not extendable, active threads competing to load the cache may cause constant eviction and reloading of data in the L2 cache.
5. Intel states that the gain can reach 30% when programs are optimized for this technology (or, rather, Intel states that on today's server programs and applications the measured gain is up to 30%). That is a decent reason to optimize.
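The "pause"-based spin-wait described above cannot be written directly in a high-level language (the x86 `pause` instruction is reached through assembly or compiler intrinsics), but its intent, yielding shared execution resources while polling, can be sketched in Python:

```python
import threading
import time

ready = threading.Event()

def spin_wait():
    # Polling loop: yield on every iteration instead of monopolizing
    # shared execution resources, which is the role `pause` plays in a
    # hardware thread's spin loop.
    while not ready.is_set():
        time.sleep(0)  # cooperative yield, a software analogue of `pause`

waiter = threading.Thread(target=spin_wait)
waiter.start()
ready.set()   # the resource the waiter polls for becomes available
waiter.join()
print("spin-wait finished")
```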
When execution resources would not be used by the current task in a processor without
hyper-threading, and especially when the processor is stalled, a hyper-threading
equipped processor can use those execution resources to execute another scheduled
task. (The processor may stall due to a cache miss, branch misprediction, or data
dependency.)
This technology is transparent to operating systems and programs. All that is required
to take advantage of hyper-threading is symmetric multiprocessing (SMP) support in
the operating system, as the logical processors appear as standard separate processors.
Programs are made up of execution threads. These threads are sequences of related
instructions. Earlier, most programs consisted of a single thread. The operating systems
in those days were capable of running only one such program at a time. The result was
that your PC would freeze while it printed a document or a spreadsheet. The system
was incapable of doing two things simultaneously. Innovations in the operating system
introduced multitasking in which one program could be briefly suspended and another
one run. By quickly swapping programs in and out in this manner, the system gave the
appearance of running the programs simultaneously. However, the underlying
processor was, in fact, at all times running just a single thread.
By the beginning of this decade, processor design had gained additional execution
resources (such as logic dedicated to floating-point and integer math) to support
executing multiple instructions in parallel. Intel saw an opportunity in these extra
facilities. The company reasoned it could make better use of these resources by
employing them to execute two separate threads simultaneously on the same processor
core. Intel named this simultaneous processing Hyper-Threading Technology and
released it on the Intel Xeon processors in 2002. According to Intel benchmarks,
applications that were written using multiple threads could see improvements of up to
30% by running on processors with HT Technology. More important, however, two
programs could now run simultaneously on a processor without having to be swapped
in and out (See Figure 3.) To induce the operating system to recognize one processor as
two possible execution pipelines, the new chips were made to appear as two logical
processors to the operating system.
The diagram above contrasts three generations of processors. The first shows a single core that handled one program at a time; multitasking was not possible. The second gives an insight into more recent processors, which create a virtual second processor to handle many threads or programs simultaneously. The third depicts a new-generation multicore processor in action, executing 4 threads simultaneously; it provides speed coupled with extreme multitasking ability.
_________________________________________________________________________
COMPONENTS OF HYPERTHREADING:
The return stack predictor is duplicated for accurate tracking of call/return pairs.
THREADING ALGORITHMS:
Time-slicing: The processor switches between threads at fixed time intervals. This incurs high overhead, especially if one of the threads is in a wait state.
Switch-on-event: The processor switches tasks on long pauses; while a thread waits for data from a relatively slow source, CPU resources are given to other threads.
Multiprocessing: The load is distributed over many processors. This adds extra hardware cost.
Simultaneous multithreading: Multiple threads execute on a single processor without switching. This is the basis of Intel's Hyper-Threading technology.
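The time-slicing model above can be illustrated with a toy scheduler in Python (our choice of language): generator-based "threads" each receive one fixed slice per turn until they finish:

```python
from collections import deque

def run_round_robin(tasks):
    """Give each task one fixed time slice per turn (time-slicing)."""
    trace = []
    queue = deque(tasks)
    while queue:
        name, task = queue.popleft()
        try:
            next(task)                   # one slice of work
            trace.append(name)
            queue.append((name, task))   # back of the queue
        except StopIteration:
            pass                         # task finished; drop it
    return trace

def work(steps):
    for _ in range(steps):
        yield

trace = run_round_robin([("A", work(2)), ("B", work(2))])
print(trace)  # ['A', 'B', 'A', 'B']
```

The expense of this scheme is visible here: a task in a wait state would still consume its slice on every turn.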
PROCESSORS:
Here are some multitasking workloads that are just too much for a single logical processor. The benchmarks were taken using popular software packages that are already multithreaded, and they show the percentage increase in performance.
DESKTOP:
NOTEBOOKS:
Intel® chipsets:
In this design, each core has its own execution pipeline, and each core has the resources required to run without blocking resources needed by the other software threads.
While the example in Figure 2 shows a two-core design, there is no inherent limit on the number of cores that can be placed on a single chip. Intel has committed to shipping dual-core processors in 2005 and will add more cores in the future. Mainframe processors today already use more than two cores, so there is precedent for this kind of development.
FIGURE 5:
The multi-core design enables two or more cores to run at somewhat slower speeds and at
much lower temperatures. The combined throughput of these cores delivers processing
power greater than the maximum available today on single-core processors and at a much
lower level of power consumption. In this way, Intel increases the capabilities of server platforms as predicted by Moore's Law, without pushing the outer limits of physical constraints.
Desktop                                       Laptop
Code name      Core           Released       Code name    Core           Released
Conroe XE      dual (65 nm)   Jul 2006       Merom XE     dual (65 nm)   Jul 2007
Kentsfield XE  quad (65 nm)   Nov 2006       Penryn XE    dual (45 nm)   Jan 2008
Yorkfield XE   quad (45 nm)   Nov 2007       Penryn XE    quad (45 nm)   Aug 2008
ADVANTAGES:
The advantages of hyper-threading are:
Improved reaction and response time, faster speed, and higher efficiency.
The largest boost in performance will likely be noticed while running CPU-intensive processes such as antivirus scans, high-end games, ripping/burning media (which requires file conversion), or searching through folders.
DISADVANTAGES:
To take advantage of hyper-threading performance, serial execution cannot be used.
Integrating multiple cores on a chip drives production yields down, and the chips are more difficult to manage thermally than lower-density single-chip designs. Intel has partially countered the yield problem by building its quad-core designs from two dual-core dies in a single package, so any two working dual-core dies can be used, as opposed to producing four cores on a single die and requiring all four to work.
Hyper-threading suffers from a serious security flaw that permits local information disclosure, including allowing an unprivileged user to steal an RSA private key being used on the same machine. Administrators of multi-user systems are strongly advised to disable Hyper-Threading immediately; single-user systems (i.e., desktop computers) are not affected. The flaw, disclosed in 2005, allows one thread to glean information from another through the shared cache despite having no access to the other thread's memory space.
Two processing cores sharing the same system bus and memory bandwidth also limit the real-world performance advantage.
SYMMETRIC MULTIPROCESSING:
Symmetric multiprocessing (SMP) is a computer architecture that provides fast performance by making multiple CPUs available to complete individual processes simultaneously (multiprocessing). Unlike asymmetric multiprocessing, any idle processor can be assigned any
task, and additional CPUs can be added to improve performance and handle increased loads.
A variety of specialized operating systems and hardware arrangements are available to
support SMP. Specific applications can benefit from SMP if the code allows multithreading.
SMP systems allow any processor to work on any task no matter where the data for that task
are located in memory; with proper operating system support, SMP systems can easily move
tasks between processors to balance the workload efficiently.
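The "any processor can work on any task" property can be sketched as a greedy balancing rule in Python (a simplified model of our own, not an actual OS scheduler): each task goes to whichever processor currently carries the least load:

```python
def balance(task_costs, n_cpus):
    """Greedy sketch of SMP load balancing: longest tasks first,
    each assigned to the currently least-loaded processor."""
    loads = [0] * n_cpus
    assignment = [[] for _ in range(n_cpus)]
    for cost in sorted(task_costs, reverse=True):
        cpu = loads.index(min(loads))    # any idle/least-loaded CPU will do
        assignment[cpu].append(cost)
        loads[cpu] += cost
    return assignment, loads

assignment, loads = balance([7, 5, 3, 3, 2], n_cpus=2)
print(loads)  # [10, 10] -- the 20 units of work split evenly
```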
____________________________________________________________________________________
SMT is one of the two main implementations of multithreading, the other form being
temporal multithreading. In temporal multithreading, only one thread of instructions
can execute in any given pipeline stage at a time. In simultaneous multithreading,
instructions from more than one thread can be executing in any given pipeline stage at
a time. This is done without great changes to the basic processor architecture: the main
additions needed are the ability to fetch instructions from multiple threads in a cycle,
and a larger register file to hold data from multiple threads. The number of concurrent
threads can be decided by the chip designers, but practical restrictions on chip
complexity have limited the number to two for most SMT implementations.
Because the technique is really an efficiency solution and there is inevitable increased
conflict on shared resources, measuring or agreeing on the effectiveness of the
solution can be difficult. Some researchers have shown that the extra threads can be
used to proactively seed a shared resource like a cache, to improve the performance of
another single thread, and claim this shows that SMT is not just an efficiency solution.
Others use SMT to provide redundant computation, for some level of error detection
and recovery.
However, in most current cases, SMT is about hiding memory latency, efficiency and
increased throughput of computations per amount of hardware used.
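Latency hiding can be demonstrated in software with a small Python experiment (our choice of language; note that Python threads only overlap blocking waits, which happens to match the stalled-task case): two tasks that each stall for 0.2 s finish together in about 0.2 s instead of 0.4 s:

```python
import threading
import time

def stalled_task():
    time.sleep(0.2)  # stand-in for a long stall (e.g. a memory or I/O wait)

start = time.perf_counter()
threads = [threading.Thread(target=stalled_task) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start
# While one task is stalled, the other runs, so the stalls overlap.
print(f"two stalled tasks finished in {elapsed:.2f}s (vs ~0.4s serially)")
```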
____________________________________________________________________
Agglomeration: In the third stage of the parallel algorithm design process, we move from the abstract toward the concrete.
We revisit decisions made in the partitioning and communication phases with a view
to obtaining an algorithm that will execute efficiently on some class of parallel
computer. In particular, we consider whether it is useful to combine, or agglomerate,
tasks identified by the partitioning phase, so as to provide a smaller number of tasks,
each of greater size. We also determine whether it is worthwhile to replicate data
and/or computation.
Mapping : In the fourth and final stage of the parallel algorithm design process, we
specify where each task is to execute. This mapping problem does not arise on
uniprocessors or on shared-memory computers that provide automatic task scheduling.
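A minimal Python sketch of the agglomeration step (our illustration, not code from the source): many fine-grained tasks are combined into fewer, larger tasks of roughly equal size:

```python
def agglomerate(tasks, chunk_size):
    """Combine fine-grained tasks into fewer, larger ones, reducing
    scheduling and communication overhead (the agglomeration stage)."""
    return [tasks[i:i + chunk_size]
            for i in range(0, len(tasks), chunk_size)]

# Ten tiny tasks become three larger ones.
chunks = agglomerate(list(range(10)), chunk_size=4)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```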
On the other hand, on the server side, multicore processors are ideal because they allow
many users to connect to a site simultaneously and have independent threads of execution.
This allows for Web servers and application servers that have much better throughput.
FIGURE 5: Depicts the effect of HTT on processor speed and performance, as seen in the Windows Task Manager.
FUTURE:
Older Pentium 4 based CPUs use hyper-threading, but the newer Pentium M based
cores Merom, Conroe, and Woodcrest do not. Hyper-threading is a specialized form of
simultaneous multithreading (SMT).
The Intel Atom is an in-order single-core processor with hyper-threading, for low
power mobile PCs and low-price desktop PCs.
Diagram of a generic dual core processor, with CPU-local level 1 caches, and a shared,
on-die level 2 cache.
HT Technology enables gaming enthusiasts to play the latest titles and experience ultra-realistic effects and game play, while multimedia enthusiasts can create, edit, and encode graphically intensive files with background applications such as a virus scan running, all without slowing down. Intel brought hyper-threading back with the Nehalem (Core i7), released in November 2008: Nehalem contains 4 cores, effectively scales to 8 threads, and runs at speeds above 3 GHz, providing outstanding performance.
It improves CPU-intensive processes such as virus scans, ripping and burning CDs and DVDs, multimedia applications, and video quality.
It provides faster response times for Internet and e-business applications, enhancing customer experiences.