Vous êtes sur la page 1sur 20

Computer Vision and Image Understanding 98 (2005) 104123 www.elsevier.

com/locate/cviu

An embedded system for an eye-detection sensor


Arnon Amira,*, Lior Zimetb, Alberto Sangiovanni-Vincentellic, Sean Kaoc
b a IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, USA University of California Santa Cruz, 1156 High Street, Santa Cruz, CA 95064, USA c Department of EECS, University of California Berkeley, CA 94720, USA

Received 27 July 2004; accepted 27 July 2004 Available online 1 October 2004

Abstract Real-time eye detection is important for many HCI applications, including eye-gaze tracking, autostereoscopic displays, video conferencing, face detection, and recognition. Current commercial and research systems use software implementation and require a dedicated computer for the image-processing taska large, expensive, and complicated-to-use solution. In order to make eye-gaze tracking ubiquitous, the system complexity, size, and price must be substantially reduced. This paper presents a hardware-based embedded system for eye detection, implemented using simple logic gates, with no CPU and no addressable frame buers. The image-processing algorithm was redesigned to enable highly parallel, single-pass image-processing implementation. A prototype system uses a CMOS digital imaging sensor and an FPGA for the image processing. It processes 640 480 progressive scan frames at a 60 fps rate, and outputs a compact list of sub-pixel accurate (x, y) eyes coordinates via USB communication. Experimentation with detection of human eyes and synthetic targets are reported. This new logic design, operating at the sensors pixel clock, is suitable for single-chip eye detection and eye-gaze tracking sensors, thus making an important step towards mass production, low cost systems. 2004 Published by Elsevier Inc.
Keywords: Eye detection; Eye-gaze tracking; Real-time image processing; FPGA-based image processing; Embedded systems design

Corresponding author. Fax: +1 408-927-3033. E-mail addresses: arnon@almaden.ibm.com (A. Amir), liorz@zoran.com (L. Zimet), alberto@eecs. berkeley.edu (A. Sangiovanni-Vincentelli), kaos@ 1077-3142/$ - see front matter 2004 Published by Elsevier Inc. doi:10.1016/j.cviu.2004.07.009

A. Amir et al. / Computer Vision and Image Understanding 98 (2005) 104123

105

1. Introduction Eye detection, the task of nding and locating eyes in images, is used in a great number of applications. One example is eye-gaze tracking systems, used in HCI for a variety of applications [11,10,5,21,22], such as pointing and selection [12], activating commands (e.g. [19]), and combinations with other pointing devices [26]. Other examples include real-time systems for face detection, face recognition, iris recognition, eye contact correction in video conferencing, autostereoscopic displays, facial expression recognition, and more. All those applications require eye detection for their work. Consider for example commercial eye-gaze tracking systems. Current implementations of eye-gaze tracking systems are software driven and require a dedicated high-end PC for image processing. Yet, even high-end PCs can hardly support the desired high frame rates (about 200400 fps) and high-resolution images required to maintain accuracy in a wide operational eld of view. Even in cases where the performance requirement are met, these solutions are bulky and prohibitively expensive thus limiting their reach in a very promising market. Any CPU-based implementation of real-time image-processing algorithm has two major bottlenecksdata transfer bandwidth and sequential data processing rate. First, video frames are captured, from either an analog or a digital camera, and transferred to the CPU main memory. Moving complete frames around imposes a data ow bandwidth bottleneck. For example, an IEEE 1394 (Firewire) channel can transfer data at 400 Mbits/s, or 40 MPixel/s at 10 bits/Pixel. At a 60 fps, with frame size of 640 480 pixels, our sensors pixel rate is already 30 MPixel/s. Faster frame rates or larger frame size would require special frame capturing devices. After a frame is grabbed and moved to memory, the CPU can sequentially process the pixels. A CPU adds large overhead to the actual computation. Its time splits among moving data between memory and registers, applying arithmetic and logic operators on the data, and keeping the algorithm ow, handling branches, and fetching code. For example, if each pixel is accessed only 10 times along the entire computation (e.g., applying a single 3 3 lter), it would require the CPU to run at least 10 times faster than the input pixel rate, and even faster due to CPU overhead. Hence, the CPU has to run much faster than the sensor pixel rate. Last, a relatively long latency is imposed by the need to wait for an entire frame to be captured before it is processed. Device latency is extremely noticeable in interactive user interfaces. The advantage of a software-based implementation lies in its exibility and short implementation time. A large quantities hardware implementation would certainly reduce unit cost and size, but at the price of increased design time and rigidity. An ecient hardware implementation should oer the same functionality and exploit its main characteristic: concurrency. A simple mapping of the software implementation into hardware components falls short of the potential benets oered by a hardware solution. Rather, the algorithms and data structures have to be redesigned, implementing the same mathematical computation using maximum concurrency. Our approach to embedded system design is based on a methodology [20,2] that captures the design at a high level of abstraction where maximum concurrency may

106

A. Amir et al. / Computer Vision and Image Understanding 98 (2005) 104123

be exposed. This representation is then mapped into a particular architecture and evaluated in terms of cost, size, power consumption, and performance. The methodology oers a continuous trade-o among possible implementations including, at the extremes, a full software and a full hardware realization. This paper presents a prototype of an embedded eye-detection system implemented completely in hardware. The novelty of this paper is the redesign of a sequential pupil detection algorithm into a synchronous parallel algorithm, capable of fully exploiting the concurrency of hardware implementations. The algorithm includes computation of regions (connected components), shape moments and simple shape classicationall of which are computed within a single pass over the frame, driven by the real-time pixel-clock of the imaging sensor. The output is a list of pupil location and size, passed via low-bandwidth USB connection to an application. In addition to the elimination of a dedicated PC and thus major saving in cost, the system has virtually no limit in processing pixel rate, bounded only by the sensors output pixel rate. Minimal latency is achieved as pixels are processed while being received from the sensor. A prototype was built using a eld programmable gate array (FPGA). It processes 640 480 progressive scan frames at 60 fpsa pixel rate that is 16 times faster than the one reported elsewhere [16,15]. By extending the algorithm to further detect glints and by integrating this logic into imaging sensors VLSI, this approach can solve the speed, size, complexity, and cost of current eye-gaze tracking systems.

2. Prior work on eye detection Much of the eye-detection literature is associated with face detection and face recognition (see, e.g. [9,27]). Direct eye-detection methods search for eyes without prior information about face location, and can further be classied into passive and active methods. Passive eye detectors work on images taken in natural scenes, without any special illumination and therefore can be applied to movies, broadcast news, etc. One such example exploits gradient eld and temporal analysis to detect eyes in gray-level video [14]. Active eye-detection methods use special illumination and thus are applicable to real-time situations in controlled environments, such as eye-gaze tracking, iris recognition, and video conferencing. They take advantage of the retro-reection property of the eye, a property that is rarely seen in any other natural objects. When light falls on the eye, part of it is reected back, through the pupil, in a very narrow beam pointing directly towards the light source. When a light source is located very close to a camera focal axis (on-axis light), the captured image shows a very bright pupil [11,25]. This is often seen as the red-eye eect when using a ashlight in stills photography. When a light source is located away from the camera focal axis (o-axis light), the image shows a dark pupil. This is the reason for making the ashlight unit pop up in many camera models. However, neither of these lights allow for good discrimination of pupils from other objects, as there are also other bright and dark objects in the scene that would generate pupil-like regions in the image.

A. Amir et al. / Computer Vision and Image Understanding 98 (2005) 104123

107

2.1. Eye detection by frame subtraction using two illuminations The active eye-detection method used here is based on the subtraction scheme with two synchronized illuminators [23,6,16]. At the core of this approach, two frames are captured, one with on-axis illumination and the other with o-axis illumination. Then the illumination intensity of the o-axis frame is subtracted from the intensity of the on-axis frame. Most of the scene, except pupils, reects the same amount of light under both illuminations and subtracts out. Pupils, however, are very bright in one frame and dark in the other, and thus are detected as elliptic regions after applying a threshold on the dierence frame. False positives might show due to specularities from curved shiny surfaces such as glasses, and at the boundaries of moving objects due to the delay between the two frames. Tomono [23] used a modied three-CCD camera with two narrow band pass lters on the sensors and a polarized lter with two illuminators in the corresponding wavelengths, the on-axisbeing polarized. This conguration allowed for simultaneous frames capture. It overcame the motion problem and most of the specularities problems and assist the glint detection. However, it required a more expensive, three-CCD camera. The frame-subtraction eye-detection scheme has become popular in dierent applications because of its robustness and simplicity. It was used for tracking eyes, eyebrow and other facial features [8,13], for facial expression recognition [7], tracking head nod and shake, recognition of head gesture [4], achieving eye contact in video conferencing [24], stereoscopic displays [18], and more. The increasing popularity and multitude of applications suggest that an eye-detection sensor would be of great value. An embedded eye-detection system would simplify the implementation of all of these applications. It would eliminate the need for high-bandwidth image data transfer between imaging sensor and the PC, reduce the high CPU requirements for software-based image processing, and would dramatically reduce their cost. This eye-detection algorithm, selected here for this challenging new type of implementation, is fairly simple and robust. While a software implementation of this algorithm is straightforward, a logic circuit-based implementation, with no CPU, operating at a processing rate of single cycle per pixel provides enough interesting challenges to deal with. Alternative algorithms might have other advantages, however, the simplicity of the selected algorithm made it suitable for a logic-only implementation, without use of any CPU/DSP or microcode of any kind. Some of the lessons learned here may help to implement other, more complicated image-processing algorithms in hardware as well. 2.2. The basic sequential algorithm The frame-subtraction algorithm for eye detection is summarized in Fig. 1. A detailed software-based implementation was reported elsewhere [16]. The on-axis and o-axis light sources alternate, synchronized with the camera frames. After capturing a pair of frames, subtracting the frame associated with the o-axis from the one associated with the on-axis light and thresholding the result, a binary image is obtained.

108

A. Amir et al. / Computer Vision and Image Understanding 98 (2005) 104123

Fig. 1. Processing steps of the basic sequential algorithm.

The binary image contains regions of signicant dierence. These regions are detected using a connected components algorithm [1]. The detected regions may include not only pupils but also some false positives, mainly due to object motion and specularities. The non-pupil regions are ltered in step 6, and the remaining pupil candidates are reported to the output. The rst ve steps of the algorithm are applied on frames. They consume most of the computational time and large memory arrays for frame buers. In step 5 the detected regions are represented by their properties, including area, bounding box, and rst-order moments. The ltering in step 6 rejects regions that do not look pupils. Pupils are expected to show as elliptic shapes in a certain size range. Typical motion artifacts show as elongated shapes of large size and are in general easy to distinguish from pupils.1 At the last step, the centroids and size of the detected pupils are reported. This output can be used by dierent applications. 3. Synchronous pupil detection algorithm To use a synchronous hardware implementation, it was necessary to modify the basic sequential algorithm of Section 2.2. While a software implementation is
1

More elaborate eye classication methods were applied in [3].

A. Amir et al. / Computer Vision and Image Understanding 98 (2005) 104123

109

straightforward, a hardware implementation using only registers and logic gates operating at pixel-rate clock is far more complicated. The slow clock rate requires any pixel operation to be performed in a single clock and any line operation to be nished within the number of clock cycles equivalent to the number of pixels per image line. In fact, the sensors pixel clock is the theoretically lowest possible processing clock. Using the sensors clock rate has advantage in synchronization, low cost hardware and potential to use much faster sensors. Fig. 2 shows a block diagram of the parallel algorithm. The input is a stream of pixels. There is only a single frame buer (delay), used to store the previous frame for the frame-to-frame subtraction. The output of the frame subtraction is stored in a line buer as it is being computed in a raster scan order. The output is a set of XY locations within the frame region of the centroid of the detected pupils. This set of XY locations is updated once per frame. Connected components are computed using a single-pass algorithm, similar to the region-coloring algorithm [1]. Here, however, the algorithm operates in three parallel stages, running in a pipeline on three lines: when line Li is present in the line buer it is scanned, and line components, or just components, are detected. At the same time, components from line Li 1 are being connected (merged) to components from line Li 2, and those line components are being updated in the frame regions table. The shape properties, including bounding box and moments (step 5, Fig. 1), are computed simultaneously with the computation of the connected components (step 4). This saves time and eliminates the need for an extra frame buer, as being used between step 4 and step 5 in Fig. 1. Consider the computation of a property f (s) of a region s. The region s is being discovered on a line-by-line basis and is composed of multiple line components. For each line object that belongs to s, say s1, the property f (s1) is computed and stored with the line object. When s1 is merged with another object, s2, their properties are combined to reect the properties of the combined region s = s1 [ s2. This type of computation can be recursively or iteratively applied to any shape property that can be computed from its value over parts of the shape. The property f (s) can be computed using the form: f s gf s1 ; f s2 ;

Fig. 2. System block diagram, showing the main hardware components and pipeline processing.

110

A. Amir et al. / Computer Vision and Image Understanding 98 (2005) 104123

where g (,) is an appropriate aggregation function for the property f. For example, aggregation of moments is simply the sum, mij gmij ; mij mij mij , whereas s1 s2 s1 s2 s for the aggregation of normalized moments, the regions areas As1, As2 are needed: s  s1  s2  s1  s2 mij gmij ; mij As1 mij As2 mij =As1 As2 : The four properties which dene the bounding box, namely X min ; X max ; Y min ; Y max , s s s s are aggregated using, e.g., X min X min ; X min . s s1 s2 Properties that are computed here in this manner include area, rst-order moments, and bounding box. Other useful properties that may be computed in this manner are higher-order moments, average color, color histograms, etc. In the pipelined architecture, each stage of the pipeline performs the computation on the intermediate results from the previous stage, using the same pixel clock. The memory arrays in the processing stages keep the information of only one or two lines at a time. Additional memory arrays are used to keep components properties and components location, and to form the pipeline. To get the eect of bright and dark pupils in consecutive frames, the logic circuit drives control signals to the on-axis and o-axis infrared light sources. These signals are being generated from the frame clock coming from the image sensor. 3.1. Pixels subtraction and thresholding The digital video stream is coming from the sensor at a pixel rate of 30 Mpixels/s. The data are latched in the logic circuit and at the same time pushed into a FIFO frame buer. The frame buer acts as one full frame delay. Fig. 3 depicts the subtraction and threshold module operation. Since each consecutive pair of frame is processed, the order of subtraction changes at the beginning of every frame. The on-axis and o-axis illumination signals are used to coordinate the subtraction order. To eliminate the handling of negative numbers, a xed oset value is added to the gray levels of all pixels of the on-axis frame. The subtraction operation is done pixel-by-pixel and the result is then compared to a threshold that may be modied for dierent light conditions and noise environments. The binary result of the threshold operation is kept in a line buer array that is transferred to the Line Components Properties module on the next line clock.

Fig. 3. Subtraction and thresholding.

A. Amir et al. / Computer Vision and Image Understanding 98 (2005) 104123

111

3.2. Detecting line components and their properties Working at the sensor pixel clock, this module is triggered every line clock to nd the components properties in the binary line buer resulting from the subtraction and threshold module. Every black pixel in the line is part of a component, and consecutive black pixels are aggregated to one component. Several properties are calculated for each of these components: the starting pixel, xstart, the ending pixel, xend, and rst-order moments. The output data structure from this module is composed of several memory arrays that contain the components properties. For a 640 480 frame size, the arrays with the corresponding bits-width are: xstart (10 bits), xend (10 bits), M00 (6 bits), M10 (15 bits), and M01 (15 bits). Higher-order moments would require additional arrays. Preliminary ltering of oversized objects can be done in this stage using the information in M00, since it contains the number of pixels in the component. 3.3. Computing connected components The line components (and their properties) are merged with components of previous lines to form connected regions in the frame. The connect components module is triggered every line clock, and works on two consecutive lines. It receives the xstart and xend arrays and provides at the output a component identication number (ID) for every component and a list of merged components. Since the module works on pairs of consecutive lines, every line is processed twice in this module, rst with the line above it and then with the line below. After being processed in this module, the line information is no longer needed and is overwritten by the next line to be processed. Interestingly, there is never any representation of the boundary of a region throughout the entire process. As regions are scanned line-by-line through the pipeline, their properties are computed and held. But their shape is never stored, not even as a temporary binary image, as is a common practice with sequential implementations. Moreover, if regions or boundaries would need to be stored, then a large data structure, such as a frame buer, would be required. Filling such a data structure and revisiting it after the frame is processed would have caused unnecessary additional complexity and would increase system latency. By eliminating any explicit shape representation in our design, a region is fully detected and all of its properties are available just three line-clocks after it was output from the imaging sensor. If we compare this to a computer-based implementation, then by the time the computer would have nished grabbing a frame, this system would have already nished processing that frame and reported all the region centers via the USB line. The processing uses the xstart and xend information to check for neighboring pixels between two line components in the two lines. The following condition is used to detect all (four-) neighbors line components: xC start 6 xP end and xC end P xP start ; where the subscripts P, C refer to the previous and current lines, respectively. Examples for such regions are shown in Fig. 4. Using this condition, the algorithm checks

112

A. Amir et al. / Computer Vision and Image Understanding 98 (2005) 104123

Fig. 4. Four cases of overlap between two consecutive line components are identied by their startend location relationships. This merging operates on two line-component lists (as opposed to pixel lines).

for overlap between every component in the current line against all the components in the previous line. In software, this part can be easily implemented as a nested loop. In hardware, the implementation involves two counters with a synchronization mechanism to implement the loops. If a connected component is found, the next internal loop can start from the last connected component in the previous line, saving computation time. Since the module is reset every line clock and each iteration in the loop takes one clock, the number of components per line is limited to the square root of the number of pixels in a line. For a 640 480 frame size it can accommodate up to 25 objects per line. A more ecient, linear time merging algorithm exists, but its implementation in hardware was somewhat more dicult. A component from the current line that is connected to a component from the previous line is assigned with the same component ID as the one from the previous line. Components in the current line, which are not connected to any component in the previous line, are assigned with new IDs. In some cases, a component in the current line connects two or more components from the previous line. Fig. 5 shows an example of three consecutive lines with dierent cases of connected components, and a merge. Obviously, the merged components should all get the same ID. The middle component, which is initially assigned with ID = 3, would no longer need that ID after the merge. This ID can then be re-used. This is important in order to minimize the required size of arrays for component properties. In order to keep the ID list manageable and within the size constraint in the hardware, it was implemented as a linked-list (in hardware). The list pointer is updated on every issue of a new ID or whenever a merge occurred. The ID list is kept non-fragmented. To support merging components, the algorithm keeps a merge list that is used in the next module in the pipeline. The merge list has an entry for each component with the component ID to be merged with, or zero if no merge is required for this component. The actual region merging is handled by the Regions List module.

Fig. 5. An example in which two regions are merged into one at the time of processing the lower pair of lines. The middle region, which was rst assigned with ID = 3, is merged with the region of ID = 1, their moments and bounding box properties are merged, and ID 3 is recycled.

A. Amir et al. / Computer Vision and Image Understanding 98 (2005) 104123

113

3.4. The regions list module This module maintains the list of regions in the frame. It is triggered every line clock, and receives lists of components IDs, of merged components, and the moments of components that were processed in the connected components module. It keeps a list of regions and their corresponding moments for the entire frame, along with a valid bit for each entry. Since regions are forming as the system advances through lines of the frame, the moments from each line are being updated to the correct entry according to the ID of the components. Moments of components with the same ID as the region are simply added. When two regions merge, their moments of one region are similarly added to those of the other region, the result is stored in the region with the lower ID, and the other record of the other region is cleared and is made available for a new region to come. The list of components of the entire frame and their moments is ready at the time of receiving the last pixel of the frame. 3.5. Reporting the detected regions Once the list of moments of the entire frame is ready, the system checks the shape properties of each region and decides if a region is a valid pupil. The centroid of a region is then computed using: Xc M 10 M 00 Yc M 10 : M 00

This module is triggered every frame clock. In our current implementation it copies the list of moments computed in the previous module, to allow concurrent processing of the next frame, and prepares a list of regions centroids. This could also be done per each nished region, while still processing the rest of the frame. This compact regions list can now be easily transferred to an application system, such as a PC. For a frame of size 640 480 pixels, any pixel location can be expressed by 10 + 9 = 19 bits of information. However, since the division operation provides us with sub-pixel accuracy, we add 3 bits to each coordinate, using xed precision sub-pixel location format. A universal serial bus (USB) connection is used to transfer the pupil locations and size to a PC. The Valid bit for the centroids list is updated every time the list is ready. This bit is poled by the USB to ensure correct locations read from the hardware.

4. Simulink model The functionality of the system is captured at an abstract level using a rigorously dened model of computation. Models of computation are formalisms that allow capturing important formal properties of algorithms. For example, control algorithms can often be represented as nite-state machinesa formal model that has been the object of intense study over the years. Formal verication of properties such as safety, deadlock free, and fairness is possible without resorting to expensive sim-

114

A. Amir et al. / Computer Vision and Image Understanding 98 (2005) 104123

ulation runs on empirical models. Analyzing the design at this level of abstraction, represented by the Simulink step in Fig. 6, allows nding and correcting errors early in the design cycle, rather than waiting for a full implementation to verify the system. Verifying a system at the implementation level often results in long and tedious debugging sessions where functional errors may be well hidden and dicult to disentangle from implementation errors. In addition to powerful verication techniques, models of computation oer a way of synthesizing ecient implementations. In the case of nite-state machines, a number of tools are available on the market to either generate code or to generate Register Transfer Level descriptions that can be then mapped eciently into a library of standard cells or other hardware components. 4.1. Synchronous dataow model This system processes large amounts of data in a pipeline and is synchronized with a pixel clock. A synchronous dataow model of computation is the most appropriate for this type of application. A synchronous dataow is composed of actors or executable modules. Each module consumes tokens on its inputs and produces tokens on its outputs. Actors are connected by edges and transfer tokens in a rst-in rstout manner. Furthermore, actors may only re when there is a sucient number of tokens on all inputs for a particular actor. All of the major system modules described in Section 3 are modeled as actors in the dataow. Data busses, which convey data between modules, are modeled as edges connecting the actors. Several engineering design automation (EDA) tools can be used to analyze and simulate this model of computation. We decided to use Simulink over other tools such as Ptolemy, System C, and the Metropolis meta-model because of its ease of use, maturity, and functionality. Simulink provides a exible tool that contains a great deal of built-in support as well as versatility. The Matlab toolbox suite, that includes arithmetic libraries as well

Fig. 6. First the system behavior model is designed and tested. Then the hardware/software architecture is set, and the functions are mapped onto the selected architecture. Last, this model is rened and converted to actual system implementation.

A. Amir et al. / Computer Vision and Image Understanding 98 (2005) 104123

115

as powerful numerical and algebraic manipulation, is linked into Simulink, thus providing Simulink with methods for eciently integrating complex sequential algorithms into pipeline models and allowing synchronization through explicit clock connections. The image subtraction and thresholding processes corresponding pixels from the current and previous frames on a pixel-by-pixel basis. Input tokens for this actor represent pixels from the sensor and the frame buer, while output tokens represent thresholded pixels. The connected components module accumulates regions data over image lines. This modules actor res only when it has accumulated enough pixel data for a complete line. It outputs a single token per line, containing all the data structures for the connected components list, including object ID tags, components moments, and whether they are connected to any other components, processed on previous lines. These line tokens are passed to a module that updates the properties of all known regions in the frame. Input is received on a line-by-line basis from the connect components module and regions in the list are updated accordingly. This module accumulates the data about possible pupil regions over an entire frame of data. Thus, one output token is generated per frame. The frames regions data are then passed to an actor that calculates the XY centroid of the regions using a Matlab script. The XY data output is generated for every frame. The executable model of computation supported by Simulink provides a convenient framework to detect and correct errors in the algorithms and in the data passing. It forces the system to use a consistent timing scheme for passing data between processing modules. Finally, this model provided us with a basis for a hardware implementation, taking advantage of a pipeline scheme that could process dierent parts of a frame at the same time.

5. Hardware implementation After verifying the functional specication of the model using Simulink, hardware components are selected and the logic circuit is implemented in HDL, according to the hardware blocks depicted in Fig. 2. A Zoran sensor and Application Development board was used for the prototype implementation, as shown in Fig. 7. This board was handy and was sucient for the task. It has the right means for connecting the sensor and has USB connectivity for transferring the processing results. The use of an available hardware development platform minimizes the required hardware work and saves both time and money that would otherwise be required for a new PCB design. The Zoran board contains three Apex20K FPGA-s and several other components, much of it left here unused. The algorithm implementation consists of combinatorial logic and registers only and does not make any use of the special features the Altera Apex20K provides, such as PLLs and DSP blocks. The density and timing performance of the Altera device allows real-time operation in the sensors 30 MHz pixel rate with hardly

116

A. Amir et al. / Computer Vision and Image Understanding 98 (2005) 104123

Fig. 7. System prototype. The lens is circled with the on-axis IR LEDs. Most of the development board is left unused, except of one FPGA and a frame buer.

any optimization of the HDL code. A potentially better implementation, with much fewer components and possibly higher performance could be achieved by a dedicated PCB design. However, the design of a new dedicated PCB was beyond the scope of this project. Moreover, since the designed logic operates at pixel clock, the most suitable mass-production implementation would be to eliminate the FPGA altogether and to place the entire logic on the same silicon with the CMOS sensor, thus making it a single-chip eye-detection sensor. This would be the most suitable and cost eective commercial implementation for mass production. The image capture device is a Zoran 1.3 Mega pixel CMOS Sensor, mounted with a 4.8 mm/F2.8 Sunex DSL602 lens providing 59 eld of view. It provides a parallel video stream of 10 bits per pixel at up to 30 MPixels/s (a 30 MHz pixel clock), along with pixel-, line-, and frame-clocks. In video mode, the sensor can provide up to 60 progressive scan frames per second at NTSC resolution of 640 480. The control over the sensor is done through an I2C bus that can congure a set of 20 registers in the sensor to change dierent parameters, such as exposure time, image decimation, analog gain, line integration time, sub-window selection, and more. Each of the two near-infrared light sources is composed of a set of 12 HSDL-4420 subminiature LEDs. This is equivalent to the light reported elsewhere [16]. The LEDs are driven by power transistors controlled from the logic circuit to synchronize the on-axis, o-axis illumination to the CMOS sensor frame rate. The LEDs illuminate at near-infrared wavelength of 875 nm and are non-obtrusive to human, yet they exploit a relatively sensitive region of the imaging sensor spectrum (see Fig. 8, Fig. 9). The video is streamed to the FPGA where the complete algorithm was implemented. Only one of the three on-board available Altera Apex20K devices was used. It provided enough logic gates and operates at the required pixel-rate running speed. The FPGA controls the frame buer FIFO (Averlogic AL422B device with an easy

A. Amir et al. / Computer Vision and Image Understanding 98 (2005) 104123

117

Fig. 8. Results of one experiment. The detected (x, y) coordinates of four linearly moving targets along approx. 1500 frames are marked with blue + signs (forming a wide line due to their high plotting density). The deviation from the four superimposed second degree polynomial ts (red) is hardly noticeable due to the achieved sub-pixel accuracy. See Table 1 for the calculated error rates. (For interpretation of the references to color in this gure legend, the reader is referred to the web version of this paper.)

Fig. 9. Synthetic target (+ sign) and eye (. sign) detections are shown at 4 ft (left graph, total of 980 frames) and 6 ft (right graph, 870 frames) experiments. The eye-detection trajectory follows the head motion. For this subject, S13, head motion was much more noticeable at the second, 6 ft session. The synthetic targets are static and thus show no trajectory.

read and write pointers control). Communication with the application device, a PC computer in this case, is done through a Cypress USB1.1 controller. The USB provides an easy way to transfer pupil data to the PC for display and verication. It also

118

A. Amir et al. / Computer Vision and Image Understanding 98 (2005) 104123

Fig. 10. One frame of the systems monitor video output, generated by the FPGA, passed to an on board VGA converter, and projected at 60 fps on a white screen. This optional video channel is only used for debugging and demonstration purposes and is not required for regular operation. The gray-level input frame is directed to the red channel and detected pupils are superimposed as bright green marks. Image is captured by a digital camera from the projected screen. (For interpretation of the references to color in this gure legend, the reader is referred to the web version of this paper.)

allows easy implementation of the I2C communication and control of the sensor and of algorithm parameters. One limitation of the original prototype design was the lack of any video output, e.g., to a monitor. This was later solved using a VGA converter, also found on the development board. While this part has no role in the system operation, it is indispensable at the development and debugging phases. This is the only means to display the captured image, necessary to enable even the most basic operations such as lens focus and aperture adjustments. Since the FPGA is directly connected to the sensor, the video output must be originated by the FPGA. This video signal drives the red color channel of the VGA output (see Fig. 10).

6. Experimentation The prototype system was tested on both synthetic targets and human eyes. The synthetic targets provide accurate and controlled motion, xed pupil size and no

A. Amir et al. / Computer Vision and Image Understanding 98 (2005) 104123

119

blinking. Experiments with detection of human eyes provide some quantitative and qualitative data from realistic detection scenarios, however with no ground truth information to compare with. Synthetic targets consist of circular patches, 7 mm in diameter, made of a red reective material of the kind used for car safety reective stickers. These stickers are directional reectors, mimicking the eye retro-reection. Their reection is in general brighter than of a human eye, and can be detected by the prototype system at distances up to 12 ft. Two pairs of targets were placed on a piece of cardboard, attached to horizontal bearing tracks, and manually moved along a 25 cm straight line in front of the sensor, at dierent locations and distances (about 4060 cm). The system output (x, y) coordinates of the targets was logged and analyzed. Fig. 7 shows the result of one such test. Table 1 summarizes the results of 10 experiments, total of 15,850 frames. The average false positive detection rate is 1.07%. The average misses rate is 0.2%. Position errors are measured by the uctuation of data points from a second-order polynomial t to all the point locations of each of the targets in each experiment (ideally a linear t would do, however, due to lens distortion the trajectories are not linear in the image space). The RMS error is 0.9 pixels, and the average absolute error is 0.34 pixels. Note that the position error computation incorporates all outliers. Target 3 has a noticeably higher RMS error than the others. This was later found to be a result of using a rolling shutter. It was causing an overlap between the exposure time of pixels of on-axis and o-axis frames. We were able to overcome this problem later using sub-frame selection of a 640 480 region rather than the originally used frame decimation mode. However, a sensor with xed exposure would have an advantage in such an eye-detection system. A second lab experiment was conducted with human subjects. Four male subjects participated, of which two are White (S10, S13) and two are Asian (S11, S14). The experiment was done at two distances, 4 and 6 ft away from the sensor. Each subject was recorded at each distance for three sessions of about 35 s each, in which he was asked not to blink and not to move his head. A reference target of two circles was placed on the wall behind the subjects and was expected to be detected at each frame in the same location. The detection results of human eyes are summarized in Table 2. Thus we have only four subjects, two trends can be observed. First, the average eye misses rate at 6 ft was nearly as twice higher as at 4 ft. Second, while detection of Asian people

Table 1 Experimentation results with synthetic targets Target 1 2 3 4 Average Deletion rate 0.00000 0.00006 0.00378 0.00398 0.00195 Insertion rate 0.00151 0.01154 0.02415 0.00543 0.01066 Position err (RMS Pixels) 0.344 0.566 1.803 0.884 0.899 Position err (abs Pixels) 0.251 0.290 0.474 0.358 0.343

120

A. Amir et al. / Computer Vision and Image Understanding 98 (2005) 104123

Table 2 Experimentation results with detection of human eyes and synthetic targets s10 Age Race Target misses rate (8 ft) Eye misses rate (4 ft) Eye misses rate (6 ft) 24 White 0.0994 0.3825 0.1719 s11 34 Asian 0.0398 0.2906 0.6129 s13 41 White 0.0003 0.1633 0.5552 s14 28 Asian 0.1638 0.2596 0.768 White 0.0498 0.2729 0.3636 Asian 0.1018 0.2751 0.6905 Average 0.0758 0.274 0.527

eyes was similar to White people eyes at the 4 ft distance, it was much worse at the 6 ft distance. This qualitative dierence between Asian and White people is in agreement with the lower intensity of IR reected from eyes of Asian people [17]. Note that despite of the request not to blink, the eye misses rate does include several blinks, at both distances, which greatly aect the reported results. The reference target, located at 8 ft distance was missed a few times, sometimes due to occlusion by the subject head, and in other cases due to noise, but in general was well detected. Our experiment with human eyes was aected by the lack of ground truth. Without tracking, the system misses the eyes during eye blinks (no tracking between frames is used in any of the experiments). Eye location error cannot be evaluated as the true eye location is dynamic, changing with head motion and saccades. However, the experimentation with both synthetic and human eyes in covers both aspects of the system in a quantitative way. The system was demonstrated at the CVPR 2003 demo session [28]. Several hundred people passed in front of the system during 8 h of demonstration. A typical frame is shown in Fig. 10. The system worked continuously without a single reset, and was able to detect most of the people at distances between 2 and 5 ft. This prototype system is found noticeably less sensitive than previous, CCDequipped, software-based implementation we experimented with [16]. One of the reasons is the faster frame rate and the rolling shutter, which impose a shorter exposure time. Another reason is due to the optics and the lower sensitivity of the CMOS sensor. There is a tradeo between the lens speed (or aperture), the eld of view width, the amount of IR illumination, the exposure time (thus the maximal frame rate) and the overall system sensitivity. In this rst prototype we focused on making the processing work and have not exploited the full potential of a dedicated optical design of the system. It is expected that with continuous improvement of CMOS image sensors it will be possible to increase system performance.

7. Conclusions and future work The prototype system developed in this work demonstrates the ability to implement eye detection using simple logic. The logic can be pushed into the sensor chip, simplifying the connection between the two and providing a single-chip eye-detection sensor. Eliminating the need for image processing on the CPU is an important step towards mass production of commercial eye detectors. We believe that this out of the

A. Amir et al. / Computer Vision and Image Understanding 98 (2005) 104123

121

box thinking and novel implementation will promote the development of more aordable, yet high performance embedded eye-gaze tracking systems. The current system is only doing eye detection, not eye-gaze tracking. In order to support eye-gaze tracking, more features need to be extracted from the image, such as cornea glints (rst Purkinje image). Glints detection may be implemented in a similar manner to the pupil detection. Once the glint and pupil centers are extracted from the frame, a calibration-free eye-gaze computation can be carried on a computer or a PDA with minimal processing load. The proposed single-pass, highly parallel connected components algorithm, which is coupled with moments and shape properties computation, is suitable for a software implementation as well. This general-purpose algorithm would make an extremely ecient computation for real-time image analysis of various detection and tracking tasks. The advantages of hardware implementation over software/CPU are lower manufacturing price, ability to work at very high frame rates with a low processing clock, equal to the sensors pixel clock, and low operational complexity (it does not crash). The disadvantages are lower exibility for changes, more complicated development cycle and limited monitoring ability. This approach has a signicant value for low cost, mass-production manufacturing. Of a particular interest is the achieved processing speed. Operating at just a single computation cycle per pixel clock, this logic design achieves the theoretical computational lower bound for this task. Hence, the system is able to process relatively high data rates at very low processing clock. For example, at the current operating pixel rate of 30 MHz, and dynamic range of 10 bits/pixel, already represents a data rate of 300 Mbps. This rate is quickly approaching the capacity limit of conventional digital input channels such Firewire (400 Mbps) and USB-2 (480 Mbps), eliminated here altogether. This highly parallel, low clock rate implementation is a proof of concept for the great potential in hardware-based image processing and may result with a very low cost, high performance, mass-production system. In this paper, we explored one extreme of the possible range of implementation: a full hardware solution. Most prior work was based on software-only implementations. Our design methodology, based on the capture of the processing algorithms at a high level of abstraction, is useful for exploration of a much wider range of implementations, from a full software realization to intermediate solutions where part of the functionality of the algorithm is implemented in hardware and others in software. We wish to compare these intermediate implementations and to use automatic synthesis techniques to optimize the design and its implementations. The Metropolis framework under development at the Department of EECS of the University of California at Berkeley under the sponsorship of the Gigascale System Research Center can be used as an alternative to or in coordination with SimulinkMatlab to formally verify design properties and to evaluate architectures early in the design process as well as to automatically synthesize implementations. Future work may improve the system in several aspects. Work at high frame rates requires very sensitive sensors due to the short exposure time. This limits the distance range of eye detection in the current prototype. This decit may be addressed by a

122

A. Amir et al. / Computer Vision and Image Understanding 98 (2005) 104123

more appropriate optics. Higher-order moments would improve ltering of false positive pupil candidates. A xed shutter would help to reduce the IR illumination cycle and increase consistency illumination across the image. Altering between decimation and windowing at dierent resolutions can yield high frame rate tracking of known eyes and high resolution search for new targets. Last, adding the detection of cornea glints to the pupil detector would allow to compute gaze direction, hence making it a single-chip eye-gaze tracking sensor.

Acknowledgments We thank Zoran CMOS Sensor team and Video IP core team for providing the sensor module and for the support. We are also grateful to Tamar Kariv for the help with the USB drivers and PC application, and to Ran Kalechman and Sharon Ulman for their helpful advice with the logic synthesis.

References
[1] D.H. Ballard, C.M. Brown, Computer Vision, Prentice Hall, New Jersey, 1982, pp. 150152. [2] F. Balarin, L. Lavagno, C. Passerone, A. Sangiovanni-Vincentelli, M. Sgroi, Y. Watanabe, Modeling and designing heterogeneous systems, in: J. Cortadella, A. Yakovlev (Eds.), LNCS: Advances in Concurrency and Hardware Design, Springer-Verlag, Berlin, 2002, pp. 228273. [3] A. Cozzi, M. Flickner, J. Mao, S. Vaithyanathan, A comparison of classiers for real-time eye detection, in: ICANN, 2001, pp. 993999. [4] J.W. Davis, S. Vaks, A perceptual user interface for recognizing head gesture acknowledgements, in: IEEE PUI, Orlando, FL, 2001. [5] Andrew T. Duchowski (Ed.), Proc. Symp. on Eye Tracking Research and Applications, March 2002, Louisiana. [6] Y. Ebisawa, S. Satoh, Eectiveness of pupil area detection technique using two light sources and image dierence method, in: A.Y.J. Szeto, R.M. Rangayan (Eds.), Proc. 15th Annu. Internat. Conf. of the IEEE Engineering in Medicine and Biology Society, San Diego, CA, 1993, pp. 12681269. [7] I. Haritaoglu, A. Cozzi, D. Koons, M. Flickner, D. Zotkin, Y. Yacoob, Attentive toys, in: Proc. of ICME 2001, Japan, pp. 11241127. [8] A. Haro, I. Essa, M. Flickner, Detecting and tracking eyes by using their physiological properties, in: Proc. Conf. on Computer Vision and Pattern Recognition, June 2000. [9] E. Hjelm, B.K. Low, Face detection: a survey, Comput. Vision Image Understand. 83 (3) (2001) 236 274. [10] A. Hyrskykari, Gaze control as an input device, in: Proc. of ACHCI97, University of Tampere, 1997, pp. 2227. [11] T. Hutchinson, K. Reichert, L. Frey, Humancomputer interaction using eye-gaze input, IEEE Trans. Syst., Man, Cybernet. 19 (1989) 15271533. [12] R.J.K. Jacob, The use of eye movements in humancomputer interaction techniques: what you look at is what you get, ACM Trans. Inform. Syst. 9 (3) (1991) 152169. [13] A. Kapoor, R.W. Picard, Real-time, fully automatic upper facial feature tracking, in: Proc. 5th Internat. Conf. on Automatic Face and Gesture Recognition, May 2002. [14] R. Kothari, J.L. Mitchell, Detection of eye locations in unconstrained visual images, in: Proc. ICPR, vol. I, Lausanne, Switzerland, September 1996, pp. 519522.

A. Amir et al. / Computer Vision and Image Understanding 98 (2005) 104123

123

[15] C. Morimoto, D. Koons, A. Amir, M. Flickner, Pupil Detection and Tracking Using Multiple Light Sources, Workshop on Advances in Facial Image Analysis and Recognition Technology, Fifth European Conference on Computer Vision (ECCV 98), Freiburg, June 1998. [16] C. Morimoto, D. Koons, A. Amir, M. Flickner, Pupil detection and tracking using multiple light sources, Image Vision Comput. 18 (4) (2000) 331336. [17] K. Nguyen, C. Wagner, D. Koons, M. Flickner, Dierences in the infrared bright pupil response of human eyes, in: Proc. Symp. on ETRA 2002: Eye Tracking Research and Applications Symposium, New Orleans, Louisiana, 2002. [18] K. Perlin, S. Paxia, J.S. Kollin, An autostereoscopic display, in: Proc. of the 27th Annu. Conf. on Computer Graphics and Interactive Techniques, July 2000, pp. 319326. [19] D.D. Salvucci, J.R. Anderson, Intelligent gaze-added interfaces, in: Proc. of ACM CHI 2000, New York, 2000, pp. 265272. [20] A. Sangiovanni-Vincentelli, Dening platform-based design, EE Design, March 5, 2002. [21] L. Stark, StephenEllis, Scanpaths revisited: cognitive models direct active looking, in: D.F. Fisher, R.A. Monty, J.W. Senders (Eds.), Eye Movements: Cognition and Visual Perception, Erlbaum, Hillside, NJ, 1981. [22] R. Stiefelhagen, J. Yang, A. Waibel, A model-based gaze tracking system, in: Proc. of the Joint Symp. on Intelligence and Systems, Washington, DC, 1996. [23] A. Tomono, M. Iida, Y. Kobayashi, A TV camera system which extracts feature points for noncontact eye movement detection, in: Proc. of the SPIE Optics, Illumination, and Image Sensing for Machine Vision IV, vol. 1194, 1989, pp. 212. [24] R. Vertegaal, I. Weevers, C. Soln, Queens Univ., Canada Gaze-2: an attentive video conferencing system, in: Proc. CHI-02, Minneapolis, 2002, pp. 736737. [25] L. Young, D. Sheena, Methods and designs: survey of eye movement recording methods, Behav. Res. Methods Instrum. 7 (5) (1975) 397429. [26] S. Zhai, C. Morimoto, S. Ihde, Manual and gaze input cascaded (MAGIC) pointing, in: Proc. ACM CHI-99, Pittsburgh, 1520 May 1999, pp. 246253. [27] W.Y. Zhao, R. Chellappa, A. Rosenfeld, P.J. Phillips, Face recognition: a literature survey, UMD CfAR Technical Report CAR-TR-948, 2000. [28] L. Zimet, S. Kao, A. Amir, A. Sangiovanni-Vincentelli, A real time embedded system for an eye detection sensor, in: Proc. of Computer Vision And Pattern Recognition CVPR-2003 Demo Session, Wisconsin, June 1622 2003.