A General Probabilistic Framework for Volumetric Articulated Body Pose Estimation and Driver Gesture, Activity and Intent Analysis for Human-Centric Driver Assistance
Shinko Y. Cheng
Computer Vision and Robotics Research Laboratory
University of California, San Diego
Dissertation Research Advisor
Prof. Mohan M. Trivedi
Human gestures are said to be as rich and varied as spoken language itself, manifesting from the physical human body as ideas, interests, feelings and intentions. There is much to be gained from a thorough understanding of gesture and pose in improving the efficiency of information flow between the ever-ubiquitous computer and its human operator. Such an understanding gives the operator more natural methods for generating commands. A unique quality of gestures is that some can be unconsciously “uttered” in the course of human thought. These thoughts, these interests, feelings and intentions, whether made consciously or otherwise, could conceivably be inferred by observing the pose of the human body.
pose → gesture → desire, intent, activity
In this thesis, we investigate technologies that enable intelligent systems to recognize human desires and intents. We devise systems that automatically recover human pose and gesture information with emphasis on applications that improve the safety and comfort of vehicles.
One set of human gestures that has gained much attention from researchers of intelligent vehicles is driving gestures. An understanding of these gestures has significant implications for the safety and comfort of an activity that nearly all of us partake in every day. Consider that 2.3 million vehicle crashes and over 43,000 fatalities are expected to occur on U.S. roads this year. Driver inattention was determined to be the contributing factor in an estimated 32% of all collisions between 2002 and 2003. Inattention is characterized by drivers performing tasks secondary to driving, including accessing a wireless device, activities involving other passengers, personal hygiene, dining, and other distractions from within the vehicle. Driver gesture recognition would enable the vehicle’s computer to automatically determine whether the driver is being inattentive in nearly 700,000 of the collision situations expected to occur. Upon determining that a driver is inattentive, the computer can be tasked with co-pilot-like responsibilities to assist and warn the driver upon detecting dangerous situations, thus making the vehicle safer to drive.
As an example of how gesture recognition may help, imagine a scenario in which a motorist turns right at an intersection with a bicyclist in his blind spot. If the vehicle could detect the bicyclist in its path, it could alert the driver and potentially avoid the collision. Driver gesture recognition becomes a critical element of the safety system once we realize that, had the driver already seen the bicyclist, an alert might hinder rather than complement safe driving by unnecessarily increasing the driver’s workload. If the vehicle were sensitive to this fact and able to sense that the driver had indeed failed to notice the bicyclist, it could more confidently alert the driver to the danger or even take over. For an intelligent driver-assistance system to intervene effectively in dangerous situations, it must continuously monitor not just the surrounding environment and vehicle state, but also the driver’s gestures. Despite the relatively confined seat where the driver operates, there are still numerous activities a driver may perform in the course of normal driving. We focus our efforts on two: approaching intersections and accessing the infotainment system. Both kinds of gestures are important for driving safety. Intersection-turn intent can generate more accurate warnings in collision situations, as our bicyclist example showed. User determination helps drivers focus on the primary driving task by selectively restricting their access to the increasingly popular vehicle infotainment system during critical driving situations, all the while allowing the passenger full access.
For each gesture, there are two main challenges: 1) characterize the gestures that indicate the onset or presence of the activity, and 2) develop the computer vision and pattern recognition techniques that automatically extract these gestures. The two challenges could be addressed together if the measurement devices were straightforward to construct. In the case of user determination, the measurement device is the extraction module being developed. In the case of intersection turns, the measurement device is complicated and itself a long-term research problem, so a different measurement device had to be used. We report on our characterization of the intersection-turn and other intersection approaches. In order to automatically recognize these driver gestures, and gestures in general, the system requires some description of the driver’s body pose. Depending on the needs of the application, the description can explicitly describe the skeletal configuration of the driver’s body, i.e. joint position, joint type, and body part orientation, or be simplified to describe just the position or appearance of body parts. Nearly two decades of research in estimating the pose of articulated human bodies from images have produced promising results, but none sufficiently addresses the requirements of the vehicle environment. For vision sensors, external lighting conditions affect the appearance of the captured images and therefore require a processing algorithm that is invariant or robust to such illumination changes. Therefore, as part of the second challenge, we set out to understand the type of descriptions that are best suited to the driver gesture recognition task in the vehicular environment. We devised three techniques, ranging from the highest-fidelity but fragile pose estimates to the lower-fidelity but most robust. The first is based on volumetric reconstructions of the articulated body. The second is an image-based hand detection and tracking system using long-wavelength infrared imagery. The third is an image-based hand presence and hand identity (passenger/driver) recognition system using visible and near-infrared images.
1. The thesis begins with the results of our investigation into pose estimation. We turned our attention towards using 3-D volumetric reconstructions from subject images as the means to extract subject pose. A collection of algorithms that reconstruct the 3-D structure of the scene from captured images of multiple views of the subject, shape-from-X algorithms have gathered much attention in recent years. One of their unique features is that 3-D reconstructions are consistent; that is to say, there exist reconstruction techniques that are robust to illumination changes, which is very important in the vehicle environment. Furthermore, volumetric reconstructions can be highly detailed by exploiting the number of views and the numerous image features such as texture, color, and edges. The problem that remains is devising a way to use these reconstructions to estimate subject pose.
We present a novel method for learning and tracking the pose of an articulated body by observing only its volumetric reconstruction. The model is called the kinematically constrained Gaussian mixture model (kc-gmm). Pairs of components connected at a joint are encouraged to assume a particular spatial configuration, forming joints with 1, 2, and 3 degrees-of-freedom (DOF). Pose estimates are learned using the EM algorithm. The novelty is in combining, in a principled manner and within this statistical framework, the previously separate iterative steps of segmenting the volumetric reconstruction, associating segments with body parts, and estimating body part pose. The approach is also the first to be evaluated using a publicly available common human image data-set with optical motion capture ground-truth. The data-set provides an opportunity to quantitatively compare different approaches, and this approach is the first volume-based approach to yield quantitatively comparable results. The algorithm achieved competitive estimates with a mean joint position error of 15.9 cm, or 8% of the total length of the body. On synthesized hand data, the error was 0.5 cm, or 1.5% of the total length. [IEEE ICVS 2006, IEEE CVPR 2007 EHum2 Workshop Best Paper]
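To make the EM step concrete, the following is a minimal sketch of plain EM for a Gaussian mixture with isotropic covariances, fitted to 3-D points standing in for occupied voxels. It is not the kc-gmm itself: the kinematic constraints between joined components are deliberately omitted, and the function name, the isotropic-covariance choice, and the fixed iteration count are assumptions made only for illustration.

```python
import math

def em_isotropic_gmm(points, init_means, n_iter=25):
    """Plain EM for an isotropic Gaussian mixture over 3-D points.

    Illustrative only: no kinematic constraints are applied between components.
    """
    k, dim = len(init_means), len(points[0])
    means = [list(m) for m in init_means]
    variances = [1.0] * k          # one scalar variance per component (assumption)
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: soft assignment (responsibility) of every point to every component.
        resp = []
        for p in points:
            logs = []
            for j in range(k):
                d2 = sum((p[a] - means[j][a]) ** 2 for a in range(dim))
                logs.append(math.log(weights[j])
                            - 0.5 * dim * math.log(2.0 * math.pi * variances[j])
                            - 0.5 * d2 / variances[j])
            top = max(logs)                       # subtract max for numerical stability
            probs = [math.exp(v - top) for v in logs]
            total = sum(probs)
            resp.append([v / total for v in probs])
        # M-step: re-estimate weight, mean, and variance of each component.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            means[j] = [sum(r[j] * p[a] for r, p in zip(resp, points)) / nj
                        for a in range(dim)]
            d2sum = sum(r[j] * sum((p[a] - means[j][a]) ** 2 for a in range(dim))
                        for r, p in zip(resp, points))
            variances[j] = max(d2sum / (dim * nj), 1e-6)
            weights[j] = nj / len(points)
    # Hard segmentation: each point goes to its most responsible component.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return means, variances, weights, labels
```

In this sketch the soft assignment in the E-step plays the role of segmenting the points and associating them with parts, while the M-step updates each part's position. A kinematically constrained variant would additionally couple the parameters of components that share a joint.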
2. We next turned our attention towards the characterization and recognition of driver intent with driver gestural cues. The driving activity is the intersection approach, and the cues used to recognize driver intent consist of driver and vehicular signals. Driver signals consist of left and right hand position, head position, and head orientation in 3-D from a highly accurate but ultimately impractical marker-based optical motion capture system. Vehicle signals include speed, throttle, braking, steering angle, and turn signals. We collected this data over a 4-hour drive with 258 intersections. Among the types of intersection approaches, we identified the “slow” intersection-turn maneuvers as having very consistent attributes.
We present a classification system and answer the question: to what extent does driver bodily movement contribute to the prediction of driver intersection-turn maneuvers? The classification system operates on a window of these signals over time, like a causal filter. The system is trained to provide a high response when the pattern of driver movements and vehicle states closely matches the pattern of an impending intersection turn. The system uses the kernel relevance vector machine as the statistical model. The system is able to predict whether a driver will attempt a left- or right-hand intersection turn before the vehicle enters the intersection with a true-positive rate of 80% at a 7% false-positive rate, and 100% at a 20% false-positive rate. We also present results comparing different combinations of input cues. The analysis from that experiment concluded that turn signals do not contribute to the recognition of an impending intersection turn in the presence of these other signals. We also proposed a novel set of performance metrics for this and similar systems: the ROC Area vs. Decision Time and Average Response Over Time plots. These plots quantify the portion of turns that the algorithm can actually predict 500, 1000, and 1500 ms before the vehicle enters the intersection. The concepts are generic and apply to the study of other driving maneuvers as well. [IEEE Pervasive Computing 2007]
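The true-positive and false-positive rates quoted above are points on an ROC curve, and an ROC-area-versus-decision-time plot can be built by computing the area once per lead time. The sketch below shows one way to do this under simple assumptions: classifier responses are plain floats, labels are 1 for a turn and 0 otherwise, and the area is taken by the trapezoidal rule. All function names and the data layout are hypothetical and not taken from the thesis.

```python
def roc_points(scores, labels):
    """Sweep a threshold over the scores; return (fpr, tpr) pairs with fpr non-decreasing."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def roc_area(points):
    """Trapezoidal area under a list of (fpr, tpr) points ordered by fpr."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

def area_by_lead_time(scores_by_lead_ms, labels):
    """ROC area for each decision lead time, e.g. {500: [...], 1000: [...]} (hypothetical layout)."""
    return {lead: roc_area(roc_points(scores, labels))
            for lead, scores in scores_by_lead_ms.items()}
```

Plotting the values returned by `area_by_lead_time` against the lead time gives a curve of the kind described, where one would expect the area to fall as the decision is made earlier.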
3. The driver-intent recognition algorithm assumes the use of body part position information. To address this requirement, we present an in-vehicle system for detecting and tracking the position of the left and right hands in long-wavelength infrared imagery. The novelty of this approach is in the use of infrared imagery. Skin emits a relatively constant amount of infrared radiation and therefore appears relatively constant in radiometric infrared cameras. This increases the similarity among the images of hands used to train the object detector (based on Haar-wavelet-like features and AdaBoost). The detections were tracked with a Kalman tracker, and the left and right hands were disambiguated using expected hand location constraints. The system was effective in tracking left and right hands over 90 minutes of driving. Combined with steering information, 5 hand activities over the steering wheel could also be determined. [CVIU 2007]
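As an illustration of the tracking stage, the following is a minimal sketch of a Kalman filter with a constant-velocity motion model for one image coordinate of a detected hand. The motion model, the noise values, and the class name are assumptions chosen for illustration; the abstract does not state which state model or parameters the thesis tracker uses.

```python
class ConstantVelocityKalman1D:
    """Kalman filter for one coordinate with state [position, velocity]."""

    def __init__(self, pos, vel=0.0, p=100.0, q=1.0, r=4.0):
        self.x = [pos, vel]                 # state estimate
        self.P = [[p, 0.0], [0.0, p]]       # state covariance
        self.q = q                          # process noise, Q = q * I (assumption)
        self.r = r                          # measurement noise variance (assumption)

    def predict(self, dt=1.0):
        pos, vel = self.x
        self.x = [pos + vel * dt, vel]
        (a, b), (c, d) = self.P
        # P <- F P F^T + Q, with F = [[1, dt], [0, 1]]
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]

    def update(self, z):
        (a, b), (c, d) = self.P
        s = a + self.r                      # innovation variance, H = [1, 0]
        k0, k1 = a / s, c / s               # Kalman gain
        resid = z - self.x[0]
        self.x = [self.x[0] + k0 * resid, self.x[1] + k1 * resid]
        # P <- (I - K H) P
        self.P = [[(1.0 - k0) * a, (1.0 - k0) * b],
                  [c - k1 * a, d - k1 * b]]
```

A 2-D hand position could be tracked in this sketch by running one such filter per image axis, calling `predict()` every frame and `update(z)` only on frames where the detector fires, so that the prediction carries the track through missed detections.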
4. Finally, we present an in-vehicle system for determining which occupant is accessing the vehicle infotainment controls, for modulating information flow from the vehicle’s information display. A camera is placed over the occupants in the front-row seat area. Using histogram-of-oriented-gradients image features to describe the area over the controls, a support vector machine determined whether the observed image patch contained the hand of the driver, the passenger, or no one. An average correct classification rate of 97.8% was achieved in real time over 60 minutes of video at 30 frames per second under a rigorous set of moving-vehicle operating conditions, including night time, various directions of the sun, and overcast conditions. The most intriguing result from this final investigation is that, despite the amount of variation in the appearance of the hands over the controls as a result of illumination change, such high rates of correct classification can be achieved. This system effectively traded fewer classes for more variance in the data. [IEEE IV 2008, IEEE ITS 2010]
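To give a sense of what a histogram-of-oriented-gradients feature captures, the sketch below computes a single gradient-orientation histogram over one grayscale patch. It is a one-cell simplification and not the full descriptor: there is no division into cells, no block normalization, and no classifier. The function name, the 9-bin unsigned-orientation layout, and the list-of-rows image format are assumptions for illustration only.

```python
import math

def orientation_histogram(patch, n_bins=9):
    """L2-normalized histogram of unsigned gradient orientations for a 2-D patch.

    `patch` is a list of rows of grayscale values; border pixels are skipped.
    """
    rows, cols = len(patch), len(patch[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = patch[r][c + 1] - patch[r][c - 1]   # central difference in x
            gy = patch[r + 1][c] - patch[r - 1][c]   # central difference in y
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # fold to [0, 180)
            b = min(int(angle / bin_width), n_bins - 1)
            hist[b] += mag                           # vote weighted by gradient magnitude
    norm = math.sqrt(sum(h * h for h in hist)) or 1.0
    return [h / norm for h in hist]
```

Because each vote depends on local intensity differences and the histogram is normalized, a uniform brightness offset or gain change leaves the feature unchanged in this sketch, which is one plausible reason such features tolerate illumination variation. A three-class classifier (driver, passenger, no one) would then be trained on vectors of this kind.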
The central theme of this dissertation has been enabling intelligent vehicles to recognize human desires and intents in order to improve the safety and comfort of vehicles. We have made headway in articulated body pose estimation, driver intersection-turn intent recognition, in-vehicle hand tracking using thermal infrared imagery, and finally in-vehicle vision-based user determination using hand images for infotainment control safety. We conclude with many interesting new systems and questions answered, but there is still much to be learned and there are still problems to be solved. A list of future research directions is given in the dissertation. There is supplemental discussion of the investigations into the Active Heads-up Display for providing visual feedback to the driver in an augmented reality fashion [IEEE Computer 2006], and of the specifications of the LISA-P test-bed from which all on-road human subject data was collected. With the presented concepts, algorithms, and systems for the automatic recovery of human pose, gesture, and intent information, we hope to have strengthened the reader’s belief in the significance of enabling an intelligent vehicle to know what we are going to do in critical situations, and in how within reach that capability has become.
Last Modified: 2008.07.08