Ever since the advent of the still camera, the capture and replay of moving images has been an obsession for many. For most of that time, though, consumer viewing was dominated by commercial cinema and television. Then it all changed a few years ago. By combining Internet sharing services that could host video content with the ability to capture video clips on mobile phones, video production and consumption exploded.

To cite one example, the upload rate to YouTube apparently now exceeds 24 hours of new content per minute. This is driven partly by a social trend, and partly by mobile phones themselves. People generally only take digital still cameras and camcorders to events where they think they’re likely to capture images, such as a wedding or a gig.

Mobile phones are carried all the time and, hence, are available for spontaneous use. Such spontaneously captured content now dominates the homemade video space. To wit, worldwide camcorder sales in 2010 are expected to come in at around 20 million units, while camera-phone sales will reach approximately 1.2 billion.

Mobile-Phone Cameras

Cameras found in mobile phones (camera phones) differ dramatically from those used in digital still cameras and camcorders. Although they perform the same function and operate in the same manner, the technology differs in ways driven by the form factor of the housing.

Thin is in for mobile phones, which places severe constraints on the optics design and power availability, the latter due to battery size. All things being equal, the quality of an imaging system is dictated by the lens diameter and imager resolution. Larger lenses result in taller camera modules, while higher-resolution sensors require more power than imagers of lower resolution. It also takes more processor cycles and, thus, battery resources to manipulate larger images.


In recent years, the “megapixel race” has overshadowed digital image quality. The relationship between megapixels, perceived image quality, and price can be traced back to when the first solid-state imagers were introduced to the market.

Moving from a common intermediate format (CIF) resolution camera to a video graphics array (VGA) format provides three times the number of pixels and a very noticeable improvement in picture quality. The logic then follows that an 8-Mpixel camera should offer better quality than a 3-Mpixel camera, and a greater number of megapixels is easier to market.
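
The arithmetic behind those resolution comparisons is easy to verify. This short C snippet simply works out the pixel counts (the formats are standard; nothing here is camera-specific):

```c
/* Pixel counts for the formats discussed in the text. */
#include <stdio.h>

int main(void) {
    long cif = 352L * 288;      /* CIF:   352 x 288          */
    long vga = 640L * 480;      /* VGA:   640 x 480          */
    long hd  = 1920L * 1080;    /* 1080p HD, for comparison  */

    printf("CIF: %ld pixels\n", cif);                           /* 101,376   */
    printf("VGA: %ld pixels (%.1fx CIF)\n", vga, (double)vga / cif);
    printf("HD : %ld pixels (~2 Mpixels)\n", hd);               /* 2,073,600 */
    return 0;
}
```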

However, the constraints on camera module height in mobile phones mean that beyond a few megapixels, the potential benefits of high-resolution imagers are rapidly lost. Consumers are now recognizing this fact, causing megapixel count to stabilize at around 2 to 3 Mpixels for low-end camera phones and 5 to 8 Mpixels for high-end versions. Thanks to electronic reconfiguration, imagers in this resolution range can capture high-definition (HD) video with the benefit of an over-specified lens train, so the raw image quality is generally excellent.

Video Quality

The wide viewing audience of Internet video no doubt separates it from previous generations of homemade film and video. Therefore, quality matters. The quality of a video image can be divided into two parameters: the raw attributes of the image, and the aesthetics (or artistry) of the composition.

Raw image quality is easily defined and measured. It includes parameters such as resolution, frame rate, distortion, color balance, and saturation. As with digital still cameras and TV screens, perceived image quality banks heavily on resolution.


Most consumers aspire to HD video at 30 or more frames/s. An HD video frame has roughly the resolution of a 2-Mpixel digital still picture. As a result, HD video can capture all aspects of a scene in adequate detail for most purposes.

Unfortunately, this often includes details we would prefer not to see. HD video is particularly good at revealing skin blemishes and flaws, spots, poor teeth, and eye color, as well as a host of other personal features that most consumers would prefer remain hidden.

Though the camera can generally be relied upon to acquire a good-quality image, how the camera is used (such as the angle it's held at) might be considered solely the user's responsibility. That isn't entirely the case. For example, camera shake in the captured footage can be visually very disturbing. It's exaggerated by zoom and ideally needs to be suppressed. The film The Blair Witch Project is a rare example of its deliberate use (to many negative reviews, too).

Camera shake on mobile phones tends to be particularly bad simply because of the way the devices are held during video capture, namely in one hand at the end of a partly extended arm. Digital still cameras, by contrast, are ergonomically designed to be held in two hands, while camcorders are held in a crooked arm; both are far more stable platforms.

Moreover, discrete hardware components help to address the shaking problem on both digital cameras and camcorders. Unfortunately, these solutions are expensive, and they're often too bulky and power-hungry for a small form-factor device like a mobile phone with its limited battery capacity.


Because camera shake and content flaws can't be corrected at the source, there's a need to remove them from the video stream either during or after capture. At the same time, various filters and other effects can be applied, scenes can be cropped or moved, and the soundtrack can be edited. This is relatively simple on a fast desktop computer with plenty of storage capacity, and many suitable commercial and freeware programs are available.

A mobile platform is a much more challenging environment. Not only is the processor's capability limited, but many CPU cycles are simply unavailable because they must support the background phone functions. Also, the memory bus is unlikely to be either fast or wide, memory itself is restricted in size, and the user input interface might not even include a keyboard. For this reason, as many corrections and effects as possible must be accomplished automatically and in a manner that doesn't require processing of individual frames.

Automated Video Tools

So, the question is how to deploy a rich range of video-processing applications on small devices, all running in real time on 30-frame/s HD video. One solution is to process captured video in real time by using ancillary information to simplify the task of video processing.

For example, by measuring and recording a vector displacement between consecutive frames, unintentional motion induced by hand movement (e.g., camera shake) can be easily rectified without the need to process individual video frames. This means it’s possible to implement applications such as video image stabilization on devices with very limited system resources, yet still dramatically improve the perceived quality of the recorded video.
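
As a rough illustration of where that displacement vector might come from, the C sketch below estimates global frame-to-frame motion with a simple block-matching search. The QVGA frame size, the subsampled comparison grid, and the search radius are all illustrative assumptions, not a description of any shipping implementation:

```c
/* Estimate the global displacement between two consecutive luma frames:
 * try each small (dx, dy) offset and keep the one with the lowest sum of
 * absolute differences (SAD). */
#include <stdlib.h>
#include <limits.h>

#define W 320
#define H 240
#define RADIUS 8   /* search +/- 8 pixels on each axis */

void global_motion(const unsigned char prev[H][W],
                   const unsigned char curr[H][W],
                   int *best_dx, int *best_dy) {
    long best_sad = LONG_MAX;
    for (int dy = -RADIUS; dy <= RADIUS; dy++) {
        for (int dx = -RADIUS; dx <= RADIUS; dx++) {
            long sad = 0;
            /* Compare only the overlap region, on a sparse grid for speed. */
            for (int y = RADIUS; y < H - RADIUS; y += 4)
                for (int x = RADIUS; x < W - RADIUS; x += 4)
                    sad += labs((long)curr[y][x] - prev[y + dy][x + dx]);
            if (sad < best_sad) {
                best_sad = sad;
                *best_dx = dx;
                *best_dy = dy;
            }
        }
    }
}
```

Recording one such vector per frame is all the stabilizer needs; the frames themselves never have to be revisited.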


The algorithm used to determine the vector displacement needs to be reasonably sophisticated, since it must correct for camera shake while preserving camera panning and/or the motion of objects within the frame. Fortunately, camera shake is a relatively high-frequency jitter compared with panning and object motion, so the two can be distinguished by the application of filters (Fig. 1). Other frame-to-frame variations, such as dynamic range, saturation, and color balance, can be corrected in the same way, and visual effects like diffusion and sepia tones can be applied similarly.
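
One minimal way to realize such a filter is an exponential moving average that tracks the slow pan component of the measured motion; whatever remains above that estimate is treated as shake. In the sketch below, the smoothing factor and the sample motion data are both made-up illustrations:

```c
/* Separate high-frequency jitter from intentional panning with a simple
 * low-pass filter over the per-frame motion measurements. */
#include <stdio.h>

#define ALPHA 0.1   /* smoothing factor: smaller = slower pan tracking */

int main(void) {
    /* Example horizontal motion: a steady 2-pixel/frame pan plus jitter. */
    double measured[] = { 2.0, 5.0, -1.0, 2.0, 6.0, -2.0, 2.0, 3.0 };
    double pan = 0.0;   /* low-pass estimate of intentional motion */

    for (int i = 0; i < 8; i++) {
        pan += ALPHA * (measured[i] - pan);   /* update the pan estimate  */
        double jitter = measured[i] - pan;    /* high-frequency remainder */
        /* The frame is shifted by the negated jitter to cancel the shake. */
        printf("frame %d: pan %.2f, correction %.2f px\n", i, pan, -jitter);
    }
    return 0;
}
```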

Far more challenging to correct are the aforementioned person-related defects. The processing limitations of mobile platforms make some shortcuts necessary. Take, for example, the fact that humans are very face-centric. If an image contains a face, we lock onto that face first. Studies show that provided the face is well presented (in focus, properly exposed, and properly colored), we're easily satisfied with the whole image. This means only the pixels occupied by the principal faces in the image require processing, which is obviously far less intensive than processing the complete frame.

The first challenge of face-based imaging is to identify the faces in the scene. Mathematically, this isn't a trivial exercise, not least because of the diversity of faces, further complicated by profile, glasses, hats, earrings, and other factors. In this case, the shortcut is not to try to identify the faces in each frame, but rather to track the location of faces from frame to frame, which reduces the data handling to a simple vector.
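
That shortcut can reuse the same sum-of-absolute-differences matching shown earlier for stabilization, this time confined to the old face rectangle and a few pixels around it. In the illustrative sketch below, the frame size and search radius are again assumptions, and the caller is expected to keep the rectangle at least R pixels inside the frame:

```c
/* Track a face by sliding its previous rectangle over a small neighborhood
 * in the new frame; the best SAD match gives the face's motion vector. */
#include <stdlib.h>
#include <limits.h>

#define W 320
#define H 240
#define R 4   /* faces move only a few pixels between frames */

typedef struct { int x, y, w, h; } Rect;

Rect track_face(const unsigned char prev[H][W],
                const unsigned char curr[H][W], Rect f) {
    long best = LONG_MAX;
    int bdx = 0, bdy = 0;
    for (int dy = -R; dy <= R; dy++)
        for (int dx = -R; dx <= R; dx++) {
            long sad = 0;
            for (int y = 0; y < f.h; y++)
                for (int x = 0; x < f.w; x++)
                    sad += labs((long)prev[f.y + y][f.x + x]
                              - curr[f.y + dy + y][f.x + dx + x]);
            if (sad < best) { best = sad; bdx = dx; bdy = dy; }
        }
    f.x += bdx;   /* slide the rectangle along the measured motion */
    f.y += bdy;
    return f;
}
```

The expensive full detector then only needs to run occasionally, to acquire new faces or re-lock after the track drifts.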

Once the faces are located, many interesting features become possible. Skin tones can be smoothed. Flaws like wrinkles, spots, and freckles can be eliminated or toned down. And, on the whimsical side, faces can be morphed and warped (Fig. 2).
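
To see why face-only processing is cheap, consider a crude skin smoother: a 3-by-3 box blur applied only inside the face rectangle. In this illustrative sketch, a 100-by-100-pixel face in a 720p frame means touching roughly 1% of the frame's pixels:

```c
/* Blur only the pixels inside the face rectangle r; the rest of the frame
 * is untouched. A real smoother would be edge-aware, but the cost argument
 * is the same. */
#include <string.h>

#define W 1280
#define H 720

typedef struct { int x, y, w, h; } Rect;

void smooth_face(unsigned char img[H][W], Rect r) {
    /* Scratch copy so the blur reads unmodified pixels. A production
       version would copy just the rectangle, not the whole frame. */
    static unsigned char tmp[H][W];
    memcpy(tmp, img, sizeof(tmp));

    for (int y = r.y + 1; y < r.y + r.h - 1; y++)
        for (int x = r.x + 1; x < r.x + r.w - 1; x++) {
            int sum = 0;
            for (int dy = -1; dy <= 1; dy++)      /* 3x3 neighborhood */
                for (int dx = -1; dx <= 1; dx++)
                    sum += tmp[y + dy][x + dx];
            img[y][x] = (unsigned char)(sum / 9);
        }
}
```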


One of the benefits of face identification in a video stream is that a face provides orientation data, offering a simple solution for auto-rotation of handsets without requiring additional hardware. Another benefit is auto focus. As mentioned, images are perceived to look better if the faces are in focus. Therefore, by restricting the data fed to the auto-focus algorithm to just the principal face, it can be made to run faster. As a result, less power is consumed, and video quality is improved.
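
A face-restricted focus metric might look like the sketch below: a standard contrast measure (summed squared horizontal gradients) evaluated over the face rectangle alone, once per lens step, with the lens driven toward the maximum score. The dimensions are illustrative:

```c
/* Contrast-based focus score over the face rectangle only.
 * Higher score = sharper face; the auto-focus loop steps the lens
 * to maximize it. */
#define W 1280
#define H 720

typedef struct { int x, y, w, h; } Rect;

long focus_score(const unsigned char img[H][W], Rect r) {
    long score = 0;
    for (int y = r.y; y < r.y + r.h; y++)
        for (int x = r.x; x < r.x + r.w - 1; x++) {
            int g = (int)img[y][x + 1] - img[y][x];   /* horizontal gradient */
            score += (long)g * g;
        }
    return score;
}
```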

Face-based imaging also offers the potential of face recognition. It's now possible to automatically tag still pictures and video files with the people they contain, as well as other aspects of the scene, such as whether the background features a beach or mountains. Video that's electronically catalogued by content opens new ways of searching, sharing, and interacting on social networking sites, and these methods are just beginning to be explored.

Face-recognition algorithms tend to use one of two approaches. One describes faces by identifying key features and determining the spatial relationships between them. For example, an algorithm may analyze the relative positions and sizes of the eyes, nose, cheekbones, and jaw. A subtly different approach involves identifying and mathematically describing distinctive features, such as the contour of the eye sockets and the shape of the nose and chin, as well as unique lines and spots.

Both produce a small mathematical description that’s used to search a database for closely matching values. Note that face recognition is very different from face identification, which is a topic of interest to law-enforcement and border-control agencies, among other organizations.
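
In code, the geometric variant can be pictured as a short descriptor of distance ratios per face plus a nearest-neighbor scan of the enrolled database. In this sketch, the descriptor length, its contents, and the match threshold are all illustrative assumptions:

```c
/* Match a probe face descriptor against a database of enrolled faces.
 * Each descriptor holds a few scale-invariant ratios of landmark
 * distances (e.g., eye spacing / face width). */
#include <float.h>

#define DESC_LEN 4

/* Squared Euclidean distance between two descriptors. */
static double desc_dist(const double a[DESC_LEN], const double b[DESC_LEN]) {
    double d = 0.0;
    for (int i = 0; i < DESC_LEN; i++)
        d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

/* Return the index of the closest enrolled face, or -1 if none is close. */
int best_match(const double probe[DESC_LEN],
               const double db[][DESC_LEN], int n, double threshold) {
    int best = -1;
    double best_d = DBL_MAX;
    for (int i = 0; i < n; i++) {
        double d = desc_dist(probe, db[i]);
        if (d < best_d) { best_d = d; best = i; }
    }
    return (best_d <= threshold) ? best : -1;
}
```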

Video image capture need not necessarily result in a video output. One recent innovation uses a live video stream to create a large still picture by stitching the frames together. A common application is to create panoramic photographs, although the stitching can also be vertical or both axes can be scanned to produce an oversize image.


Video frame stitching requires on-the-fly pixel merging together with some intelligent decision-making. That's because moving items (e.g., cars in the background) need to be frozen, and the aspect ratio of objects in the foreground must be adjusted to remain in proportion to those in the background.
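
Stripped of seam blending and perspective correction, the core bookkeeping of a horizontal stitch can be sketched as follows: each new frame contributes only the strip of columns revealed by the measured per-frame displacement. The frame and panorama sizes here are illustrative:

```c
/* Append the dx newly revealed columns of a frame to a growing panorama
 * (left-to-right pan assumed). Real stitchers also blend the seam and
 * correct geometry; that's omitted here. */
#define W 320
#define H 240
#define PANO_W 2048

void stitch_frame(unsigned char pano[H][PANO_W], int *pano_x,
                  const unsigned char frame[H][W], int dx) {
    /* The rightmost dx columns of the frame are the ones not seen before. */
    for (int x = 0; x < dx && *pano_x < PANO_W; x++, (*pano_x)++)
        for (int y = 0; y < H; y++)
            pano[y][*pano_x] = frame[y][W - dx + x];
}
```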

Mobile-Platform Implementations

Several options are available for implementing automatic video processing on a mobile platform. A purely software approach is seldom satisfactory, though, given the platform's limited computing power, memory size, and bus speeds. Some video processing can be accomplished using software only, but many of the more desirable and challenging features are incompatible with this method.

A hardware-only solution is highly effective thanks to incredibly fast hardwired computation. However, hardware has a very long design, development, and production time scale, and the resulting product consumes both precious real estate and power.

Occasionally, engineers will compromise by using general-purpose processors or engines designed for image processing. This approach has the advantage of ready availability and a much shorter design cycle, but it is generally costly and power hungry. Thus it’s only justifiable on high-end imaging systems.

Seemingly, the best mobile-device solution takes from both sides of the equation, where part of the video tool is implemented in hardware and the remainder in software (Fig. 3). With careful design, this solution can be made compatible with many system architectures.

Underlying such an approach is the fact that many algorithms required for automated video tools require common inputs to function. For example, most software that manipulates images needs a measure of the gray-scale range in every frame. Because this data is a key input to many subsequent calculations and is computed frequently, considerable efficiencies in code size and execution speed are possible by implementing in hardware the sub-routine that calculates gray-scale range.
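
As a concrete example, here's what that gray-scale-range computation looks like in C. A hardwired version would perform the same two comparisons per pixel as the data streams from the sensor, rather than looping over a stored frame:

```c
/* One-pass gray-scale range (minimum and maximum luma) for a frame:
 * a small, frequently needed metric that many algorithms share. */
#define W 1280
#define H 720

typedef struct { unsigned char min, max; } GrayRange;

GrayRange gray_range(const unsigned char img[H][W]) {
    GrayRange r = { 255, 0 };
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            unsigned char p = img[y][x];
            if (p < r.min) r.min = p;
            if (p > r.max) r.max = p;
        }
    return r;
}
```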

In this divided hardware-software approach, sometimes referred to as hardware acceleration, the hardware portion derives fundamental metrics common to all algorithms. These are then reused by the software portions of the video tools installed on each platform.
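
The boundary between the two halves can be pictured as a small structure of per-frame metrics that the hardware fills in and every software tool reads. Everything in the sketch below is a hypothetical illustration, not a real driver interface:

```c
/* Hypothetical per-frame metrics block published by the hardware. */
typedef struct {
    unsigned char gray_min, gray_max;   /* gray-scale range           */
    int           motion_dx, motion_dy; /* global displacement vector */
    int           face_x, face_y;       /* tracked principal face     */
    int           face_w, face_h;
} FrameMetrics;

/* Hypothetical accessor: returns the metrics for frame n once ready.
 * The stabilizer consumes motion_dx/motion_dy, exposure control reads
 * gray_min/gray_max, and the face tools share the face rectangle, so
 * the pixels are scanned once instead of once per tool. */
const FrameMetrics *hw_get_metrics(int frame_no);
```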

Even on a high-end smart phone, the performance gains made via hardware acceleration are dramatic. In fact, execution can be some 100 times faster than running the same algorithms in software alone. Moreover, the hardware is common to many devices, yet the software can be easily changed or upgraded at a later date. Overall, hardware acceleration delivers the best performance and reduces the cost of ownership.