How Apple Uses AI To Produce Better Photos

Smartphone photography has advanced by leaps and bounds, driven first by miniaturization and incredible advances in camera sensors and lenses, and in more recent years, by rapid advancements in AI technology.

The need to support computational photography explains why dedicated AI hardware is increasingly making its way into smartphones, such as the Tensor processor in the Pixel 6, the Apple Neural Engine (ANE) in iPhones and iPads, and neural processing units in smartphones by Samsung and Huawei.

In a recent blog post, the ML team at Apple offered a glimpse of how the Camera app on iOS and iPadOS uses AI to create better photos. The unattributed post detailed how Apple developed a new neural architecture for image segmentation that is compact and efficient enough to run on-device with minimal impact on battery life.

Understanding image segmentation

A vital aspect of AI in photography revolves around obtaining a pixel-level understanding of each image. Indeed, the Camera app on iPhone and iPad devices relies on scene-understanding technologies to develop images, explained the author. For example, person segmentation and depth estimation are required to deliver capabilities such as Portrait Mode, while several other features consume image segmentation as an essential input.

In addition, person segmentation together with skin segmentation enables the app to recognize group shots of up to four people, allowing it to optimize contrast, lighting, and even skin tones for each person individually. Similarly, sky segmentation and skin segmentation enable denoising and sharpening algorithms to improve image quality in certain parts of the photo.
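To make the idea of per-person optimization concrete, here is a minimal sketch in Python with NumPy. It assumes a segmentation step has already produced a map of person IDs, and it applies a separate brightness gain to each person's pixels. The function name, the ID map, and the gain values are hypothetical illustrations, not Apple's pipeline.

```python
import numpy as np

def adjust_per_person(image, person_ids, gains):
    """Apply a separate brightness gain to each person's pixels.

    image:      float array of shape (H, W, 3), values in [0, 1]
    person_ids: int array of shape (H, W); 0 means "not a person"
    gains:      dict mapping person id -> multiplicative gain (hypothetical)
    """
    out = image.astype(np.float32).copy()
    for pid, gain in gains.items():
        region = person_ids == pid  # boolean mask selecting this person
        out[region] = np.clip(out[region] * gain, 0.0, 1.0)
    return out

# Toy 2x4 image: two people (ids 1 and 2) and a background column (id 0).
image = np.full((2, 4, 3), 0.5, dtype=np.float32)
person_ids = np.array([[0, 1, 1, 2],
                       [0, 1, 1, 2]])
result = adjust_per_person(image, person_ids, {1: 1.2, 2: 0.8})
```

Pixels outside every mask are left untouched, which is why the quality of the segmentation matters so much for edits of this kind.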

For its recently released iPhone 13, Apple introduced a new iteration of its proprietary “Smart HDR” technology, called Smart HDR 4. The need to improve “color, contrast, and lighting for each subject in a group photo” and deliver better Night mode photos meant that the AI team had to go beyond scene-level segmentation.

The team decided to implement panoptic segmentation, which unifies scene-level and subject-level understanding by predicting two attributes for each pixel: a categorical label and a subject label. Modeling both with a single network is more efficient than running separate models for each. Crucially, the author explained that the number of elements predicted by panoptic segmentation can eventually be scaled into the “hundreds”.
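As a rough illustration of what predicting two attributes per pixel means, the Python sketch below takes two small label maps, one with a category per pixel and one with a subject per pixel, and groups pixels into segments by their (category, subject) pair. The label values and the function name are assumptions made for this example.

```python
import numpy as np

def panoptic_segments(category, subject):
    """Group pixels by their (category, subject) pair.

    Returns a dict mapping (category, subject) -> boolean mask of shape (H, W).
    """
    pairs = np.stack([category.ravel(), subject.ravel()], axis=1)
    segments = {}
    for cat, sub in np.unique(pairs, axis=0):
        segments[(int(cat), int(sub))] = (category == cat) & (subject == sub)
    return segments

# Hypothetical labels: category 0 = sky, category 1 = person.
category = np.array([[0, 1, 1],
                     [0, 1, 1]])
# Subject 0 = no individual subject; 1 and 2 are two different people.
subject = np.array([[0, 1, 2],
                    [0, 1, 2]])
segments = panoptic_segments(category, subject)
```

Here both people share the "person" category, yet each ends up with a mask of their own, which is the combination of scene-level and subject-level understanding described above.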

Introducing HyperDETR

The Apple team chose the Detection Transformer (DETR) architecture as the baseline because it doesn’t require the postprocessing that most architectures need, and because it is highly efficient when evaluating regions of interest (RoIs), the candidate image regions that a network examines in object detection tasks.

Advantages of DETR aside, using it for panoptic segmentation introduced “significant computational complexity”. Indeed, an additional convolutional decoder module needed for panoptic segmentation became the dominant bottleneck at higher output resolutions. The team developed HyperDETR to mitigate this performance bottleneck.

“HyperDETR is a simple and effective architecture that integrates panoptic segmentation into the DETR framework in a modular and efficient manner… we completely decouple the convolutional decoder compute path from the Transformer compute path until the final layer of the network,” explained the author.
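One way to picture this decoupling is the loose NumPy sketch below. It assumes the convolutional decoder yields a feature map, the Transformer path yields one weight vector and bias per query, and the two meet only in a final linear layer that produces one mask per query. The shapes, names, and random inputs are assumptions for illustration; this is not Apple's implementation.

```python
import numpy as np

def dynamic_mask_head(features, query_weights, query_biases):
    """Combine the two compute paths in a single final layer.

    features:      (C, H, W) feature map, standing in for the conv decoder output
    query_weights: (Q, C) weights, standing in for the Transformer path output
    query_biases:  (Q,) biases, one per query
    Returns (Q, H, W) mask probabilities.
    """
    # Equivalent to a 1x1 convolution whose kernel differs per query.
    logits = np.einsum("qc,chw->qhw", query_weights, features)
    logits = logits + query_biases[:, None, None]
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid

rng = np.random.default_rng(0)
features = rng.standard_normal((8, 16, 16))   # C=8 channels, 16x16 resolution
query_weights = rng.standard_normal((5, 8))   # Q=5 queries
query_biases = rng.standard_normal(5)
masks = dynamic_mask_head(features, query_weights, query_biases)
```

Because the feature map is computed once and shared across all queries, adding more queries in this sketch only adds a cheap per-pixel dot product rather than another pass through the decoder.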

The HyperDETR network was trained on an internal dataset of around four million images with about 1,500 categorical labels. After further processing, additional training was performed using another 50,000 images that carried extremely high-quality annotations for a handful of categories that can now be predicted completely on-device: sky, person, hair, skin, teeth, and glasses.

It is worth noting that images were randomly resized, cropped, oriented, and rotated to simulate poorly-oriented captures. Additional optimizations were made to reduce file size and memory footprint when running on the ANE.
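Augmentation of this kind can be sketched in a few lines of Python with NumPy. The version below is a simplification under stated assumptions: resizing uses nearest-neighbor sampling, rotation is limited to multiples of 90 degrees, and the scale and crop ranges are hypothetical values. The same transform is applied to the image and its label map so the annotations stay aligned.

```python
import numpy as np

def resize_nearest(array, new_h, new_w):
    """Nearest-neighbor resize along the first two axes."""
    rows = np.arange(new_h) * array.shape[0] // new_h
    cols = np.arange(new_w) * array.shape[1] // new_w
    return array[rows][:, cols]

def random_augment(image, labels, rng, min_crop=0.6):
    """Randomly resize, crop, and rotate an image with its label map."""
    h, w = labels.shape
    # Random resize (scale range is an assumed value).
    scale = rng.uniform(0.5, 1.5)
    new_h = max(2, int(round(h * scale)))
    new_w = max(2, int(round(w * scale)))
    image = resize_nearest(image, new_h, new_w)
    labels = resize_nearest(labels, new_h, new_w)
    # Random crop covering at least `min_crop` of each side.
    ch = int(rng.integers(max(1, int(new_h * min_crop)), new_h + 1))
    cw = int(rng.integers(max(1, int(new_w * min_crop)), new_w + 1))
    top = int(rng.integers(0, new_h - ch + 1))
    left = int(rng.integers(0, new_w - cw + 1))
    image = image[top:top + ch, left:left + cw]
    labels = labels[top:top + ch, left:left + cw]
    # Random orientation: rotate by 0, 90, 180, or 270 degrees.
    k = int(rng.integers(0, 4))
    return np.rot90(image, k), np.rot90(labels, k)

rng = np.random.default_rng(42)
image = rng.random((32, 48, 3))
labels = rng.integers(0, 6, size=(32, 48))  # six hypothetical categories
aug_image, aug_labels = random_augment(image, labels, rng)
```

A production pipeline would more likely use an image library with proper interpolation and arbitrary-angle rotation; the point here is only that every geometric change to the image must be mirrored in the labels.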

Better photographs

The work on HyperDETR was attributed to various Apple employees specializing in AI, including Atila Orhon, Mads Joergensen, Morten Lillethorup, Jacob Vestergaard, and Vignesh Jagadeesh. You can read the original blog post here.

Advancements in computational photography continue to happen in leaps and bounds, with successive breakthroughs built on earlier concepts and architectures.

Indeed, the Apple author says the team was inspired by the idea of generating dynamic weights during inference from HyperNetworks, a meta-learning approach proposed by a trio of Google researchers in 2016. And the DETR architecture itself was proposed by researchers from Facebook AI in 2020.

For now, expect your photographs to get even better – even if you are not much of a shot with the camera.

Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [email protected].

Image credit: iStockphoto/DragonImages