Low Latency Video - Image Signal Processor

Written 2021-08-17

Tags: VideoInputPort Video ImageSignalProcessor

Once our video is digitized at the image sensor, the next stop is the input port (sometimes called the Video Input Port, or VIP) to the Image Signal Processor, or ISP. Both the VIP and ISP are often integrated components in a larger chip, as is the case for Raspberry Pi processors. The ISP's role is to convert the subpixels (often in a Bayer RGGB layout) into a much more common format like YUV or RGB for each spatial sample, to compute statistics for the current frame, and to apply a number of other corrections and tunings (an example baseline pipeline is described in "Architectural Analysis of a Baseline ISP Pipeline" by Hyun Sang Park).
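
To make the Bayer-to-RGB step concrete, here is a minimal sketch of the simplest possible demosaic, assuming an RGGB layout: each 2x2 quad is collapsed into a single RGB sample at half resolution. Real ISPs interpolate a full-resolution image with far better filters; the function name and layout here are purely illustrative.

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative only: collapse each 2x2 RGGB Bayer quad into one RGB
     * sample. A real ISP interpolates a full-resolution image instead.
     *
     * Bayer layout (even row: R G, odd row: G B):
     *   R G R G ...
     *   G B G B ...
     */
    static void demosaic_rggb_2x2(const uint8_t *bayer, size_t width,
                                  size_t height,
                                  uint8_t *rgb /* (width/2)*(height/2)*3 */)
    {
        for (size_t y = 0; y + 1 < height; y += 2) {
            for (size_t x = 0; x + 1 < width; x += 2) {
                uint8_t r  = bayer[y * width + x];           /* R */
                uint8_t g0 = bayer[y * width + x + 1];       /* G, red row  */
                uint8_t g1 = bayer[(y + 1) * width + x];     /* G, blue row */
                uint8_t b  = bayer[(y + 1) * width + x + 1]; /* B */

                uint8_t *out = &rgb[((y / 2) * (width / 2) + x / 2) * 3];
                out[0] = r;
                out[1] = (uint8_t)((g0 + g1) / 2); /* average the two greens */
                out[2] = b;
            }
        }
    }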

Notably, most of the adjustments made here are calculated using statistics or factors generated from previous frames. Only a few steps use information from neighboring pixels, and even then the reach is small: a 7x7 convolution needs just 7 pixels from the current row, from each of the 3 previous rows, and from each of the 3 following rows. This means the ISP can, and probably should, start processing either as pixels come in or in thin strips.
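
A sketch of why strip processing works, assuming a 7x7 filter: output row N depends only on input rows N-3 through N+3, so a finished row can be emitted as soon as three rows of lookahead have arrived. The ring buffer and function names below are hypothetical, and a box filter stands in for whatever kernel a real ISP would use.

    #include <stdint.h>
    #include <string.h>

    #define TAPS  7     /* 7x7 filter: 3 rows above, 3 below */
    #define WIDTH 1920  /* example line width */

    static uint8_t linebuf[TAPS][WIDTH]; /* last 7 input rows, as a ring */

    /* Stub for the downstream consumer of one finished output row. */
    static void emit_output_row(int row, const uint8_t *pixels, int width)
    {
        (void)row; (void)pixels; (void)width; /* hand off to next stage */
    }

    /* Called once per incoming camera line. After 7 lines have arrived,
     * every new line lets us finish the row 3 lines behind it - no need
     * to wait for the whole frame. */
    void on_line_received(int row, const uint8_t *pixels)
    {
        memcpy(linebuf[row % TAPS], pixels, WIDTH);

        if (row < TAPS - 1)
            return;             /* not enough lookahead yet */

        int out_row = row - 3;  /* center row of the 7-row window */
        uint8_t out[WIDTH];

        for (int x = 0; x < WIDTH; x++) {
            unsigned sum = 0, count = 0;
            for (int dy = -3; dy <= 3; dy++) {
                for (int dx = -3; dx <= 3; dx++) {
                    int xx = x + dx;
                    if (xx < 0 || xx >= WIDTH)
                        continue; /* skip pixels past the edge */
                    sum += linebuf[(out_row + dy) % TAPS][xx];
                    count++;
                }
            }
            out[x] = (uint8_t)(sum / count); /* box filter as a stand-in */
        }
        emit_output_row(out_row, out, WIDTH);
    }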

Common Pitfalls - VIP to ISP Integration

The VIP is often either tightly integrated into the ISP, or implemented as a simple DMA engine that reads pixels from the camera, buffers as needed, and writes into RAM. In either case, how often should the VIP trigger processing by the rest of the ISP? If the VIP waits until the frame is completely acquired before signalling the ISP to begin, nearly 1 cameraFrameTransmissionTime of latency is added to the oldest pixel from the camera, and 0.5 cameraFrameTransmissionTimes on average. For 60 FPS video, this is 8.3 ms average and 16.7 ms maximum.

Another approach is to signal the ISP each time a line or group of lines is ready, to forward the raw pixels directly into the ISP, or to integrate the VIP into the ISP and perform all buffering there. Any of these allows ISP processing to begin with much lower latency, as in the sketch below.
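
As a sketch of the line-granularity option (all structure and function names here are hypothetical, not any particular vendor's driver API): the VIP's DMA engine raises an interrupt every N lines, and the handler immediately hands that strip to the ISP instead of waiting for end-of-frame. Worst-case added latency drops from roughly one frame time to roughly N line times.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>

    #define LINES_PER_STRIP 16   /* signal the ISP every 16 lines */

    /* Hypothetical driver state - stand-ins for real registers/fields. */
    struct vip_state {
        uint32_t lines_done;     /* lines DMA'd into RAM so far */
        uint32_t frame_height;
        uint8_t *frame_base;     /* destination buffer in RAM */
        uint32_t stride;         /* bytes per line */
    };

    /* Stub: hand a strip of raw lines to the ISP front end. */
    static void isp_process_strip(const uint8_t *lines, uint32_t first_line,
                                  uint32_t line_count, uint32_t stride)
    {
        (void)lines; (void)first_line; (void)line_count; (void)stride;
    }

    /* Called from the VIP's "N lines complete" interrupt rather than
     * only at end-of-frame. */
    void vip_strip_irq(struct vip_state *vip)
    {
        uint32_t first = vip->lines_done;
        uint32_t count = LINES_PER_STRIP;

        if (first + count > vip->frame_height)  /* last, short strip */
            count = vip->frame_height - first;

        isp_process_strip(vip->frame_base + (size_t)first * vip->stride,
                          first, count, vip->stride);

        vip->lines_done += count;
        bool frame_complete = (vip->lines_done == vip->frame_height);
        if (frame_complete)
            vip->lines_done = 0;                /* ready for next frame */
    }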

Common Pitfalls - ISP to Codec Integration

On the output side of the ISP, we have a latency problem similar to the one on the input side. If the next stage (encoder or CPU) is signalled only after each frame is fully processed by the ISP, the first pixel sent from the camera again waits, this time under the ISP's control, for as long as the ISP takes to compute a frame. If the camera's frame rate and resolution push the ISP near its maximum throughput, that is almost another full frame time.

Again, the solution is to signal the encoder more often, perhaps at every macroblock-row boundary (16 pixels high), so that the encoder does not need to wait in the same way.
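
The output side looks much the same in sketch form (the callback name is illustrative, not a real encoder API): the ISP notifies the encoder each time a 16-pixel-high macroblock row of output is complete, so encoding of the top of the frame overlaps ISP processing of the bottom.

    #include <stdint.h>
    #include <stdbool.h>

    #define MB_HEIGHT 16  /* H.264-style macroblock rows are 16 pixels high */

    /* Stub: tell the encoder that output rows [first_row, first_row + count)
     * of the frame are now valid and may be encoded. */
    static void encoder_rows_ready(uint32_t first_row, uint32_t count)
    {
        (void)first_row; (void)count;
    }

    /* Called by the ISP each time it finishes an output line. The encoder
     * is kicked once per macroblock row instead of once per frame. */
    void isp_output_line_done(uint32_t line, uint32_t frame_height)
    {
        bool at_mb_boundary = ((line + 1) % MB_HEIGHT == 0);
        bool at_frame_end   = (line + 1 == frame_height);

        if (at_mb_boundary || at_frame_end) {
            uint32_t first = (line / MB_HEIGHT) * MB_HEIGHT;
            encoder_rows_ready(first, line + 1 - first);
        }
    }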

Common Pitfalls - userspace API

If we don't connect the ISP directly to the encoder, the ISP driver usually presents an API to programs instead. APIs such as V4L2 or UWP typically let programs block until a video frame is ready, then read or otherwise access it. This recreates the issue we saw at the input to the VIP: frames are temporally long objects, and by handing out only whole frames, userspace APIs like V4L2 and UWP impose a minimum of 1 frame of buffering.
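
For concreteness, here is what that frame-at-a-time pattern looks like under V4L2, as a minimal sketch: format negotiation, extra buffers, and all error handling are omitted. The point is that VIDIOC_DQBUF returns only once an entire frame has been captured, so at least one frame of buffering is baked into the API.

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/videodev2.h>

    /* Minimal V4L2 capture loop (sketch: single buffer, no error checks). */
    int main(void)
    {
        int fd = open("/dev/video0", O_RDWR);

        struct v4l2_requestbuffers req = {0};
        req.count  = 1;
        req.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        req.memory = V4L2_MEMORY_MMAP;
        ioctl(fd, VIDIOC_REQBUFS, &req);

        struct v4l2_buffer buf = {0};
        buf.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        buf.memory = V4L2_MEMORY_MMAP;
        buf.index  = 0;
        ioctl(fd, VIDIOC_QUERYBUF, &buf);
        void *data = mmap(NULL, buf.length, PROT_READ, MAP_SHARED,
                          fd, buf.m.offset);
        (void)data;

        ioctl(fd, VIDIOC_QBUF, &buf);
        enum v4l2_buf_type type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        ioctl(fd, VIDIOC_STREAMON, &type);

        for (;;) {
            /* Blocks until the driver has captured a COMPLETE frame - the
             * first line of this frame has already waited almost a full
             * frame time by the time we wake up. */
            ioctl(fd, VIDIOC_DQBUF, &buf);

            /* ... process buf.bytesused bytes at `data` ... */

            ioctl(fd, VIDIOC_QBUF, &buf);  /* hand the buffer back */
        }
    }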