Detecting objects in video: a comprehensive guide for business

Deloitte’s experts estimate that digital lean transformation can lead to savings of over US$20 million yearly. Computer vision services are a crucial part of the artificial intelligence (AI) systems behind such transformation: they allow businesses to be more efficient while the machine works independently. Perhaps the most recognizable part of such vision services is AI image recognition, the basis for detecting objects in video. This enables the machine to locate, identify and classify objects.

What is video object detection?

Object detection is often called object recognition, object identification, or image detection. The aim of object detection, as a field of machine learning, is to develop computational models that provide the most basic information needed by computer vision applications. In other words, it asks, “What objects are where?”

Video Object Detection (VOD) mimics the human visual cortex. It allows machines to analyze video frame by frame and identify the objects present in each frame. Thus, object detection in video works similarly to AI image recognition. Such a tool aims to locate and identify the objects that appear in the input video.

Object detection allocates instances of visual objects into certain classes (for example, humans, animals, cars, or buildings).

Object detection vs. Object tracking

With object detection alone, you don’t know how the objects in two different frames relate to each other. Object tracking, in contrast, lets you recognize the same object across frames. This helps you, for example, to count vehicles, because you know which specific objects appeared and disappeared, and when.
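
To make the difference concrete, here is a minimal sketch of how tracking can be layered on top of per-frame detections. It assumes the detector already returns bounding boxes as (x1, y1, x2, y2) tuples; the `track` function, the greedy matching strategy, and the 0.3 overlap threshold are illustrative choices, not a description of any particular product.

```python
from itertools import count


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def track(frames, min_iou=0.3):
    """Give each detection a persistent ID by greedy IoU matching
    against the boxes of the previous frame (hypothetical helper)."""
    next_id = count(1)
    previous = {}  # track ID -> box in the previous frame
    results = []
    for boxes in frames:
        current = {}
        for box in boxes:
            # Score every previous track that has not been matched yet.
            candidates = [
                (iou(box, prev_box), tid)
                for tid, prev_box in previous.items()
                if tid not in current
            ]
            best_iou, best_id = max(candidates, default=(0.0, None))
            track_id = best_id if best_iou >= min_iou else next(next_id)
            current[track_id] = box
        results.append(current)
        previous = current
    return results


# Made-up detections for three consecutive frames.
frames = [
    [(10, 10, 50, 50)],
    [(14, 10, 54, 50), (200, 80, 260, 140)],
    [(18, 10, 58, 50), (205, 80, 265, 140)],
]
tracked = track(frames)
unique_ids = {tid for frame in tracked for tid in frame}
print(len(unique_ids), "distinct objects seen")
```

Detection alone would report five boxes here; with the IDs attached, the same footage turns out to contain two objects, which is the information a vehicle counter needs.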

How does video object detection work?

As with every AI project, video object detection systems go through the stages of the AI life cycle. First, you establish your business goals and consult with computer vision experts to determine if your ideas are feasible. Supposing that VOD is a part of this solution, you would need to dive into the technical part. So, to train a machine to detect entities in a video (like any other AI task), you have to:

Collect data
First, we need to create a library of video footage, preferably using similar hardware and setting as for the actual use case. When the dataset has been collected, the machine learning team, in collaboration with the business side, will create guidelines for labeling the objects of interest.
Label data
In the labeling stage, labelers manually draw bounding boxes around the entities we want the AI to identify, aiming to teach the machine to recognize the objects. These two steps, data collection and labeling, are crucial for correct operations. Investing time, human resources, and money to obtain the best-labeled dataset possible is vital. This process can be assisted by clever solutions that allow the labelers to label the objects in a single frame and have the machine suggest the same object in consecutive frames by using object tracking methods.
Train a model
The algorithm is then trained on the labeled data. This step is critical to ensure adequate accuracy. After this training phase, the performance of the model should be evaluated and validated on a dataset that was not used to train the machine. This allows us to estimate the model’s performance in a real-world setting with unseen data. Without experience, partitioning the dataset and drawing conclusions about the model’s likely real-world performance can be hard to get right.
Test the model
The model goes through real-life testing to determine whether the task is done appropriately and whether there is worth in proceeding. Then, the improvements happen based on feedback resulting from running different experiments. The testing ensures that you have the optimal solution with no missed edge cases and reduces the risk of making a significant investment in a poorly working system.
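
One common way to set aside the validation dataset mentioned in the training step is to split by whole videos rather than by individual frames, since neighboring frames of the same video are nearly identical. The sketch below assumes each labeled frame carries a video ID; `split_by_video`, the ID format, and the 20% validation share are hypothetical.

```python
import random


def split_by_video(video_ids, validation_share=0.2, seed=0):
    """Split whole videos (not single frames) into train and validation sets."""
    ids = sorted(set(video_ids))
    random.Random(seed).shuffle(ids)  # seeded, so the split is reproducible
    n_val = max(1, round(len(ids) * validation_share))
    return set(ids[n_val:]), set(ids[:n_val])


# Made-up labeled frames: (video ID, frame number).
frames = [(f"video_{v:02d}", f) for v in range(10) for f in range(100)]

train_videos, val_videos = split_by_video(vid for vid, _ in frames)
train_frames = [fr for fr in frames if fr[0] in train_videos]
val_frames = [fr for fr in frames if fr[0] in val_videos]
print(len(train_frames), "training frames,", len(val_frames), "validation frames")
```

Because no video contributes frames to both sets, the validation score is closer to what the model would achieve on footage it has never seen.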

As with image recognition, you can use ready-made solutions on various platforms or implement off-the-shelf solutions yourself. But platforms and off-the-shelf systems can’t handle all use cases as they are. If your data are domain-specific and/or you need to detect something the platforms aren’t trained to do, then a custom solution may have to be built to get the required accuracy.

How is VOD used in today’s business?

As mentioned above, object detection is one of the fundamental elements and challenges of computer vision. In other words, it forms the basis of other downstream computer vision tasks, for instance, object tracking in video. Specific object detection use cases include, for instance, number-plate recognition; people counting; and pedestrian, face, text, and pose detection.

Video surveillance

Video surveillance systems, empowered with object-tracking models, can enhance security and efficiency. For example, retailers equip their shops and warehouses with cameras to warn security in case of theft or other criminal behavior. Moreover, manufacturers use computer vision and video processing to monitor the safety and performance of their employees, checking for possible collisions or people entering dangerous zones on a factory floor or in a warehouse.

Computer vision system is detecting objects in video to count boxes in a warehouse.

Face detection

Apart from the controversial use case of face detection in the public sector, there are many other applications of this technology. For example, looking at your camera is enough to navigate on your phone, thanks to real-time object detection in video and face recognition. Likewise, trying a new haircut via some apps like Snapchat or Instagram also works with filters powered by VOD systems.

The girl is trying a new haircut and hair colour in the app, empowered with object detection in video.

Data analysis

Object detection models allow manufacturers to automatically count products, parts, and boxes in any video file. Afterward, managers can improve the monitored manufacturing processes. Event makers can manage capacity limits and crowd safety in real time via the automated counting of people entering or exiting. This application allows the reliable measuring of foot and vehicle traffic to determine both visitation and visitation patterns over time.
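
Counting people entering or exiting is often done by checking when a tracked object crosses a virtual line in the image. The following is a simplified sketch that assumes tracking has already produced, for each object, a list of center points per frame; the function name, the horizontal line, and the convention that downward movement means "entering" are assumptions made for illustration.

```python
def count_crossings(tracks, line_y):
    """Count tracked objects whose center moves across a horizontal line.

    tracks maps a track ID to its list of (x, y) center points, one per frame.
    Downward movement (increasing y) is treated as "entering".
    """
    entered = exited = 0
    for points in tracks.values():
        # Compare each position with the next one in time.
        for (_, y_before), (_, y_after) in zip(points, points[1:]):
            if y_before < line_y <= y_after:
                entered += 1
            elif y_after < line_y <= y_before:
                exited += 1
    return entered, exited


# Made-up tracks for three people.
tracks = {
    1: [(100, 40), (102, 90), (101, 140)],  # walks in
    2: [(300, 150), (298, 95), (299, 30)],  # walks out
    3: [(200, 20), (205, 35), (210, 45)],   # never reaches the line
}
entered, exited = count_crossings(tracks, line_y=100)
print("entered:", entered, "exited:", exited)
```

The running difference between the two counters gives the current occupancy, which is what a capacity limit would be checked against.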

By tracking the movement of people in sports, coaches can pay attention to the weaknesses of a player’s performance. In science, monitoring and marking down bacteria movements could be done automatically.

Traffic analysis

Video object detection automatically identifies and counts vehicles entering and exiting particular scenes, roadways, and car parks, helping people to avoid traffic jams or full car parks. For example, to assess traffic volume, the Estonian transport administration counts vehicles and uses such data for road design, construction, and maintenance.

Contactless checkout

Just-walk-out shops are often considered the future of retail. For example, Amazon combines AI, computer vision models that detect objects, and data pulled from multiple sensors to ensure that customers are only charged for the goods they pick up, registering and tracking items as they’re taken from shelves.

Text detection

Automatic number plate recognition is software trained to read different license plates, with an emphasis on local standards. It accurately and quickly (within seconds) detects the characters and digits in the input video. Hence, it can enable contactless access, opening gates automatically for authorized vehicles. The police also use it to locate vehicles they are tracking. For example, the police force can catch an offender in a few hours by identifying the car by its number plate caught on a roadside camera. The roadside camera analysis can still be manual, but number detection is automated.

Benefits of detecting objects in video

Whether applying for one feature or a fully upgraded business process, custom computer vision solutions can significantly increase efficiency while decreasing human agents’ efforts and reducing interventions.

Enhanced security

Multiple security applications in video surveillance are based on object detection. The machine can detect suspicious objects, spot people in restricted areas, or flag violations of a specific uniform policy in dangerous manufacturing processes. Computer vision can also automate inspection tasks in remote locations.

Quality control

Every deviation from a “golden standard” may be flagged for a further human check. For example, computer vision detects and AI measures how long every single manufacturing cycle takes. If those periods significantly differ from the “golden standard,” it may indicate mistakes in the process. AI then flags those particular cycles for a human supervisor, who can double-check the video to validate or fix the operation.
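
The flagging logic itself can be very simple once the vision system has measured each cycle. Below is a minimal sketch, assuming cycle durations in seconds are already available; the 42-second golden standard and the 15% tolerance are made-up values.

```python
def flag_cycles(durations, golden_seconds, tolerance=0.15):
    """Return the indices of cycles whose duration deviates from the
    golden standard by more than the given relative tolerance."""
    return [
        i
        for i, seconds in enumerate(durations)
        if abs(seconds - golden_seconds) / golden_seconds > tolerance
    ]


# Made-up cycle durations measured from video, in seconds.
cycle_durations = [41.8, 42.5, 55.0, 43.1, 30.2]
flagged = flag_cycles(cycle_durations, golden_seconds=42.0)
print("cycles to review:", flagged)
```

Only the flagged cycles need to be replayed by a human supervisor, so the tolerance directly controls how much footage is reviewed.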

Efficiency growth

Counting products and boxes is not the only way to drive efficiency growth. For example, object detection is used in agriculture for counting and monitoring animals to provide better care and, hence, better business results.

Faster and better decision-making

Object detection has resulted in many breakthroughs in the medical community. For instance, as medical diagnostics heavily rely on images, scans, and photographs, video object detection has become extremely useful for diagnosing diseases. Among the many computer vision application examples is video pattern recognition, which can identify any repetitions or asymmetries in a newborn’s sleeping movements. If these patterns appear, it could be symptomatic of a problem. Consequently, computer vision technology improves decisions made regarding treatment.

Watch out for video object detection pitfalls

A well-trained machine vision system can identify the existence of similar objects in new data within certain parameters of tolerance. However, poor data selection, inappropriate model architecture, training process, or other factors can decrease the algorithm’s efficiency dramatically due to ‘unseen’ input (i.e., data similar to but not identical to the training dataset).

Occlusion and reacquisition

If video object detection is performed by splitting a video into frames and applying an image recognition algorithm to each frame, issues with the visual data could worsen the performance of the entire object detection model.

For example, occlusion and reacquisition challenge efficiency. If video object detection recognizes a human who then briefly becomes “invisible” because of a pedestrian passing by, the system will need time to figure out that it has “lost” the subject and initialize reacquisition, thus making video detection slower and less effective.
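
One way trackers soften this problem is to keep a lost track alive for a few frames, so that an object reappearing nearby gets its old identity back. The sketch below is a simplified center-point tracker written for illustration; `track_with_grace`, the distance threshold, and the `max_missed` parameter are assumptions rather than a standard API.

```python
def track_with_grace(frames, max_distance=30.0, max_missed=5):
    """Center-point tracker that keeps a lost track alive for a few frames.

    frames is a list of per-frame detections, each a list of (x, y) centers.
    Returns, for every frame, a dict mapping track ID to center.
    """
    tracks = {}  # track ID -> (last known center, frames since last seen)
    next_id = 1
    history = []
    for centers in frames:
        assigned = {}
        for cx, cy in centers:
            # Find the nearest known track that is still unassigned.
            best_id, best_dist = None, max_distance
            for tid, ((tx, ty), _) in tracks.items():
                dist = ((cx - tx) ** 2 + (cy - ty) ** 2) ** 0.5
                if tid not in assigned and dist <= best_dist:
                    best_id, best_dist = tid, dist
            if best_id is None:
                best_id, next_id = next_id, next_id + 1
            assigned[best_id] = (cx, cy)
        # Age the tracks that were not seen and drop those missing too long.
        updated = {tid: (center, 0) for tid, center in assigned.items()}
        for tid, (center, missed) in tracks.items():
            if tid not in assigned and missed + 1 <= max_missed:
                updated[tid] = (center, missed + 1)
        tracks = updated
        history.append(assigned)
    return history


# A person is detected, hidden for two frames, then detected again.
frames = [[(50, 50)], [(55, 50)], [], [], [(65, 52)]]
history = track_with_grace(frames)
print(history[-1])
```

With the default grace period the person keeps the same ID after the occlusion; with `max_missed=1` the track is dropped and the reappearing person is counted as a new object.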

Motion blur

However, a subject can be lost not only because of occlusion but also because of frame disruption caused by camera or subject movement (or both). Moving objects can produce motion blur, which makes frames look out of focus and challenges the ability of the recognition framework to identify objects in the video.

Depending on how the model functions, that is, whether it analyzes the video frames separately or generalizes across all the available frames, video detection could identify the items even with motion blur.

Motion blur example, disturbing the process of detecting objects in video.

For the system just parsing frames, such understanding is impossible because each frame is treated as a complete (and settled) episode. Moreover, repeatedly registering the same object on a per-frame basis is resource-demanding.

But object tracking, in some cases, allows full picture restoration, and, besides, it demands less computational power to identify and track a subject since, from frame #2 onwards, the system knows what it is looking for.

Avoiding Overfitting

Analyzing moving pictures might not be as easy as analyzing still images. Overfitting happens when the model fits too closely to irrelevant details in the training dataset – hence the name. It can occur when the training video is different from the data generated in the actual use case scenario or if you have too little data for training, causing the machine to learn patterns that aren’t descriptive of what you’re looking for.

For example, suppose you’re training the model to recognize cars, but all the vehicles in the training video data are trucks. As a result, it could easily overfit to the shape of a truck and not recognize regular passenger cars as cars. The same could happen with color if you’re training it to identify people, but all people in the training video file are wearing bright blue.

To tackle this problem at the core, remember the first step of making an object detection model: data collection. The more diverse the training data are, the less overfitting will occur. The dataset could be enhanced, for example, for video detection used in monotonous video surveillance, when the same objects constantly come and go. Solutions, including regularization, data augmentation, and collecting better data sets, can prevent this overfitting phenomenon from happening.
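
As an illustration of data augmentation, the sketch below mirrors a frame horizontally and adjusts its bounding boxes to match, which is one of the simplest ways to add variety to a dataset. The image is represented as a plain list of pixel rows to keep the example self-contained; in practice an image library would be used, and `flip_horizontal` is a hypothetical helper.

```python
def flip_horizontal(image, boxes):
    """Mirror an image (a list of pixel rows) and its bounding boxes.

    Boxes are (x1, y1, x2, y2) in pixel coordinates, with x2 exclusive.
    """
    width = len(image[0])
    flipped_image = [row[::-1] for row in image]
    # The left edge becomes the right edge and vice versa.
    flipped_boxes = [(width - x2, y1, width - x1, y2) for x1, y1, x2, y2 in boxes]
    return flipped_image, flipped_boxes


# A tiny 6x4 "frame" with one labeled object (the block of ones).
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
boxes = [(1, 1, 3, 3)]
new_image, new_boxes = flip_horizontal(image, boxes)
print(new_boxes)
```

The important detail is that labels must be transformed together with the pixels; an augmented frame with stale boxes would teach the model the wrong locations.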

Real-time video object detection

Current industry imperatives have moved towards the development of real-time detection systems, with offline frameworks effectively migrating to a ‘dataset development’ role for low- or zero-latency detection systems.

Some types of neural networks are well-suited to video object detection. In 2017, a collaboration between the Georgia Institute of Technology (GIT) and Google Research offered a video object detection method that achieved an inference speed of up to 15 frames per second operating on a mobile CPU.

However, the iconic example in the field is YOLO (You Only Look Once), a state-of-the-art real-time object detection system. One of its recent versions, the YOLOv7 model, is able to detect even very far-off and small objects in a video. Nevertheless, this ready-made model might not be enough for business problem solving, even though YOLOv7 is the first in the YOLO family to “read” human movement.

Best practice for video object detection

Training data for object-detecting models should be as close as possible to data the machine will process in real life. For example, if the use case supposedly produces low-resolution and blurry videos (e.g., CCTV footage), the data for AI training should also feature videos with these qualities. In general, training video footage should also include multiple angles and resolutions.
The more different objects you want to recognize, the more data you will need.
If the objects to identify have a high variance (e.g., many different cars worldwide), you should also ensure that the data you collect are varied. Machines tend to learn patterns very well. Thus, if object detection only sees some types of an object, it will not necessarily recognize a significantly different instance.
Teaching a machine to recognize the minute differences between similar entities requires more data on each object.
Invest in labeling. The quality of labeling can make or break a machine vision system, and, most of the time, quality beats quantity. If this means you have to label everything twice to ensure everything is done correctly, that is a good investment.
Start small. If you aim to recognize 100 different objects, but it is possible to start with 10, then do it. You invest less in data collection and labeling, and you can quickly get an idea of the accuracies you can achieve and the possible complexities in collecting the data you need.
Ward off data leakage. During model training, the object-detecting algorithm might use biased information, which will be unavailable during real-life predictions. It corrupts the results, performing over-optimistically during training, validation, and test stages but poorly in making predictions for unseen data. Some leakage examples result from a bias caused by camera viewing angle or light conditions (morning/evening) or based upon videos with certain labels from specific regions, language groups, commentators, et cetera.
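
A quick sanity check against the kind of leakage described in the last tip is to look for recording conditions shared between the training and test sets. The sketch below assumes every clip comes with metadata such as a camera ID and a recording day; these field names and the `shared_sources` helper are hypothetical.

```python
def shared_sources(train_clips, test_clips, keys=("camera_id", "recording_day")):
    """List metadata values that occur in both the training and the test set."""
    overlap = {}
    for key in keys:
        common = {clip[key] for clip in train_clips} & {clip[key] for clip in test_clips}
        if common:
            overlap[key] = sorted(common)
    return overlap


# Made-up clip metadata.
train_clips = [
    {"camera_id": "cam_A", "recording_day": "2024-03-01"},
    {"camera_id": "cam_B", "recording_day": "2024-03-02"},
]
test_clips = [
    {"camera_id": "cam_B", "recording_day": "2024-03-05"},
    {"camera_id": "cam_C", "recording_day": "2024-03-06"},
]
leaks = shared_sources(train_clips, test_clips)
print(leaks)
```

Whether an overlap is actually a problem depends on the use case: if the deployed system will run on the very same cameras, sharing them between sets may be acceptable, but the check makes that decision explicit.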

Using Video Object Detection is becoming increasingly important in today’s world. However, AI researchers are still working hard on making this technology more accurate and closer to how human beings would analyze their environment.

Having delivered over 80 artificial intelligence projects involving cutting-edge computer vision applications, we are sure that AI-enhanced data processing provides considerable efficiency, productivity, and profitability benefits.

MindTitan builds tailor-made computer vision solutions that help to solve complicated business problems when ready-made solutions cannot help or when complex integration with other AI models is required.
