Detecting objects in video:
a comprehensive guide 2022

Irina Kolesnikova
September 12th, 2022

The Titan detecting objects in video

According to Deloitte experts' estimates, a digital lean transformation can yield an improvement of US$20 million yearly in earnings before interest, taxes, depreciation, and amortization (EBITDA), a 15% decrease in costs per line per year, and an 11% yearly improvement in overall equipment effectiveness (OEE). As a part of artificial intelligence (AI) systems, computer vision services let businesses operate more efficiently while the machine works independently. AI image recognition is an essential computer vision task and the basis for video object detection, enabling the machine to locate, identify, and classify objects. Thus, detecting objects in video could be profitable for your business.

What is video object detection?

Object detection is often called object recognition, object identification, or image detection – these terms are synonymous. As a field of machine learning, object detection develops computational models that provide the most basic information computer vision applications need: “What objects are where?”

Video Object Detection (or VOD), in a way, mimics the function of the human visual cortex, providing machines with the ability to analyze video frame by frame and identify the objects present within them. In other words, object detection in video works much the same way AI image recognition does. Such a tool aims to locate and identify objects in the input video.

Object detection allocates instances of visual objects to certain classes (for example, humans, animals, cars, or buildings).
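Conceptually, a detector's output for each frame is a list of (class, bounding box, confidence) triples: “what objects are where”. A minimal sketch of that data shape (the `Detection` type and the stub detector below are illustrative, not a real library API):

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str          # object class, e.g. "person" or "car"
    box: tuple          # (x1, y1, x2, y2) pixel coordinates
    confidence: float   # model's score in [0, 1]

def detect_frame(frame):
    """Stub detector: a real model would run inference here."""
    # Illustrative fixed output for one frame.
    return [Detection("person", (40, 30, 90, 180), 0.92),
            Detection("car", (120, 60, 300, 170), 0.88)]

detections = detect_frame(frame=None)
for d in detections:
    print(f"{d.label} at {d.box} ({d.confidence:.0%})")
```

Real detectors differ in architecture, but virtually all of them reduce each frame to a list like this, which downstream tasks (tracking, counting, alerting) then consume.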


Object detection vs. Object tracking

Object detection alone tells you nothing about the relation between objects in two different frames. Object tracking, by contrast, lets you recognize the same object across frames, which helps you, for example, to count vehicles, because you know exactly which objects appeared and which disappeared.
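The difference can be sketched in a few lines. Below is a deliberately simple, assumption-laden illustration of tracking: detections in consecutive frames are linked by bounding-box overlap (IoU), so the same object keeps the same ID; plain detection would just produce two unrelated lists of boxes.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def link_frames(prev_tracks, detections, next_id, threshold=0.3):
    """Greedily match new detections to existing track IDs by IoU."""
    tracks = {}
    unmatched = dict(prev_tracks)
    for box in detections:
        best_id, best_iou = None, threshold
        for tid, prev_box in unmatched.items():
            score = iou(box, prev_box)
            if score > best_iou:
                best_id, best_iou = tid, score
        if best_id is None:               # new object enters the scene
            best_id, next_id = next_id, next_id + 1
        else:
            del unmatched[best_id]
        tracks[best_id] = box
    return tracks, next_id

# Frame 1: two cars detected; frame 2: the same cars, slightly moved.
f1 = [(0, 0, 50, 50), (100, 0, 150, 50)]
f2 = [(5, 0, 55, 50), (104, 0, 154, 50)]
t1, nid = link_frames({}, f1, next_id=0)
t2, nid = link_frames(t1, f2, nid)
print(t2)  # IDs 0 and 1 persist across both frames
```

Production trackers use stronger matching (motion models, appearance features, Hungarian assignment), but the principle of carrying identities across frames is the same.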

How does video object detection work?

Like any other artificial intelligence project, video object detection systems go through the stages of the AI life cycle. First, you define your business goals and consult with computer vision experts on whether your ideas are feasible. But to clarify how VOD works, let’s dive into the technical part. To train a machine to detect entities in a video (as with any other AI task), you have to:

  1. Collect data
    First, we need to create a library of video footage, preferably using similar hardware and settings as in the actual use case. Once the dataset has been collected, the machine learning team, in collaboration with the business side, will create guidelines for labeling the objects of interest.
  2. Label data
    In the labeling stage, labelers manually draw bounding boxes around the entities we want the AI to identify, aiming to teach the machine to recognize the objects. These two steps, data collection and labeling, are crucial for correct operations; consider spending time, human resources, and money to obtain the best-labeled dataset possible.
    This process can be assisted by clever solutions that allow the labelers to label the objects in a single frame and have the machine suggest the same object in consecutive frames by using object tracking methods.
  3. Train a model
    The algorithm is then trained on the labeled data. This step is critical to ensure adequate accuracy. After this training phase, the performance of the model should be evaluated and validated on the part of the labeled data set that was not used to train the machine. This allows us to estimate the model’s performance in a real-world setting on unseen data. Depending on the data, partitioning the dataset and drawing conclusions about its likely performance in a real-world setting might be a complex task and hard to get right without experience.
  4. Test the model
    The model goes through real-life testing to determine whether it performs the task well enough to be worth deploying. Improvements then follow, based on feedback from running different experiments. Testing ensures that you have the optimal solution and have not missed edge cases, and it reduces the risk of committing significant investments only to discover that the model performs poorly.
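Step 3's evaluation split is easy to get wrong with video: consecutive frames of one clip are nearly identical, so holding out individual frames inflates validation accuracy. A minimal sketch of the safer approach, splitting by whole clip (the clip names here are made up):

```python
import random

def split_by_clip(frame_labels, val_fraction=0.2, seed=42):
    """Hold out whole clips, never individual frames, so that
    near-duplicate frames cannot leak into the validation set."""
    clips = sorted({clip for clip, _ in frame_labels})
    random.Random(seed).shuffle(clips)
    n_val = max(1, int(len(clips) * val_fraction))
    val_clips = set(clips[:n_val])
    train = [(c, f) for c, f in frame_labels if c not in val_clips]
    val = [(c, f) for c, f in frame_labels if c in val_clips]
    return train, val

# (clip name, frame index) pairs from five hypothetical recordings
labels = [(f"cam{c}", f) for c in range(5) for f in range(100)]
train, val = split_by_clip(labels)
train_clips = {c for c, _ in train}
val_clips = {c for c, _ in val}
print(train_clips & val_clips)  # set() — no clip spans both sets
```

In practice the grouping key may also need to cover camera, location, or recording day, depending on what counts as “unseen data” for the use case.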

Like with images, you can use ready-made solutions on various platforms or implement off-the-shelf solutions yourself. But platforms and off-the-shelf systems can’t handle all use cases as they are. This happens when your data is domain-specific and/or you need to detect something the platforms aren’t trained for. In this case, a custom solution will often have to be built to reach the required accuracy (with the help of experts in machine vision services).

How is VOD used in today’s business?

As mentioned above, object detection is one of the fundamental problems of computer vision. In other words, it forms the basis of other downstream machine vision tasks, for instance, object tracking in video. Use cases of specific object detection include, for instance, pedestrian detection, people counting, face detection, text detection, pose detection, and number-plate recognition.

Video surveillance

Video surveillance systems, empowered with object detection models, can enhance security and efficiency. For example, retailers equip their shops and warehouses with cameras that alert security in case of theft or incivility. Computer vision in manufacturing, meanwhile, uses video processing to monitor the safety and performance of employees, checking for possible collisions or people entering dangerous zones on a factory floor or in a warehouse.

A computer vision system detecting objects in video to count boxes in a warehouse.

Face detection

Apart from the controversial use of face detection in the public sector, there are many other applications of this technology. For example, looking at your camera is enough to unlock and navigate your phone, thanks to real-time video object detection and face recognition. Likewise, trying a new haircut in some applications also relies on VOD systems.

Moreover, VOD also allows you to use filters on some apps like Snapchat or Instagram.

A girl trying a new haircut and hair colour in an app powered by object detection in video.

Data analysis

Object detection models allow manufacturers to automatically count products, parts, and boxes in any video file. Afterward, managers can improve the monitored manufacturing processes. Event makers can manage capacity limits and crowd safety in real-time by automated counting of people entering or exiting venues.

This application allows reliable measurement of foot and vehicle traffic to determine visitation patterns over time. By tracking the movement of athletes, coaches get an opportunity to spot weaknesses in performance. In science, monitoring and recording bacteria movements can sometimes be done automatically.

Traffic analysis

Video detection automatically identifies and counts vehicles entering and exiting particular locations to study roadways and car parks, enabling people to improve their usage and avoid traffic jams or overcrowded parking. For example, to learn traffic volumes, the Estonian transport administration counts vehicles and later uses the data for road design, construction, and maintenance.

Contactless checkout

Just-walk-out shops are often considered the future of retail. For example, Amazon uses a combination of AI, object-detecting computer vision models, and data pulled from multiple sensors to ensure that customers are charged only for the items they pick up, by detecting and tracking items as they’re taken from shelves.

Text detection

Automatic number plate recognition is software trained to read different license plates, with an emphasis on local standards. It detects symbols and digits in input video accurately and within seconds. It can control barriers and gates, opening them automatically for authorized vehicles and enabling contactless access. The police also use it to locate vehicles they are tracking for one reason or another. For example, the police can catch an offender within a few hours by identifying him on a roadside camera and then finding his car by its number plate. The roadside camera analysis can still be manual, but number detection is automated.


Benefits of detecting objects in video

Whether applied for one feature improvement or fully upgrading business processes, custom computer vision solutions can significantly increase efficiency while decreasing human agents’ efforts and interventions.

Enhanced security

The machine can detect suspicious objects or violations of uniform requirements in dangerous manufacturing processes. Multiple security applications in video surveillance are based on object detection: for example, spotting people in restricted areas, preventing suicides, or automating inspection tasks in remote locations.

Quality control

Every deviation from a “golden standard” may be flagged for a further human check. For example, computer vision detects each manufacturing cycle and AI measures how long it takes. If those durations differ significantly from the “golden standard”, it may indicate mistakes in the process. AI then flags those particular cycles for a human supervisor, who can review the video to validate or fix the operation.
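The flagging step described above is straightforward once the cycle durations have been measured from video. A minimal sketch (the golden-standard value and tolerance are made-up parameters):

```python
def flag_cycles(durations, golden=30.0, tolerance=0.15):
    """Flag manufacturing cycles whose measured duration deviates
    from the 'golden standard' by more than the given tolerance."""
    flagged = []
    for i, d in enumerate(durations):
        if abs(d - golden) / golden > tolerance:
            flagged.append((i, d))
    return flagged

# Cycle durations in seconds, as measured from video (illustrative)
measured = [29.5, 31.0, 41.2, 30.4, 22.1]
print(flag_cycles(measured))  # [(2, 41.2), (4, 22.1)]
```

The hard part in practice is the upstream vision task of segmenting the video into cycles; the flagging itself stays this simple.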

Efficiency growth

It is not only counting products and boxes that influences efficiency growth. For example, object detection is used in agriculture for counting and monitoring animals to provide better care, hence better business results.

Faster and better decision-making

For instance, object detection has enabled many breakthroughs in the medical community. Since medical diagnostics rely heavily on images, scans, and photographs, video object detection has become extremely useful for diagnosing diseases. One example is video pattern recognition that identifies whether a sleeping newborn moves asymmetrically or repetitively; such patterns could indicate a dysfunction, possibly cognitive. Consequently, computer vision technology improves treatment decisions.

Watch out for video object detection pitfalls

Being well-trained, a machine vision system can identify similar objects in new data within certain parameters of tolerance. However, poor data selection, an inappropriate model architecture, a flawed training process, or other factors can dramatically decrease the algorithm’s efficiency on ‘unseen’ input (i.e., data similar to but not identical to the training dataset).

Occlusion and reacquisition

If video object detection is performed by splitting a video into frames and applying an image recognition algorithm to each frame, issues with the visual data could worsen the performance of the entire object detection model.

For example, occlusion and reacquisition challenge efficiency. If a video object detection system recognizes a person who then briefly disappears behind a passing pedestrian, the system needs time to figure out that it has ‘lost’ the subject and to initialize reacquisition, making video detection slower and less effective.

Motion blur

A subject can be lost not only to occlusion but also to frame disruption caused by camera movement, subject movement, or both. Moving objects can produce motion blur, defocusing the video beyond the recognition framework’s ability to identify objects.

Depending on whether the model analyzes video frames separately or generalizes across all available images, it may still be able to identify items despite motion blur.

An example of motion blur disrupting the process of detecting objects in video.

For a system that just parses frames, such understanding is impossible, because each frame is treated as a complete (and settled) episode; moreover, the same object is registered again and again on a per-frame basis.

Object tracking, however, can in some cases restore the full picture, and it demands less computational power to identify and track a subject since, from frame #2 onwards, the system knows what it is looking for.
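A tracker can also ride out brief occlusions by keeping a ‘lost’ track alive for a few frames instead of dropping it immediately, which avoids the costly reacquisition described above. A minimal sketch of that age-out logic (detection matching is abstracted away; `max_age` is an illustrative parameter):

```python
class Track:
    def __init__(self, tid, box):
        self.tid, self.box, self.misses = tid, box, 0

def update_tracks(tracks, matched_boxes, max_age=5):
    """matched_boxes maps track ID -> new box, or is missing the ID
    when the detector lost the object this frame (e.g. occlusion)."""
    survivors = []
    for t in tracks:
        box = matched_boxes.get(t.tid)
        if box is not None:
            t.box, t.misses = box, 0       # re-acquired
        else:
            t.misses += 1                  # still occluded
        if t.misses <= max_age:            # keep the track alive
            survivors.append(t)
    return survivors

tracks = [Track(0, (10, 10, 50, 50))]
for frame in range(4):                     # occluded for 4 frames
    tracks = update_tracks(tracks, {})
tracks = update_tracks(tracks, {0: (14, 10, 54, 50)})
print([(t.tid, t.misses) for t in tracks])  # [(0, 0)] — survived
```

Had the occlusion lasted longer than `max_age` frames, the track would be dropped and the object would get a new identity on reappearance, which is exactly the reacquisition cost the text describes.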

Avoiding Overfitting

Analyzing moving pictures might not be as easy as analyzing still images.

Overfitting happens when the model fits too closely to irrelevant details in the training dataset – hence the name. It can occur when the training video is different from the data generated in the actual use case scenario or if you have too little data for training, causing the machine to learn patterns that aren’t descriptive of what you’re looking for.

For example, you’re training the model to recognize cars, but all the vehicles in the training video data are trucks. As a result, it could easily overfit the shape of an object and not recognize regular passenger cars as cars. The same could happen with color if you’re training it to identify people, but all people in the training video file are wearing bright blue.

To tackle this problem at its core, remember the first step of building an object detection model: data collection. The more diverse the training data, the less overfitting will occur. The dataset may need enhancing, for example, in video detection used for monotonous video surveillance, where the same objects constantly come and go. Techniques such as regularization, data augmentation, and collecting better datasets prevent overfitting.
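Data augmentation artificially diversifies the training set by transforming frames and their labels together. A tiny sketch of one such transform, a horizontal flip, where the bounding boxes must be mirrored along with the (omitted) pixels; the frame width and probability are illustrative parameters:

```python
import random

def hflip(frame_width, boxes):
    """Mirror bounding boxes (x1, y1, x2, y2) horizontally."""
    return [(frame_width - x2, y1, frame_width - x1, y2)
            for (x1, y1, x2, y2) in boxes]

def augment(sample, frame_width=640, p=0.5, rng=random.Random(0)):
    """Randomly mirror a labeled frame so the model cannot latch
    onto one-sided layouts (a tiny example of data augmentation)."""
    boxes = sample["boxes"]
    if rng.random() < p:
        boxes = hflip(frame_width, boxes)
    return {"boxes": boxes}

sample = {"boxes": [(100, 50, 200, 150)]}
aug = augment(sample, p=1.0)   # force the flip for illustration
print(aug["boxes"])  # [(440, 50, 540, 150)]
```

Real pipelines combine many such transforms (crops, colour jitter, blur), but each one must keep the labels consistent with the transformed pixels, as shown here for the boxes.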

Real-time video object detection

The industry imperative has shifted towards developing real-time detection systems, with offline frameworks effectively migrating to a ‘dataset development’ role for low- or zero-latency detection systems.

Some types of neural networks are well-suited to video object detection. In 2017, a collaboration between the Georgia Institute of Technology (GIT) and Google Research offered a video object detection method that achieved an inference speed of up to 15 frames per second operating on a mobile CPU.

However, the iconic example in the field is YOLO (You Only Look Once), a state-of-the-art real-time object detection system. The latest version, YOLOv7, can detect even very distant and small objects in a video. Nevertheless, this ready-made model might not be enough for solving business problems, even though YOLOv7 is the first in the YOLO family to also estimate human pose.
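On constrained hardware, one common trick for keeping detection real-time is to run the full model only on every Nth frame and reuse (or track) the results in between. A sketch of that scheduling logic with a stub detector (the stride value is an illustrative trade-off between latency and accuracy):

```python
def process_stream(frames, detect, stride=3):
    """Run the expensive detector on every `stride`-th frame and
    carry the last result forward for the frames in between."""
    results, last = [], []
    for i, frame in enumerate(frames):
        if i % stride == 0:
            last = detect(frame)   # full inference
        results.append(last)       # reused on skipped frames
    return results

calls = []
def fake_detect(frame):
    calls.append(frame)
    return [("person", (0, 0, 10, 10))]

out = process_stream(range(10), fake_detect, stride=3)
print(len(calls))  # 4 — inference ran on frames 0, 3, 6, 9 only
```

In a production system, the skipped frames would typically run a lightweight tracker rather than a plain copy, but the scheduling idea is the same.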

Best practice for video object detection

  1. Training data for object detection models should be as close as possible to the data the machine will process in real life. For example, if the use case is expected to produce low-resolution and blurry videos (e.g., CCTV footage), the data for AI training should also feature videos with these qualities. In general, training video footage should also include multiple angles and resolutions.
  2. The more different objects you want to recognize, the more data you need.
  3. If the objects to identify have a high variance (e.g., the many different cars worldwide), you should also ensure that the data you collect is varied. Machines tend to learn patterns very well. Thus, if a model only sees some types of an object, it will not necessarily recognize a significantly different instance.
  4. Teaching a machine to recognize the minute differences between similar entities requires more data on each object.
  5. Invest in labeling. The quality of labeling can make or break a machine vision system, and most of the time, quality beats quantity. If this means you have to label everything twice to ensure everything is done correctly, that is a good investment.
  6. Start small. If you aim to recognize 100 different objects, but it is possible to start with 10, then do it. You invest less in data collection and labeling, and you quickly get an idea of the accuracies you can achieve and the possible complexities in collecting the data you need.
  7. Ward off data leakage. During training, the object detection algorithm might use biased information that will be unavailable during real-life predictions. This corrupts the results: the model performs over-optimistically during the training, validation, and test stages but may predict poorly on unseen data. Some leakage examples result from bias caused by camera viewing angle or light conditions (morning/evening), or from videos with certain labels coming from specific regions, language groups, commentators, et cetera.
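Point 5's “label everything twice” pays off only if the two passes are actually compared. A minimal sketch of measuring inter-annotator agreement by IoU (box pairing is simplified to same-index boxes; real checks would match boxes first):

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def agreement(labels_a, labels_b, min_iou=0.5):
    """Fraction of box pairs on which two labelers agree."""
    pairs = list(zip(labels_a, labels_b))
    agreed = sum(iou(a, b) >= min_iou for a, b in pairs)
    return agreed / len(pairs) if pairs else 1.0

a = [(10, 10, 110, 110), (200, 50, 260, 120)]
b = [(12, 11, 108, 112), (400, 50, 460, 120)]  # second box disagrees
print(agreement(a, b))  # 0.5
```

Low agreement on a batch is a signal to revisit the labeling guidelines before training, which is usually far cheaper than debugging a model trained on inconsistent labels.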

Using video object detection is becoming increasingly important in today’s world, and AI researchers are working hard to make this technology more accurate and closer to how a human being would analyze their environment.

Having delivered over 80 artificial intelligence projects involving cutting-edge computer vision applications, we are sure that AI-enhanced data processing provides considerable efficiency, productivity, and profitability benefits.

MindTitan specializes in providing computer vision development services that help to solve complicated business problems when ready solutions cannot help or when complex integration with other AI models is required.


