Managing Computer Vision Projects – Part 1: Your Dataset

Enterprises looking to implement computer vision projects with real-world impact need to understand each stage of the computer vision project management cycle. This article will explain guidelines for machine learning project management, including how to build high-quality computer vision datasets. It will also discuss how to clean a dataset for use by deep neural networks and learning algorithms.

Companies with previous experience building or using AI tools or business intelligence software may be familiar with the Cross-Industry Standard Process for Data Mining (CRISP-DM). This process is a conceptual workflow popular among data science teams. A modified version of the CRISP-DM below overlays the six steps of the computer vision-specific project life cycle.

The computer vision project life cycle mapped to the CRISP-DM.

Step 1: Define An Objective For Your Computer Vision Project


The most important step in any computer vision project is to establish a clear objective for your machine learning algorithm to achieve. Objectives vary widely by use case, and your choice will determine the type and quantity of data required for a high-quality computer vision dataset. Here are some common objectives:

Types of computer vision objectives. Source: O’Reilly



Classification

Classification is one of the most common objectives for computer vision projects. It is also a strong foundation on which to build more advanced vision systems by combining additional components.

The goal of classification is to use the raw visual data contained within an image to assign a certain conceptual label, or “class” to that image.

Most classification algorithms return one single choice out of a number of possible classes – for example, telling you whether the object in the image is an apple or an orange.

Object classification output.

Imagine you run a recycling facility that processes and sorts recyclable bottles, cans, and paper products. Your machinery puts all of these items together onto a long conveyor belt that terminates in a sorting mechanism. A computer vision classification algorithm could quickly label each item as a certain class – bottle, can, or paper product. Regardless of the shape, size, color, or density of a bottle, the algorithm would still label that as “bottle” and sort it effectively.
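To make the single-choice behavior concrete, here is a minimal sketch in Python; the class list and the score values are invented for the recycling example, and a real system would obtain scores from a trained model:

```python
import numpy as np

CLASSES = ["bottle", "can", "paper"]

def classify(scores):
    """Return the single most likely class label from raw model scores."""
    # Softmax turns raw scores into probabilities that sum to 1.
    exp = np.exp(scores - np.max(scores))
    probs = exp / exp.sum()
    return CLASSES[int(np.argmax(probs))], float(probs.max())

# Hypothetical scores a trained model might emit for one conveyor-belt item.
label, confidence = classify(np.array([4.1, 1.2, 0.3]))
print(label)  # "bottle" – one single choice, regardless of the item's shape or color
```

Whatever the item looks like, the output is always exactly one label from the predefined list.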



Tagging

Tagging is an objective that seeks to describe an overall image with a series of labels. The output of this type of system might contain one label, several labels, or no labels at all.

The difference between tagging and classification as an objective is that tagging seeks to recognize multiple objects or concepts within an image and apply labels that correspond to each. These labels are additive, meaning any number of them can be applied to build up a multi-dimensional description of what the overall image contains.

Image tagging output. Source: Cloudinary

The digital media collection and distribution site Pinterest uses tagging, among other machine learning-based processes, to process the thousands of images uploaded to their site daily. Tagging algorithms process images and apply labels such as color, brightness level, contrast level, etc. Each image may contain dozens of different tags. Machine learning techniques allow quick and easy retrieval of all similar images based on their shared tags through this automated process.
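The additive nature of tags can be sketched as follows; the tag names, scores, and threshold are illustrative, not Pinterest’s actual pipeline:

```python
import numpy as np

TAGS = ["outdoor", "food", "bright", "high-contrast", "person"]

def tag_image(scores, threshold=0.5):
    """Apply every tag whose independent (sigmoid) probability clears the threshold."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(scores, dtype=float)))
    # Unlike classification, tags are additive: zero, one, or many may apply.
    return [t for t, p in zip(TAGS, probs) if p >= threshold]

# Hypothetical raw scores for one uploaded image.
print(tag_image([2.3, -1.0, 0.8, 1.5, -3.2]))  # ['outdoor', 'bright', 'high-contrast']
```

Because each tag is scored independently, the same machinery can return an empty list for an image that matches nothing.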



Detection

Detection is the process by which vision systems find specific objects and localize them to a certain area within an image. If you care where in your image the object is found, this is your objective.

The output of detection algorithms is a bounding box that outlines the location of the object.

Object detection on a construction site. Source: YouTube

A valuable application of an object detection algorithm might include crowd detection and counting. A computer vision system can detect each person within its field of view and localize them to a certain area within the photograph. The number of boxes can then be counted to understand the overall numbers of people in the photograph.
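The counting idea can be sketched as follows; the detection records are hypothetical stand-ins for a real detector’s output:

```python
def count_people(detections, min_confidence=0.5):
    """Count bounding boxes labeled 'person' above a confidence threshold."""
    return sum(1 for d in detections
               if d["label"] == "person" and d["confidence"] >= min_confidence)

# Hypothetical detector output: each box is (x, y, width, height) in pixels.
detections = [
    {"label": "person", "box": (34, 60, 40, 110), "confidence": 0.91},
    {"label": "person", "box": (210, 55, 38, 105), "confidence": 0.87},
    {"label": "dog",    "box": (120, 140, 60, 45), "confidence": 0.78},
    {"label": "person", "box": (300, 70, 35, 100), "confidence": 0.32},  # below threshold
]
print(count_people(detections))  # 2
```

The confidence threshold is a tuning knob: lowering it counts more boxes at the cost of more false positives.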



Segmentation

Segmentation is an advanced form of detection that operates at very fine resolution (pixel-level precision). Its purpose is to delineate objects in an image according to their exact boundaries and separate them from each other.

This process is highly attuned to minute differences in the raw visual data. Its most common applications are those in which the risk to human health is very high, e.g., autonomous vehicle development or healthcare applications.

Pixel-level segmentation for autonomous vehicles. Source: MIT Deep Learning

This increase in detection efficacy brings more innovative projects into the realm of possibility. However, segmentation-based computer vision projects make several important tradeoffs. Training ML models to detect and resolve objects at such a fine degree of resolution comes with greater overhead costs related to raw data collection and processing power. Because these tradeoffs have tangible effects on project cost, complexity, and overall likelihood of success, it is absolutely necessary to determine beforehand the level of detection resolution that your project requires.

Note: for many healthcare AI applications like identifying cancerous components of cells or other types of tissue diagnoses, the pixel-level precision achieved through segmentation is a required benchmark to consider the model useful and ethical for use by humans.
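At its simplest, a segmentation output is a per-pixel class map rather than a set of boxes. A toy sketch (the class ids and mask values are invented):

```python
import numpy as np

# A hypothetical 6x6 segmentation mask: every pixel carries a class id
# (0 = background, 1 = road, 2 = vehicle) rather than one box per object.
mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 2, 2, 1, 0],
    [0, 1, 2, 2, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
])

def class_area(mask, class_id):
    """Area of a class in pixels – the exact boundary, not a rectangle around it."""
    return int((mask == class_id).sum())

print(class_area(mask, 2))  # 4 pixels of 'vehicle'
```

Real masks are the size of the input image, which is where the extra data collection and compute cost comes from.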

Video Tracking


Tracking is a task involving both classification and detection (applying a label to an object and localizing that object within an image) that occurs repeatedly throughout a frame-by-frame series within a video.

Video tracking system for foot traffic. Source: Stanford Vision Lab

Video tracking is a complex problem. Not only must the algorithm classify and detect objects, but it must do so in an environment that shifts dynamically over time, whereby objects can pass in front of each other or behind others. 

How does an algorithm keep track of, say, several individuals moving in and out of a crowd? How does it know that a person is still the original object when he or she disappears behind someone else, then reappears? To what degree of thoroughness does our project need to track these objects?

These are the types of questions, constraints, and factors you must decide ahead of your project or risk running into expensive and time-intensive problems further down the road.
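One common, simplified approach to the re-association problem is to match each existing track to the new detection it overlaps most, measured by intersection-over-union (IoU). The sketch below is a minimal illustration, not a production tracker (real systems add motion models and appearance features to survive occlusions):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def match_tracks(tracks, detections, min_iou=0.3):
    """Carry each track id to its best-overlapping new box; unmatched boxes get new ids."""
    assigned, used = {}, set()
    next_id = max(tracks, default=-1) + 1
    for tid, prev_box in tracks.items():
        candidates = [d for d in range(len(detections)) if d not in used]
        if candidates:
            best = max(candidates, key=lambda d: iou(prev_box, detections[d]))
            if iou(prev_box, detections[best]) >= min_iou:
                assigned[tid] = detections[best]
                used.add(best)
    for d, box in enumerate(detections):
        if d not in used:  # a brand-new object entered the frame
            assigned[next_id] = box
            next_id += 1
    return assigned

# Frame 1 tracks two people; in frame 2 their boxes have shifted slightly.
frame1 = {0: (10, 10, 50, 120), 1: (200, 15, 240, 125)}
frame2 = [(205, 18, 244, 128), (14, 12, 54, 122)]
tracked = match_tracks(frame1, frame2)
print(tracked)  # id 0 keeps the left person, id 1 keeps the right person
```

Even this toy version shows why tracking is hard: if two people cross paths, pure overlap matching can swap their identities.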

Action recognition


Action recognition is an objective whereby a computer vision algorithm recognizes the position and orientation of the human body and can detect or predict actions as they occur in real time. Some use cases for this function include posture analysis for security purposes, fall prevention in the elderly, and correcting movement imbalances in physical therapy. Posture recognition is also very popular with sports organizations for optimizing the training of specific movements such as pitching, punting, and swinging.

Human pose estimation algorithm. Source: KDNuggets

These are the most common ‘North Star’ objectives that will determine the course of your project.

Step 2: Identify Your Computer Vision Project Data


Collection and organization of a high-quality computer vision dataset is the second most critical step in the machine learning project management cycle. This step can make or break a nascent computer vision system. Machine learning models can produce exceptional insight, but only when they are nourished with high-quality, well-labeled, and relevant data that accurately encapsulates the problem space.

Remember: garbage in = garbage out.

The 4 Pillars of Great Data

Quantity – “enough volume of images”

The use of deep learning in modern computer vision systems requires tremendous amounts of visual data to function correctly. Because deep learning learns relevant features using a bottom-up approach, the overall quantity of images must be large enough for the model to build an accurate representation of each individual component within an image.

This number will vary by project scope, objective, and overall level of complexity. But generally, these algorithms require anywhere from a few thousand images up to a million images to achieve high levels of performance.

Before investing time and resources into data procurement, remember that your level of desired performance dictates the number of images you will need. One of the advantages of ML-based computer vision over traditional methods is the ability to validate a model using a smaller set of images, and then supply more images over time to hit your desired performance goals.
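The grow-the-dataset workflow can be illustrated with a toy numeric stand-in for image features; the 1-D “features”, the tiny centroid classifier, and the sample sizes below are all invented for illustration:

```python
import random

random.seed(0)

def make_point(cls):
    """Toy 1-D stand-in for image features: class 0 clusters near 0.0, class 1 near 5.0."""
    return (random.gauss(cls * 5.0, 1.0), cls)

def centroid_classifier(train):
    """Tiny nearest-centroid model standing in for a real vision model."""
    c0 = [x for x, c in train if c == 0]
    c1 = [x for x, c in train if c == 1]
    m0, m1 = sum(c0) / len(c0), sum(c1) / len(c1)
    return lambda x: 0 if abs(x - m0) < abs(x - m1) else 1

test_set = [make_point(i % 2) for i in range(200)]

# Validate on a small training set first, then grow it toward performance goals.
for n in (4, 20, 100):
    train = [make_point(i % 2) for i in range(n)]
    model = centroid_classifier(train)
    acc = sum(model(x) == c for x, c in test_set) / len(test_set)
    print(f"{n:>3} training examples -> accuracy {acc:.2f}")
```

The same iterative pattern applies to real image models: validate cheaply on a small set, then invest in more data only once the approach looks viable.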

Diversity – “images from a diverse set of perspectives, colors, or orientations”


In addition to accurately-labeled, relevant data, high-performance models demand a high level of diversity in your initial training set. Diversity means including images of your object of interest under different lighting conditions, in different orientations, from different perspectives, and in various colors. Taking multiple images of an apple under different conditions, while keeping the apple as the central thematic focus, will teach the machine to discern the pixel-level differences between the object itself and its environment. These techniques cement its understanding of the characteristics of apples and deepen its ability to classify them as such in unknown environments.

Another important element is to include enough ‘negative’ examples of images. In other words, provide enough labeled images of instances where the object of interest does NOT appear. Inclusion of negative examples adds an additional dimension to the model’s learning process.

Finally, sufficiently diverse computer vision datasets include a healthy mix of good-quality and bad-quality images. A common mistake is to train your model on an image set composed mostly of good images, which shelters it from the reality of the data it will encounter in a live environment. This mistake can cause overfitting and a decline in performance on unseen data.

Later, we discuss some simple ways to automatically generate more training examples that improve the diversity of your training set.

Accuracy – “relevant images and accurate labels”


Labeling data accurately is incredibly important. Poorly-labeled training examples can influence the performance of your model in unforeseen ways – even a couple of incorrect examples can drastically reduce the performance of your model or affect its ability to generalize onto unseen data. Remember: computer vision systems know only what you tell them upfront. If you tell the system that the object in view is X and not Y, that label will form the basis of its observations moving forward.

Your training images should be highly similar to those you wish to classify when deploying your machine learning model in live production. A common pitfall is to supply the vision model with generic images found online and expect good performance in a live environment. While this may work in some cases, subtle differences in backgrounds, texture, color contrast, and orientation can often cause the model to learn incorrectly. Combined with indiscriminate and incorrect labeling, these subtle forces compound and degrade the performance of your model.

Quality – “the right images in high-quality resolution”


Most training data images should be of high quality, with objects of interest in full view and easily identifiable. With regard to pixel resolution, there are no hard-and-fast rules. Higher resolution generally correlates with higher performance, but the requirements of your project depend on the performance threshold your model seeks to achieve. In many cases, there are diminishing returns past a certain resolution threshold.

One simple but useful criterion is to observe an image example and ask, “could a human being identify the object of interest in this picture?” If a human can perform the task, the training example is likely of sufficient quality.

A note on privacy: many of the most attractive computer vision projects either involve human beings directly or rely on image data that may be considered private. Examples include facial recognition and surveillance, and image data such as license plates. Project managers should always consider ethical and privacy concerns when gathering and manipulating data of this nature.

Computer Vision Project Data: 4 Ways To Build a Great Dataset


Use proprietary internal data


The first method for building computer vision datasets is to use data that your organization already possesses in sufficient quantity and that fits the characteristics outlined above. Many digitally sophisticated firms maintain organized records of their visual data, whether that’s inspection reports, manufacturing defect tracking, or various data supplied by customers. Other firms have large volumes of unstructured data, which contain relevant images but require large amounts of pre-processing to massage them into a usable form.

“The design and installation of a centralized system for storing and tracking data will become critical over the coming years to the success of any enterprise looking to implement machine learning projects”

– Dimitry Fisher, CAIO, Dynam.AI

This type of data is ideal for ML projects because it is necessarily the most aligned with your business case, and it is free if already available in sufficient quantities.

Build project-specific data capture mechanisms


If companies do not possess the data they need and cannot acquire it from a third party, one option is to design and implement a method to capture that data themselves.

These methods include installation of IoT devices such as cameras, video surveillance equipment, sensor networks, and drones, or even the design of a mobile application. The best-known examples of data capture mechanisms in popular culture include applications like Instagram, Snapchat, and Facebook, which serve as mechanisms to collect, aggregate, sort, and analyze tremendous amounts of visual data.

Companies need to be creative in how they approach building their computer vision dataset. Take a maritime inspection software company as an example: the company might decide to design a free mobile phone application for vessel owners that would enable them to submit photos of damage to their vessels in exchange for a free quote on repairs. This application would require investment in software engineering plus the domain expertise required to make an accurate repair diagnosis.

But the payoff comes in the form of visual data aggregation – all user-generated, with real examples of damage and automated, accurate data labels according to boat type, color, and so on. This data can be used to build, enhance, or refine a proprietary dataset for model training, one that is unavailable to competitors. For a machine learning-based technology company looking to solidify market leadership with a new, innovative computer vision product, this data is pure gold.

Find publicly available computer vision datasets


Another source of data is publicly available image datasets. Companies can locate these datasets online, through partnerships with other firms or third-party vendors, academic institutions, or government agencies. The advantage of procuring this type of data is that it is usually well-organized, large-scale, and generally built around impactful problem spaces (cancer-related data, for example).

Check out some of the most well-known, public computer vision datasets here, as well as more uncommon ones.

Companies that find third-party data avoid the overhead cost of gathering the data themselves, although that data may be of questionable quality or require further investigation to determine suitability for their exact purpose. Furthermore, publicly available data may be used to validate proof-of-concept projects before the company invests in the collection of a superior computer vision dataset.

The disadvantage of publicly available data is just that – it is available for everyone to feed into their models. Take the example of a company looking to build a completely innovative solution for automated visual inspection of maritime vessels. They locate a dataset composed of photographs of the hulls of ships from a third-party vendor and begin training their model.

Meanwhile, a competitor has also trained their detection algorithms on the same dataset. In the absence of additional data, the two companies have essentially created an equivalent product. Performance is taken off the table as a differentiating factor, and the two products must compete for profitability inside a narrower competitive landscape.

Check out this article describing the competitive advantages of proprietary data for AI-based technology initiatives.

Generate synthetic data


Companies that have an existing computer vision dataset can enhance the quality and diversity of that dataset using a technique known as synthetic data generation. This technique involves image processing and manipulation according to a series of predefined transformations to generate multiple training examples from a single image.

Potential transformations include:

Horizontal and vertical flipping: useful for classic object detection where orientation does not matter.

Brightness and contrast: useful for training flexible models that are effective under different lighting conditions.

Rotation: useful for 3D image classification where a model must classify objects under different orientations with small degrees of rotation.

Color manipulation: changing the hue or saturation, or producing grayscale versions of each image, can teach a machine to recognize color distinctions.

Warning: Be cognizant of your objective when applying this technique. Indiscriminate use may taint your dataset with examples contrary to your objective. For example, horizontal flipping is not applicable to a problem space built around classifying mirror-image objects such as left- and right-handed gloves; flipping would confuse the model and result in inaccurate classifications.
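Assuming a grayscale image represented as a NumPy array, the transformations listed above might be sketched like this (the scaling factors are arbitrary examples):

```python
import numpy as np

def augment(image):
    """Generate several training examples from one image via simple transformations."""
    variants = [image]
    variants.append(np.fliplr(image))               # horizontal flip
    variants.append(np.flipud(image))               # vertical flip
    variants.append(np.rot90(image))                # 90-degree rotation
    variants.append(np.clip(image * 1.3, 0, 255))   # brightness increase
    variants.append(np.clip((image - 128) * 1.5 + 128, 0, 255))  # contrast stretch
    return variants

# A tiny stand-in for a grayscale image (pixel intensities 0-255).
image = np.array([[10, 200], [60, 130]], dtype=float)
print(len(augment(image)))  # 6 training examples from a single source image
```

Each transformation could be dropped or parameterized per the warning above – e.g., omitting `np.fliplr` for mirror-sensitive classes.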

Synthetic Data In Action


Dynam.AI used synthetic generation techniques in a project with Titleist, the nation’s leading golf club manufacturer.

The Dynam.AI team built a mobile application for Titleist that used deep learning-based optical character recognition (OCR) to automatically determine the exact configuration of their customers’ golf clubs out of dozens of possible options, and log that information with high precision. There was just one problem. Titleist produces new golf clubs frequently and changes the characters inscribed on its clubs often. Retraining the computer vision model for each new set of clubs proved to be prohibitively expensive.

Synthetic data generation was the answer. The Dynam.AI team applied these techniques to generate tens of thousands of synthetic images involving different characters, shapes, and configurations in order to train the computer vision model at a much lower cost. This technique reduced ongoing maintenance costs by 90% and moved the entire project from untenable to feasible.

Step 3: Prepare Your Computer Vision Dataset


We’ve established that data quality is critically important to the success of your project, but collection is only half the battle. The other half – data cleaning – falls squarely on the shoulders of your data science team, who must get that data into a usable form.

Data labeling is an important function within this step. Proper labeling provides your neural network with the correct classifications, or ‘ground truth’, it needs to learn. This source of truth is fundamental to successful model training.

For computer vision projects, classification is often the foundational objective that supports the entire model. For this reason, the definition of concepts – or ‘ontology’ – is of extreme importance. An ontology is the dictionary that data science teams use to define the labels that they apply to the objects surfaced during live classification.

These labels depend on the characteristics of each object, but more importantly, the differences between these objects. If you’re training a model to classify labradors and poodles, what label would you apply to an image of a labradoodle?

You could simply add another class – labradoodles – to your potential list, but this is a slippery slope. You now need enough training images of this new class to accurately categorize future prospects, and as a result, the complexity of your model increases along with training time and associated overhead costs.

The ability to navigate the labyrinth of decision-making in this arena is the reason that data science talent is in such high demand. This talent exists on a continuum. Data augmentation insights and techniques possessed by individuals on the far right of this continuum can fundamentally alter the performance of your project. One insightful transformation applied to a single column in a dataset can drastically improve a model’s performance and transform that project from a failure into a success. These decisions are make-or-break for computer vision projects.

Companies can either label their computer vision dataset in-house or use an outsourced team to perform this process.

In-house labeling


Data labeling will require the use of an interface, especially in more complex problem spaces involving object detection or localization, where your data scientists must manually draw bounding boxes or segment specific areas.

Popular data-labeling interfaces include VGG Image Annotator as well as LabelMe, an interface built by MIT researchers.
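As a rough illustration of what labeled output looks like, here is a generic bounding-box annotation record with a simple sanity check; the field names are invented for illustration, not the schema of any particular tool:

```python
import json

# A minimal, generic annotation record of the kind labeling tools export.
annotation = {
    "filename": "conveyor_0001.jpg",
    "size": {"width": 1280, "height": 720},
    "objects": [
        {"label": "bottle", "bbox": [412, 188, 96, 240]},   # x, y, width, height
        {"label": "can",    "bbox": [780, 310, 64, 110]},
    ],
}

def validate(annotation):
    """Reject records with out-of-frame boxes before they reach model training."""
    w, h = annotation["size"]["width"], annotation["size"]["height"]
    for obj in annotation["objects"]:
        x, y, bw, bh = obj["bbox"]
        if x < 0 or y < 0 or x + bw > w or y + bh > h:
            return False
    return True

print(validate(annotation))  # True – both boxes fit inside the image
print(json.dumps(annotation, indent=2))
```

Simple validation passes like this catch labeling mistakes early, before they silently degrade model training.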

Out-sourced data labeling


Firms exist that will shoulder the responsibility of labeling high volumes of data accurately and efficiently. Outsourcing to these types of third parties can reduce the overall burden on project managers, shorten the time to completion, and free up data scientists to work on more complex aspects of your project.

The disadvantages include the risk of inaccurate labeling, which can defeat the entire purpose. Additionally, this method adds security-related risk by exposing sensitive or proprietary data to third parties. These concerns highlight the importance of vendor selection as part of your planning process.

To be continued…


This article is part of a series of articles outlining the entire computer vision project management process from start to finish. You can download the entire 33-page PDF “2020 Executive Guide to Computer Vision in the Enterprise” below:

About the author

Dr. Michael Zeller has over 15 years of experience leading artificial intelligence and machine learning organizations through business expansion and technical success. Before joining Dynam.AI, Dr. Zeller led innovation in artificial intelligence for global software leader Software AG, where his vision was to help organizations deepen and accelerate insights from big data through the power of machine learning. Previously, he was CEO and co-founder of Zementis, a leading provider of software solutions for predictive analytics acquired by Software AG. Dr. Zeller is a member of the Executive Committee of ACM SIGKDD, the premier international organization for data science, and also serves on the Board of Directors of Tech San Diego. He is an advisory board member at Analytics Ventures, Dynam.AI’s founding venture studio.