What Is Computer Vision? Why Deep Learning Changed It All

What Is Computer Vision?

Machine learning, and deep learning in particular, has transformed computer vision in just a few short years. Computer vision technology is one of the most promising areas of research within artificial intelligence and computer science, and it offers tremendous advantages for businesses in the modern era.

At its heart, the field of computer vision focuses on designing computer systems that can capture, understand, and interpret important visual information contained within image and video data. Computer vision systems then translate this data, using contextual knowledge provided by human beings, into insights used to drive decision making. Turning raw image data into higher-level concepts so that humans or computers can interpret and act upon them is the principal goal of computer vision technology.

An important distinction must be made between computer vision and image processing. Image processing is the science of transforming one image into a new image with certain enhanced characteristics. These changes include increasing resolution, normalizing brightness and contrast, cropping, blurring, or any other digital transformation needed for a specific purpose. Digital image processing does not take into consideration the actual content of the image – it is simply a series of mechanical transformations undertaken to alter the image for some defined purpose.
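To make "mechanical transformation" concrete, here is a minimal sketch in Python with NumPy of two such content-blind operations, brightness normalization and cropping. The tiny image and function names are illustrative, not from any particular library:

```python
import numpy as np

# A synthetic 8-bit grayscale "image": just a grid of pixel intensities.
image = np.array([[ 10,  20,  30,  40],
                  [ 50,  60,  70,  80],
                  [ 90, 100, 110, 120],
                  [130, 140, 150, 160]], dtype=np.uint8)

def normalize_brightness(img):
    """Linearly stretch pixel values to span the full 0-255 range
    (assumes the image is not uniform)."""
    img = img.astype(np.float64)
    stretched = (img - img.min()) / (img.max() - img.min()) * 255
    return stretched.astype(np.uint8)

def crop(img, top, left, height, width):
    """Return a rectangular sub-region of the image."""
    return img[top:top + height, left:left + width]

enhanced = normalize_brightness(image)
patch = crop(enhanced, 0, 0, 2, 2)
print(enhanced.min(), enhanced.max())  # 0 255
print(patch.shape)                     # (2, 2)
```

Neither function knows or cares whether the pixels depict a cat or a circuit board – that indifference to content is exactly what separates image processing from computer vision.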

Computer vision, on the other hand, is about enabling computers to process visual data and extract insights from that data. The content is important, as is the ability to translate raw pixels into a form interpretable by humans or other computer systems. The goal is to teach computers how to identify, classify, and categorize the visual world as we do.

The Challenge in Computer Vision

Human vision systems have the tremendous advantage of being informed by a lifetime of experiential knowledge that helps to contextualize the data within your field of view. Your eyeballs capture visual information — the image of a cat, for example — and your prior experience interprets this collection of reflected light and relates it to the concept of a cat. The complexity of our visual perception system and its close relationship to our memory and higher reasoning capabilities give this visual data the context it needs to provide value in day-to-day activities.

Computers lack these faculties, but machine learning algorithms can mimic them. As it turns out, though, teaching machines to perform this basic human function, proudly demonstrated by five-year-olds all over the world, is exceptionally difficult. Solving this problem continually occupies the brightest minds in AI research.

Old School Computer Vision Systems

Prior to 2012, the design of computer vision systems looked remarkably different than it does today.

If we were to tease apart our understanding of a cat, we would see that a cat is really an amalgamation of several different “features”. These features include a head, ears, a body, four legs, and a tail – all of which combine to trigger our memory systems and higher-order cognitive functions to produce the conceptual understanding that we are seeing a cat. We can break these features down even further: a head is composed of two eyes and a nose, and legs are long, roughly cylindrical shapes with paws attached, each with four oblong-shaped toes.

Training computer vision systems used to involve following this process all the way down to the smallest granular unit of visual data – the pixel. The system records and evaluates digital images on the basis of raw pixel data alone, where minute differences in pixel density, color saturation, and levels of lightness and darkness determine the structure, and therefore the identity, of the larger object. A particular arrangement of pixels may indicate a whisker on a cat’s nose or a human ear, for instance. The image below shows the constituent features of an image broken down into a set of pixel densities.
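To see how attributes like lightness and saturation fall out of raw pixel numbers, here is an illustrative sketch in plain Python. The formulas are the standard HSL lightness and saturation, scaled to 8-bit channel values; the pixel itself is made up:

```python
# A single pixel is just three numbers: an (R, G, B) triple in 0-255.
pixel = (200, 180, 40)  # a yellowish pixel

def lightness(rgb):
    """HSL lightness: average of the brightest and darkest channels."""
    return (max(rgb) + min(rgb)) / 2

def saturation(rgb):
    """HSL saturation, 0.0 (pure gray) to 1.0 (fully saturated)."""
    mx, mn = max(rgb), min(rgb)
    if mx == mn:
        return 0.0  # all channels equal: a shade of gray
    l2 = mx + mn  # twice the lightness
    return (mx - mn) / (255 - abs(l2 - 255))

print(lightness(pixel))             # 120.0
print(round(saturation(pixel), 3))  # 0.667
```

Early systems had to reason upward from exactly these kinds of numbers, pixel by pixel, toward the structure of the larger object.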

How computers interpret images from raw pixel data. Source: Stanford University Computer Vision Lab

Early computer vision techniques relied on extensive manual effort to build rules-based systems that could detect and classify certain groups of these pixel arrangements. Human beings manually selected what they believed were the relevant features of individual objects. They explicitly told the machine, “cats are made of legs, legs are made of thighs and paws, and paws are made of toes.”

Engineers codified each of these components of a cat into a computer as rigid rules that could detect these features within the image. For 30 years, the field of computer vision technology relied on these burdensome, manually-crafted feature detectors to sort and classify image data.
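To illustrate what such a hand-crafted feature detector looked like, here is a sketch of the classic Sobel kernel for vertical edges, implemented with NumPy. The threshold and the test image are made up for the example:

```python
import numpy as np

# A hand-crafted 3x3 Sobel kernel: responds strongly to vertical edges.
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

def detect_vertical_edges(img, threshold=100):
    """Slide the kernel over the image; mark pixels whose response
    exceeds the threshold as edge pixels."""
    h, w = img.shape
    edges = np.zeros((h - 2, w - 2), dtype=bool)
    for y in range(h - 2):
        for x in range(w - 2):
            response = np.sum(img[y:y + 3, x:x + 3] * SOBEL_X)
            edges[y, x] = abs(response) > threshold
    return edges

# A dark-to-bright vertical boundary: the kind of "feature" engineers
# had to specify explicitly, one rigid rule at a time.
img = np.zeros((5, 6))
img[:, 3:] = 255  # right half bright
edges = detect_vertical_edges(img)
```

The kernel fires only on the boundary columns it was designed for – rotate or rescale the pattern and a new, separate rule is needed, which is precisely why these systems were so brittle.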

These mechanisms were inflexible, difficult to improve or alter, and very time-consuming to produce manually for each new application or object in need of detection. In addition, traditional computer vision techniques tend to fail when the number of classes the model must distinguish increases or when image clarity decreases. Simple changes in object size, rotation, or orientation would break these systems. To achieve high performance, vision systems must be invariant to these factors.

Computer Vision and the Rise of Machine Learning

Advances in machine learning forever altered the trajectory of computer vision technology. Deep learning, in particular, made computer vision algorithms highly effective in the real world. The advent of the convolutional neural network made computer vision feasible for industrial applications and cemented the technology as a worthy investment for companies looking to automate tasks.

Traditional machine vision techniques begin with a top-down prescription of the components that constitute the image – its “features”. Deep learning models flip this process on its head. The deep neural network training process uses massive data sets and countless training cycles to teach the machine, from the bottom up, what a cat looks like. During training, the algorithm automatically extracts the features relevant to ‘cats’ in general. This process produces a model that can be applied to previously unseen images to produce an accurate classification. The image below compares the traditional machine learning process for image recognition and object detection with a deep learning-based approach.

Deep learning workflow for computer vision.

Deep learning in computer vision was made possible by the abundance of image data in the modern world, plus a reduction in the cost of the computing power needed to process it. Large-scale image data sets like ImageNet, Cityscapes, and CIFAR-10 brought together millions of images with accurately labeled features for deep learning algorithms to feast upon. Seemingly overnight, the performance of deep learning algorithms surpassed thirty years of work on manual feature detectors.

Feeding a sufficient number of well-labeled images to a deep learning-based visual system enables it to understand the exact pixel-level nuances that define the individual components of the larger image. It will automatically learn where the edges are, and how particular combinations of edges, which differ in color and contrast from each other and from the background, combine to form certain features. The image below shows how a deep learning system may identify the features of a cat.

Deep learning classification of an image of a cat.

Higher-order convolutional layers of the neural network will begin to understand that if an image contains four legs, a head, a tail, and a body, it may contain a cat. From raw pixel-level data, the machine returns a higher-order concept – “cat” – based on the sequential addition and classification of these individual components. The image below demonstrates this progressive understanding in the context of human facial recognition.

Feature extraction in convolutional neural networks.

The difference is that traditional vision systems involve a human telling the machine what should be there, whereas a deep learning algorithm automatically extracts the features of what is there. This bottom-up approach is vastly more effective for certain kinds of image analysis problems, many of which we encounter in daily life.

The difference is subtle, and perhaps not obvious in the context of our cat. When diagnosing a tissue sample, however, these differences become valuable. Small, imperceptible fluctuations in pixel density can signify the early onset of cancer – details that even expert pathologists might miss. Human visual perception hits a resolution barrier at about 2,290 pixels per inch and a movement-perception ceiling at about 30 frames per second.

Additionally, humans fatigue when performing repetitive observations across many images. This fatigue leads to poor business outcomes, as with visual inspection in manufacturing quality control. It also poses risks to human life, as in many medical disciplines and in professions such as infrastructure or aircraft maintenance. Automated inspection processes can save money and lives in these domains.

The ability of computer vision systems to operate with pixel-level precision, iterate rapidly, and perform consistently over time offers incredible potential to augment or outperform human perception.

Today’s Technology Trends – A ‘Perfect Storm’ For Commercialized Computer Vision

There are four main factors driving the widespread commercialization of computer vision technology for use in industry.

AI / Machine Learning Algorithms

Advances in AI and machine learning algorithms, specifically deep learning techniques, made it possible to analyze the mountains of information present in the modern age. Buoyed by the declining cost of compute power, deep learning algorithms can crunch billions of pieces of data to produce models that are orders of magnitude more complex than their predecessors. Algorithmic automation of the neural network training process has produced tremendous gains in efficiency.

The proliferation of pretrained, open-source machine learning models within the data science community democratized access to the latest techniques. These advances extended the ability of machine learning engineers to experiment rapidly with large data sets and solve increasingly complex computer vision problems. Below we can see the evolution from simple neural networks into architectures designed especially for vision applications.

Chart showing the evolution of neural network design.

Data Abundance

The sheer volume of data in the modern world can supply machine learning models with all of the raw material they need. Social media data, mobile data, streaming data from IoT devices, and the growing digitization of entire industries are the fuel that deep learning networks need to become hyper-efficient at certain tasks. The underlying infrastructure that supports corporate digital transformation initiatives is now commercialized, accessible, and being adopted at scale. The business world is now capable of organizing its massive amounts of data in a sophisticated and centralized way.

Graph of worldwide data volume over time (measured in zettabytes). Source: Statista


Connectivity

Exponential growth in connectivity allows data to be transferred quickly and efficiently. IoT sensors, mobile phones, cameras, and drones have created massive amounts of new image data. The internet, and the fiber optic cable network that supports it, connects nearly every person on Earth. Ubiquitous wireless connectivity created powerful arteries for large-scale image data exchange, converging on centralized access points in the cloud. Computer vision software deployed locally, “at the edge,” can receive inputs and transmit output data anywhere in the world.

Graph of the number of connected devices over time (in billions). Source: Statista

Computational Power 

The cost of computing has dropped exponentially over the last few decades. One estimate indicates that the amount of computing power available per dollar (measured in FLOPS or MIPS) has increased by a factor of ten roughly every four years over the last quarter of a century. This cost reduction stems from the widespread commercialization of large-scale data centers, advances in semiconductor materials, and the development of specialized hardware purpose-built for training neural networks (TPUs, for instance). The chart below highlights the exponential decline in the cost of computing from 1956 through 2019.
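Taken at face value, that estimate compounds dramatically. A quick back-of-the-envelope calculation in Python (the growth rate is the rough estimate quoted above, not a precise figure):

```python
# Estimate quoted above: compute per dollar grows ~10x every 4 years.
def growth_factor(years, tenfold_every=4.0):
    """Total multiplier on compute-per-dollar after `years` years."""
    return 10 ** (years / tenfold_every)

# Over a quarter of a century, that compounds to roughly a
# million-fold gain in computing power per dollar.
factor = growth_factor(25)
print(f"{factor:,.0f}")  # about 1,778,279
```

A model that would have cost millions of dollars to train 25 years ago can, by this estimate, be trained today for a few dollars of compute.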

History of the price of computing power. Source: Nielsen

These are the rising tides whose collective force paved the way for modernized, commercial applications of computer vision technology. Machine learning techniques such as deep learning accelerated computer vision technology development above and beyond the threshold required to find business value. Visual tasks such as image classification, object recognition, and image segmentation can now achieve high performance in a cost-effective manner for enterprise deployment. Widespread adoption is an inevitable reality.

Dynam.AI Computer Vision Solutions

Dynam.AI offers end-to-end AI solutions for companies looking to capitalize on the forces sweeping the modern business landscape. Our multidisciplinary team of AI experts, machine learning engineers, and data scientists has produced innovative solutions for the largest companies in the country. We have relevant experience spanning healthcare, financial services, infrastructure, and manufacturing. Our deepest expertise lies in deep learning for computer vision and the integration of innovative new AI technologies into traditional business operations.

Request a free consultation at Dynam.AI today to learn how computer vision and machine learning can transform your enterprise.

Dr. Michael Zeller

Dr. Michael Zeller has over 15 years of experience leading artificial intelligence and machine learning organizations through business expansion and technical success. Before joining Dynam.AI, Dr. Zeller led innovation in artificial intelligence for global software leader Software AG, where his vision was to help organizations deepen and accelerate insights from big data through the power of machine learning. Previously, he was CEO and co-founder of Zementis, a leading provider of software solutions for predictive analytics acquired by Software AG. Dr. Zeller is a member of the Executive Committee of ACM SIGKDD, the premier international organization for data science, and also serves on the Board of Directors of Tech San Diego. He is an advisory board member at Analytics Ventures, Dynam.AI’s founding venture studio.