In this post we will define visual search and explore some of the key supporting technologies. For a practical demonstration, check out visualsearch.aragoai.com.
What is visual search?
Visual Search enables any image to be transformed into a searchable asset. This means you can perform a search by simply describing what you want to see in an image, for example "a starry night over the mountains", and the most relevant results will be returned.
This means no need for keyword matching, no need for filter systems and no need to invest in unscalable search systems that require significant manual effort to maintain.
Check out the following video where we demonstrate Visual Search on visualsearch.aragoai.com.
How does it work?
In this section we will give a high-level breakdown of the key components. Subscribe for more technical walkthroughs in the future.
Visual Search has 4 stages:
- Understanding images
- Storing understanding
- Facilitating search
- Generating results
This modular design makes understanding Visual Search simple. Furthermore, because each stage is independent and interacts with the others through well-defined interfaces, the team at AragoAi can keep its offering at the cutting edge. As research progresses in any of the stages, AragoAi can implement the latest advancements whilst the rest of the system operates unchanged.
Stage 1 - Understanding images
This is the stage with the most impact on final results. Whilst the implementation details can become quite technical, the objective is extremely simple: we wish to develop an automated process for taking an image and answering the question - What does this image describe?
Historically this was a very challenging problem; however, recent developments in deep learning research, particularly the scaling of Transformer architectures, have made the objective manageable.
The dataset required for training such a model looks like a large collection of image-text pairs, where each text contains the answer to our question, What does the image describe? Most researchers then apply a variation of a sequence-to-sequence Transformer and define a relevant loss function. In the case of a multi-task architecture, which has empirically been shown to improve results on individual tasks, the loss function may look like the following:
\(L = -\sum_{i=1}^{|y|} \log P_{\theta}(y_i | y_{<i},x,s) \)
This is the cross-entropy loss, with model parameters \(\theta\), an output sequence \(y\), an input \(x\) and an instruction \(s\).
For visual understanding we can define the input as a representation of the image, the instruction as "What does the image describe?" and the output as the answer to our question.
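To make the objective concrete, here is a minimal PyTorch-style sketch of that loss. It is an illustration only; the tensor shapes, padding convention and the model producing the logits are assumptions, not AragoAi's implementation.

```python
import torch.nn.functional as F

def captioning_loss(logits, target_ids, pad_id=0):
    """Cross-entropy over the output sequence: L = -sum_i log P(y_i | y_<i, x, s).

    logits:     (batch, seq_len, vocab) scores from a seq2seq model that was fed
                the image representation x, the instruction s and the tokens y_<i.
    target_ids: (batch, seq_len) ground-truth token ids y_i.
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (batch*seq_len, vocab)
        target_ids.reshape(-1),               # flatten to (batch*seq_len,)
        ignore_index=pad_id,                  # padding tokens contribute no loss
    )
```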
Once an adequate model has been trained on this task, we can then apply batch inference to a dataset of choice. At the end of this stage a company has its images and, for each one, a piece of text describing what it contains.
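As a rough illustration of what that batch-inference step can look like, the sketch below runs an off-the-shelf image-captioning model from Hugging Face over a folder of images. The model, folder layout and batch size are assumptions made for the example; AragoAi's production model is not the one shown here.

```python
from pathlib import Path
from PIL import Image
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration

# Off-the-shelf captioning model used purely as a stand-in for a trained model.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

paths = sorted(Path("images").glob("*.jpg"))
descriptions = {}
for i in range(0, len(paths), 16):  # simple fixed-size batching
    batch_paths = paths[i:i + 16]
    images = [Image.open(p).convert("RGB") for p in batch_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        generated = model.generate(**inputs, max_new_tokens=30)
    for path, token_ids in zip(batch_paths, generated):
        descriptions[path.name] = processor.decode(token_ids, skip_special_tokens=True)
```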
Stage 2 - Storing understanding
This stage takes the textual representation learnt in stage 1 and translates that understanding into a semantically rich numerical representation, which is then stored in an efficient data structure.
To do so we again turn to Transformer-based architectures; however, we now define a different objective: we require the model to learn to place similar pairs of text close together in a numerical space and dissimilar pairs further apart.
In the literature this is commonly done by defining a contrastive loss which is then minimized. A bare-bones mental model for how you could approach implementing this in a self-supervised setting is as follows (a code sketch follows after the list):
- Take a piece of text which you define as an anchor
- Apply augmentation to the anchor which you define as a positive match
- Randomly select another piece of text from your dataset as a negative match.
- Define two components of the loss for positive and negative matches respectively
- If you sample a positive match, take the distance in the numerical space between it and the anchor as your loss
- If you sample a negative match, add a penalty of the margin minus the distance, but only if that distance falls below the defined margin
With this approach and an adequate architecture the model will learn to reduce the distance between positive matches and expand the distance between negative matches. Note the process I've described above has limitations. For example the random sampling of negatives could introduce false negatives - texts treated as dissimilar that are in fact similar to the anchor. But it's a good intro to the topic.
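Here is a minimal sketch of that margin-based contrastive loss in PyTorch, assuming one (anchor, other) pair per row and Euclidean distance. It follows the mental model above rather than AragoAi's exact formulation.

```python
import torch
import torch.nn.functional as F

def pair_contrastive_loss(anchor_emb, other_emb, is_positive, margin=1.0):
    """anchor_emb, other_emb: (batch, dim) text embeddings from the encoder.
    is_positive: (batch,) 1.0 for augmented (positive) pairs, 0.0 for random negatives.
    """
    dist = F.pairwise_distance(anchor_emb, other_emb)                 # Euclidean distance
    positive_term = is_positive * dist.pow(2)                         # pull positives together
    negative_term = (1 - is_positive) * F.relu(margin - dist).pow(2)  # push negatives out to the margin
    return (positive_term + negative_term).mean()

# Example usage with random embeddings:
# anchor, other = torch.randn(8, 256), torch.randn(8, 256)
# labels = torch.tensor([1., 1., 0., 0., 1., 0., 1., 0.])
# loss = pair_contrastive_loss(anchor, other, labels)
```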
With the model trained, it can then be run on all outputs from stage 1 to generate numerical representations. These are then stored in a data structure that facilitates fast search and retrieval on machines with low RAM and compute resources. Enabling real-time search this way involves solving a challenging optimization problem; to do so we leverage current state-of-the-art solutions.
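One concrete way such a data structure can be realised is an approximate-nearest-neighbour index. The sketch below uses FAISS; the library, file names and index type are assumptions for illustration, since AragoAi does not name its actual solution.

```python
import numpy as np
import faiss

# (num_images, dim) float32 embeddings produced in stage 2; the file name is illustrative.
embeddings = np.load("image_description_embeddings.npy").astype("float32")
faiss.normalize_L2(embeddings)                   # unit length, so inner product == cosine similarity

index = faiss.IndexFlatIP(embeddings.shape[1])   # exact inner-product search; swap in e.g.
index.add(embeddings)                            # IndexIVFPQ to cut RAM on very large datasets
faiss.write_index(index, "visual_search.index")  # persist the index for the serving machines
```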
Stage 3 - Facilitating search
This stage involves transforming a user's search query into a meaningful numeric representation so that it can be used to find the most relevant results. By doing this conversion we remove any need for a company to manually categorize its images or maintain keywords and filters.
This task is very similar to what was required in stage 2. In fact, we use the same model to generate a meaningful representation of a user's search query.
At the end of this stage a company has their images, the data structure storing the understanding and a numeric representation of a user's search query.
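For illustration, the sketch below encodes a query with an off-the-shelf sentence-embedding model. In practice this would be the same encoder trained in stage 2, so the library and model name here are assumptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Stand-in text encoder; in production this is the model from stage 2.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

query = "a starry night over the mountains"
query_vector = encoder.encode(query).astype("float32")
query_vector /= np.linalg.norm(query_vector)  # unit-normalize to match the stored index
```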
Stage 4 - Generating results
The representation of the search query is now compared to the representation of our understanding of the images. We define a distance metric that scores how similar the search query is to each image; this score is labeled relevance. There are several possible choices; the metric AragoAi currently employs for Visual Search is bounded so that the highest possible relevance is 1.00, while a score of 0.00 means the image is not at all relevant to the query.
Because of the choices made in the earlier stages this score can be calculated in real time across all the images, enabling retrieval of the most relevant ones. At the end of this stage Visual Search is complete and we have the most relevant images for a user's query!
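Putting the pieces together, the sketch below searches the index from the stage 2 sketch with the query vector from the stage 3 sketch and maps cosine similarity into a 0.00-1.00 relevance score. The mapping shown is one simple choice for illustration, not necessarily AragoAi's exact metric.

```python
import faiss

# Load the index built in stage 2 and search it with the stage 3 query vector.
index = faiss.read_index("visual_search.index")
scores, image_ids = index.search(query_vector.reshape(1, -1), 10)  # top 10 matches

# With unit-normalized vectors the inner product is cosine similarity in [-1, 1];
# shift and scale it into a 0.00-1.00 relevance score.
for image_id, score in zip(image_ids[0], scores[0]):
    relevance = (score + 1.0) / 2.0
    print(f"image {image_id}: relevance {relevance:.2f}")
```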
Deploying Visual Search to production
Consider subscribing to learn more about how AragoAi deploys Visual Search...