The Easiest and Fastest Way to Use Hugging Face's SOTA Model
SOTA is an abbreviation for State-Of-The-Art and refers to today's leading AI models.
This post shows how to integrate and use Hugging Face's SOTA models.
Using a practical example, we will import and register a SOTA model from Hugging Face, create an AI function based on user needs, and then submit natural language queries via chat to see how that function responds.
For this example, we will use the CLIP model, one of the many SOTA models available. CLIP stands for Contrastive Language-Image Pre-training and was developed by OpenAI. The neural network learns to relate natural language and images, enabling machines to connect text with visual content. Since 2022, CLIP has been a vital tool behind the rise of AI-generated images, and OpenAI's DALL-E pipeline also relies on CLIP.
Searching for CLIP in the 'Models' tab on Hugging Face returns over 2,900 registered models (as of September 20, 2024), as shown in the image below.
We will use clip-vit-base-patch32, a zero-shot image classification model that embeds text and images in the same vector space, and explore how it can be applied in practice.
Below is a video demonstrating how to import a SOTA model from Hugging Face into ThanoSQL, create a custom function, and use that function to search for images. (A second video is included later in this post.)
Creating an AI Function with a SOTA Model
We aim to create an AI function that can find images similar to a given input image or retrieve an image based on the user's input.
We have named this function embed_clip, and its implementation is shown below.
-- Drop any previous definition, then (re)create the embedding function.
DROP FUNCTION IF EXISTS embed_clip(text);

CREATE OR REPLACE FUNCTION embed_clip(text_or_image TEXT)
RETURNS VECTOR AS $$
BEGIN
    -- Delegate to ThanoSQL's embedding engine, using the CLIP model from
    -- Hugging Face; the input can be either a text string or an image URL.
    RETURN thanosql.embed(
        engine     := 'huggingface',
        input      := text_or_image,
        model      := 'openai/clip-vit-base-patch32',
        model_args := '{"device_map": "cpu"}'
    );
END;
$$ LANGUAGE plpgsql;
You can declare and implement the function in ThanoSQL's Query Manager (QM). By setting the engine to 'huggingface' and the model to openai/clip-vit-base-patch32, you can immediately start using the CLIP-based SOTA model from Hugging Face.
Pressing the triangular run button at the top of the Query Editor executes the statement, and the results appear in the query results window below.
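Before using the function on a table, you can sanity-check it directly in the Query Editor. The two queries below are a minimal sketch: the desert image URL is the one used later in this post, and the exact way QM displays the returned vectors may differ.

-- Quick sanity check: embed a text string and an image URL with the new function.
-- Because CLIP maps text and images into the same space, both calls should
-- return vectors of the same dimensionality.
SELECT embed_clip('a photo of a desert') AS text_embedding;
SELECT embed_clip('https://images.unsplash.com/photo-1464822759023-fed622ff2c3b') AS image_embedding;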
Creating a Sample Table for the AI Function
Next, we will create a sample table to use with this function.
We need a table where images are stored. If such a table does not exist, you can create a new one, import an existing one from a database, or upload a CSV file.
For this example, we downloaded ten images from the web and created a table named unsplash_meta_sample. The schema of that table is shown in the image below.
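For reference, the sample table could be created with a statement roughly like the one below. The column names are the ones used in the query later in this post; the data types are assumptions rather than the exact DDL shown in the image.

-- Sketch of the sample table (column types are assumptions).
CREATE TABLE my_data.unsplash_meta_sample (
    photo_id           TEXT PRIMARY KEY,
    photo_image_url    TEXT,       -- public URL of the image
    photo_submitted_at TIMESTAMP,  -- when the photo was submitted to Unsplash
    photo_description  TEXT
);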
To classify and search unstructured data such as images, the data must first be embedded. We will embed the sample table using the embed_clip function created earlier. The query below applies embed_clip to the sample table and stores the results in a table named test_vector.
DROP TABLE IF EXISTS test_vector;

CREATE TABLE test_vector AS
SELECT photo_id,
       photo_image_url,
       photo_submitted_at,
       photo_description,
       embed_clip(photo_image_url) AS embedding  -- embed each image via its URL
FROM my_data.unsplash_meta_sample;

SELECT * FROM test_vector;
Once the query is successfully executed, you can view the embedded results for each image in the Query Result window, as shown below.
In the Data Viewer, the embedded images will appear as follows.
Searching for Images Using an Image
Next, find an image on the web. We will use a desert photo from Unsplash; copy the image's address (URL).
Return to the ThanoSQL workspace, select AI Chat from the menu on the left, and choose the test_vector table created earlier.
Now everything is ready. Let's use the chat to find three photos similar to the desert image from Unsplash.
In the chat input box, enter the following:
"Embed the image at https://images.unsplash.com/photo-1464822759023-fed622ff2c3b using the embed_clip function and find the three most similar photos."
This request passes the image URL to the custom embed_clip function and runs a query that finds the three most similar photos in the sample table.
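Conceptually, the generated query is similar to the sketch below. The <-> distance operator is an assumption based on pgvector-style vector search; the SQL that AI Chat actually produces may differ.

-- Sketch of the kind of similarity search the chat generates (the <-> distance
-- operator is a pgvector-style assumption; the actual generated SQL may differ).
SELECT photo_id, photo_image_url, photo_description
FROM test_vector
ORDER BY embedding <-> embed_clip('https://images.unsplash.com/photo-1464822759023-fed622ff2c3b')
LIMIT 3;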
After entering the chat input and pressing Enter, the query will be executed, and the results will appear as shown below.
The executed query can also be viewed in the Query Editor, as shown in the image below.
You can check the retrieved images through the Data Viewer.
Searching for Images Using Text
Earlier, we searched for images using an image as input. Now, let's search for images using text.
The video below demonstrates how to search images using text.
For example, to find a photo of a person surfing in the sea:
In the ThanoSQL workspace, select AI Chat from the menu on the left and choose the test_vector table created earlier.
In the chat input box, enter the following:
"Embed 'a person surfing in the sea' using the embed_clip function and find the most similar photo."
This request embeds the user's text input with embed_clip and runs a query that finds the most similar image in the sample table.
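The generated query mirrors the image-based example, roughly like the sketch below (again assuming a pgvector-style distance operator; the actual generated SQL may differ).

SELECT photo_id, photo_image_url, photo_description
FROM test_vector
ORDER BY embedding <-> embed_clip('a person surfing in the sea')
LIMIT 1;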
After entering the chat input and pressing Enter, the query will be executed, and the results will appear as shown below.
The executed query can also be viewed in the Query Editor, as shown in the image below.
You can directly view the retrieved images by clicking the link in the response below.
As with the previous example, you can view the images through the Data Viewer.
Conclusion
In this post, we demonstrated how to import a SOTA model from Hugging Face, create a custom function, and use it to search for images in two different ways: by image and by text. As shown, ThanoSQL allows anyone to quickly apply AI models from Hugging Face or other external sources and run them immediately.
Next time, we will explore a use case involving a machine learning model other than an image classification model.