Overview
This article is a write-up on how I built Text2Art.com in a week. Text2Art is an AI-powered art generator based on VQGAN+CLIP that can generate all kinds of art such as pixel art, drawing, and painting from just text input. The article follows my thought process from experimenting with VQGAN+CLIP, building a simple UI with Gradio, switching to FastAPI to serve the models, and finally to using Firebase as a queue system. Feel free to skip to the parts that you are interested in.
You can try it at text2art.com and here is the source code (feel free to star the repo).
Outline
- Introduction
- How It Works
- Generating Art with VQGAN+CLIP with Code
- Building UI with Gradio
- Serving ML with FastAPI
- Queue System with Firebase
Introduction
Not long ago, generative art and NFTs took the world by storm. This was made possible by OpenAI's significant progress in text-to-image generation. Earlier this year, OpenAI announced DALL-E, a powerful text-to-image generator that works extremely well. To illustrate how well DALL-E works, these are DALL-E generated images for the text prompt “a professional high quality illustration of a giraffe dragon chimera. a giraffe imitating a dragon. a giraffe made of dragon”.
Unfortunately, DALL-E was not released to the public. But luckily, the model behind DALL-E’s magic, CLIP, was published instead. CLIP, or Contrastive Language-Image Pre-training, is a multimodal network that combines text and images. In short, CLIP can score how well an image matches a caption and vice versa. This is extremely useful for steering a generator toward an image that closely matches the text input. In DALL-E, CLIP is used to rank the generated images and output the image with the highest score (most similar to the text prompt).
A few months after the announcement of DALL-E, a new transformer-based image generator called VQGAN (Vector Quantized GAN) was published. Combining VQGAN with CLIP gives results of similar quality to DALL-E. The community has created many amazing artworks since the pre-trained VQGAN model was made public.
Painting of city harbor night view with many ships, Painting of refugee in war. [Image Generated by Author]
I was really amazed at the results and wanted to share this with my friends. But since not many people are willing to dive into the code to generate art, I decided to make Text2Art.com, a website where anyone can simply type a prompt and quickly generate the image they want without seeing any code.
How It Works
So how does VQGAN+CLIP work? In short, the generator produces an image and CLIP measures how well that image matches the text prompt. Then, the generator uses the feedback from the CLIP model to generate a more “accurate” image. This iteration is repeated many times until the CLIP score becomes high enough and the generated image matches the text.
I won’t discuss the inner workings of VQGAN or CLIP here as they are not the focus of this article. But if you want a deeper explanation of VQGAN, CLIP, or DALL-E, you can refer to these amazing resources that I found.
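To make the feedback loop concrete, here is a toy sketch in plain Python. It is not VQGAN or CLIP: `toy_generator` and `toy_score` are hypothetical stand-ins (a tanh map and a cosine similarity), and the gradient is estimated with finite differences instead of backpropagation. It only illustrates the generate, score, update cycle.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def toy_generator(latent):
    """Hypothetical stand-in for the image generator: latent -> 'image' features."""
    return [math.tanh(v) for v in latent]


def toy_score(image, text_embedding):
    """Hypothetical stand-in for CLIP: how well the 'image' matches the 'text'."""
    return cosine(image, text_embedding)


def optimize(text_embedding, steps=300, lr=0.1, eps=1e-4, seed=0):
    """Nudge the latent so the generated 'image' scores higher at every step."""
    rng = random.Random(seed)
    latent = [rng.uniform(-1, 1) for _ in text_embedding]
    history = []  # score measured before each update
    for _ in range(steps):
        score = toy_score(toy_generator(latent), text_embedding)
        history.append(score)
        # Finite-difference estimate of d(score)/d(latent).
        grad = []
        for i in range(len(latent)):
            bumped = list(latent)
            bumped[i] += eps
            bumped_score = toy_score(toy_generator(bumped), text_embedding)
            grad.append((bumped_score - score) / eps)
        # Gradient ascent: move the latent in the direction that raises the score.
        latent = [v + lr * g for v, g in zip(latent, grad)]
    return latent, history


if __name__ == "__main__":
    _, history = optimize([0.5, -0.3, 0.8, 0.1])
    print(f"score: {history[0]:.3f} -> {history[-1]:.3f}")
```

In the real system the latent is the VQGAN code, the score comes from CLIP, and the update uses proper gradients on the GPU, but the shape of the loop is the same.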
- The Illustrated VQGAN by LJ Miranda: Explanation on VQGAN with great illustrations.
- DALL-E Explained by Charlie Snell: Great DALL-E explanations from the basics
- CLIP Paper Explanation Video by Yannic Kilcher: CLIP paper explanation
X + CLIP
VQGAN+CLIP is simply an example of what combining an image generator with CLIP is able to do. However, you can replace VQGAN with any kind of generator and it can still work really well depending on the generator. Many variants of X + CLIP have come up such as StyleCLIP (StyleGAN + CLIP), CLIPDraw (uses vector art generator), BigGAN + CLIP, and many more. There is even AudioCLIP which uses audio instead of images.
Generating Art with VQGAN+CLIP with Code
I’ve been using the code from the clipit repository by dribnet, which reduces generating art with VQGAN+CLIP to just a few lines of code (UPDATE: clipit has been migrated to pixray).
It is recommended to run this on Google Colab as VQGAN+CLIP requires quite a lot of GPU memory. Here is a Colab notebook that you can follow along.
First of all, if you are running on Colab, make sure you change the runtime type to use GPU.
Steps to change Colab runtime type to GPU. [Image by Author]
Next, we need to set up the codebase and its dependencies.
from IPython.utils import io

with io.capture_output() as captured:
    !git clone https://github.com/openai/CLIP
    # !pip install taming-transformers
    !git clone https://github.com/CompVis/taming-transformers.git
    !rm -Rf clipit
    !git clone https://github.com/mfrashad/clipit.git
    !pip install ftfy regex tqdm omegaconf pytorch-lightning
    !pip install kornia
    !pip install imageio-ffmpeg
    !pip install einops
    !pip install torch-optimizer
    !pip install easydict
    !pip install braceexpand
    !pip install git+https://github.com/pvigier/perlin-numpy

    # ClipDraw deps
    !pip install svgwrite
    !pip install svgpathtools
    !pip install cssutils
    !pip install numba
    !pip install torch-tools
    !pip install visdom
    !pip install gradio

    !git clone https://github.com/BachiLi/diffvg
    %cd diffvg
    # !ls
    !git submodule update --init --recursive
    !python setup.py install
    %cd ..

    !mkdir -p steps
    !mkdir -p models
(NOTE: “!” is a special prefix in Google Colab that runs the command in bash instead of Python.)
Once the libraries are installed, we can just import clipit and run these few lines of code to generate art with VQGAN+CLIP. Simply change the text prompt to whatever you want. Additionally, you can give clipit options such as the number of iterations, width, height, generator model, whether to generate a video, and many more. You can read the source code for more information on the available options.
import sys
sys.path.append("clipit")
import clipit

# To reset settings to default
clipit.reset_settings()

# You can use "|" to separate multiple prompts
prompts = "underwater city"

# You can trade off speed for quality: draft, normal, better, best
quality = "normal"

# Aspect ratio: widescreen, square
aspect = "widescreen"

# Add settings
clipit.add_settings(prompts=prompts, quality=quality, aspect=aspect)

# Apply these settings and run
settings = clipit.apply_settings()
clipit.do_init(settings)
clipit.do_run(settings)
Code for generating art with VQGAN+CLIP
Once you run the code, it will generate an image. For each iteration, the generated image will be closer to the text prompt.
Longer Iterations
If you want to run for more iterations, simply set the iterations option as high as you want. For example, to run for 500 iterations:
clipit.add_settings(iterations=500)
Generating Video
Since we need to generate the image for each iteration anyway, we can save these images and create an animation of how the AI generates the image. To do this, simply add make_video=True before applying the settings.
clipit.add_settings(make_video=True)
It will generate the following video.
Customizing Image Size
You can also change the image size with the size=(width, height) option. For example, we will generate a banner image with 800×200 resolution. Note that a higher resolution requires more GPU memory.
clipit.add_settings(size=(800, 200))
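Because GPU memory grows with the number of pixels, it can help to pick a size from a pixel budget rather than by trial and error. The helper below is a hypothetical sketch, not part of clipit: `fit_size` and its parameters are my own names, and rounding down to a multiple of 16 is an assumption based on VQGAN checkpoints that downsample by a factor of 16. Whether clipit adjusts the size on its own is not covered here.

```python
import math


def fit_size(aspect_w, aspect_h, max_pixels, multiple=16):
    """Return a (width, height) with roughly the requested aspect ratio
    that stays within max_pixels, rounded down to a multiple of `multiple`."""
    if min(aspect_w, aspect_h, max_pixels, multiple) <= 0:
        raise ValueError("all arguments must be positive")
    # Scale factor so that (aspect_w * scale) * (aspect_h * scale) == max_pixels.
    scale = math.sqrt(max_pixels / (aspect_w * aspect_h))
    width = int(aspect_w * scale) // multiple * multiple
    height = int(aspect_h * scale) // multiple * multiple
    if width == 0 or height == 0:
        raise ValueError("pixel budget too small for this aspect ratio")
    return width, height


if __name__ == "__main__":
    # Same pixel budget as the 800x200 banner above (160,000 pixels).
    print(fit_size(4, 1, 800 * 200))  # (800, 192)
```

The result could then be passed as the size option, for example `clipit.add_settings(size=fit_size(4, 1, 160000))`.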
Generating Pixel Arts
There is also an option to generate pixel art in clipit. It uses the CLIPDraw renderer behind the scenes, with some engineering to force a pixel art style, such as limiting the color palette and pixelization. To use the pixel art option, simply enable the use_pixeldraw=True option.
clipit.add_settings(use_pixeldraw=True)
Generated image with the prompt “Knight in armor #pixelart” (left) and “A world of chinese fantasy video game #pixelart” (right) [Image by Author]
VQGAN+CLIP Keywords Modifier
Due to the bias in CLIP, adding certain keywords to the prompt may give a certain effect to the generated image. For example, adding “unreal engine” to the text prompt tends to generate a realistic or HD style. Adding certain site names such as “deviantart”, “artstation” or “flickr” usually makes the results more aesthetic. My favorite is the “artstation” keyword, as I find it generates the best art.
Additionally, you can use keywords to condition the art style, for example “pencil sketch”, “low poly”, or even an artist’s name such as “Thomas Kinkade” or “James Gurney”.
To explore more on the effect of various keywords, you can check out the full experiment results by kingdomakrillic which shows 200+ keywords results using the same 4 subjects.
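When experimenting with many keyword combinations, a small helper for assembling prompts keeps things tidy. This is a hypothetical sketch (`build_prompt` and `join_prompts` are not part of clipit). It assumes modifiers are simply appended after the subject, separated by spaces as in the “Knight in armor #pixelart” example, and it uses the “|” separator for multiple prompts mentioned earlier.

```python
def build_prompt(subject, modifiers=(), hashtags=()):
    """Append style keywords and hashtags to a subject, separated by spaces."""
    parts = [subject.strip()]
    parts += [m.strip() for m in modifiers if m.strip()]
    # Accept hashtags with or without the leading '#'.
    parts += ["#" + h.strip().lstrip("#") for h in hashtags if h.strip()]
    return " ".join(parts)


def join_prompts(prompts):
    """Combine several prompts into one string using the '|' separator."""
    return " | ".join(p for p in prompts if p)


if __name__ == "__main__":
    print(build_prompt("underwater city", ["artstation", "unreal engine"]))
    print(build_prompt("Knight in armor", hashtags=["pixelart"]))
    print(join_prompts(["underwater city artstation", "coral reef pencil sketch"]))
```

The resulting string can be passed as the prompts setting.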
Building UI with Gradio
My first plan for deploying the ML model was to use Gradio. Gradio is a Python library that reduces building ML demos to a few lines of code. With Gradio, you can build a demo in less than 10 minutes. Additionally, you can run Gradio in Colab and it will generate a shareable link on a Gradio domain. You can instantly share this link with your friends or the public to let them try out your demo. Gradio still has some limitations, but I find it the most suitable library when you just want to demonstrate a single function.
So here is the code that I wrote to build a simple UI for the Text2Art app. I think the code is quite self-explanatory, but if you need more explanation, you can read the Gradio documentation.
import gradio as gr
import torch
import clipit

# Define the main function
def generate(prompt, quality, style, aspect):
    torch.cuda.empty_cache()
    clipit.reset_settings()

    use_pixeldraw = (style == 'pixel art')
    use_clipdraw = (style == 'painting')
    clipit.add_settings(prompts=prompt,
                        aspect=aspect,
                        quality=quality,
                        use_pixeldraw=use_pixeldraw,
                        use_clipdraw=use_clipdraw,
                        make_video=True)

    settings = clipit.apply_settings()
    clipit.do_init(settings)
    clipit.do_run(settings)

    return 'output.png', 'output.mp4'

# Create the UI
prompt = gr.inputs.Textbox(default="Underwater city", label="Text Prompt")
quality = gr.inputs.Radio(choices=['draft', 'normal', 'better'], label="Quality")
style = gr.inputs.Radio(choices=['image', 'painting', 'pixel art'], label="Type")
aspect = gr.inputs.Radio(choices=['square', 'widescreen', 'portrait'], label="Size")

# Launch the demo
iface = gr.Interface(generate,
                     inputs=[prompt, quality, style, aspect],
                     outputs=['image', 'video'],
                     enable_queue=True,
                     live=False)
iface.launch(debug=True)
Code to build the Gradio UI
Once you run this in Google Colab or locally, it will generate a shareable link that makes your demo publicly accessible. I find this extremely useful as I don’t need to set up a tunneling tool like ngrok on my own to share my demo. Additionally, Gradio also offers a hosting service where you can permanently host your demo for only $7/month.
However, Gradio only works well for demoing a single function. Creating a custom site with additional features like gallery, login, or even just custom CSS is fairly limited or not possible at all.
One quick solution I could think of was to create my demo site separately from the Gradio UI and then embed the Gradio UI in the site using an iframe element. I initially tried this method but then realized one important drawback: I could not customize any part that needs to interact with the ML app itself. For example, things such as input validation or a custom progress bar are not possible with an iframe. This is when I decided to build an API instead.
Serving ML Model with FastAPI
I’ve been using FastAPI instead of Flask to quickly build my API. The main reason is that I find FastAPI faster to write (less code), and it also auto-generates documentation (using Swagger UI) that lets me test the API with a basic UI. Additionally, FastAPI supports asynchronous functions and is said to be faster than Flask.
Here is the code I wrote to serve my ML function as a FastAPI server.
import clipit
import torch
from fastapi import FastAPI, File, UploadFile, Form, BackgroundTasks
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import FileResponse

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=['*'],
    allow_credentials=True,
    allow_methods=['*'],
    allow_headers=['*'],
)

@app.get('/')
async def root():
    return {'hello': 'world'}

@app.post("/generate")
async def generate(
        seed: int = Form(None),
        iterations: int = Form(None),
        prompts: str = Form("Underwater City"),
        quality: str = Form("draft"),
        aspect: str = Form("square"),
        scale: float = Form(2.5),
        style: str = Form('image'),
        make_video: bool = Form(False),
    ):
    torch.cuda.empty_cache()
    clipit.reset_settings()

    use_pixeldraw = (style == 'Pixel Art')
    use_clipdraw = (style == 'Painting')
    clipit.add_settings(prompts=prompts,
                        seed=seed,
                        iterations=iterations,
                        aspect=aspect,
                        quality=quality,
                        scale=scale,
                        use_pixeldraw=use_pixeldraw,
                        use_clipdraw=use_clipdraw,
                        make_video=make_video)

    settings = clipit.apply_settings()
    clipit.do_init(settings)
    clipit.do_run(settings)

    return FileResponse('output.png', media_type="image/png")
Code for API server
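One thing an API makes possible, and an iframe does not, is input validation before any GPU time is spent. The sketch below is a hypothetical helper, not part of the Text2Art code: `normalize_request` and the constant names are mine. The allowed values are taken from the options shown earlier in this article, and style matching is made case-insensitive because the Gradio code uses 'pixel art' while the API code compares against 'Pixel Art'.

```python
ALLOWED_QUALITY = {"draft", "normal", "better", "best"}
ALLOWED_ASPECT = {"widescreen", "square", "portrait"}
# Maps a style name to the two renderer flags used in the article's code.
STYLE_FLAGS = {
    "image": {"use_pixeldraw": False, "use_clipdraw": False},
    "painting": {"use_pixeldraw": False, "use_clipdraw": True},
    "pixel art": {"use_pixeldraw": True, "use_clipdraw": False},
}


def normalize_request(prompts, quality="draft", aspect="square", style="image"):
    """Validate user input and return keyword arguments for the generator settings."""
    prompts = prompts.strip()
    if not prompts:
        raise ValueError("prompts must not be empty")
    quality = quality.strip().lower()
    if quality not in ALLOWED_QUALITY:
        raise ValueError(f"unknown quality: {quality!r}")
    aspect = aspect.strip().lower()
    if aspect not in ALLOWED_ASPECT:
        raise ValueError(f"unknown aspect: {aspect!r}")
    style = style.strip().lower()
    if style not in STYLE_FLAGS:
        raise ValueError(f"unknown style: {style!r}")
    return {"prompts": prompts, "quality": quality, "aspect": aspect, **STYLE_FLAGS[style]}


if __name__ == "__main__":
    print(normalize_request("Underwater City", style="Pixel Art"))
```

Inside the endpoint, a ValueError could be turned into an HTTP 4xx response, and the returned dictionary could be unpacked into `clipit.add_settings(**kwargs)`.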
Once we have defined the server, we can run it using uvicorn. Additionally, because Google Colab only allows access to its servers through the Colab interface, we have to use ngrok to expose the FastAPI server to the public.
import nest_asyncio
from pyngrok import ngrok
import uvicorn

ngrok_tunnel = ngrok.connect(8000)
print('Public URL:', ngrok_tunnel.public_url)
print('Doc URL:', ngrok_tunnel.public_url + '/docs')
nest_asyncio.apply()
uvicorn.run(app, port=8000)
Code to run and expose the server
Once the server is running, we can head to the Swagger UI (by appending /docs to the generated ngrok URL) and test out the API.
While testing the API, I realized that inference can take about 3–20 minutes depending on the quality and number of iterations. Even 3 minutes is already very long for an HTTP request, and users may not want to wait that long on the site. Because of the long inference time, I decided that running the inference as a background task and emailing the user once the result is ready would be more suitable.
Now that we have decided on the plan, we will first write the function to send the email. I initially used the SendGrid email API for this, but after running out of the free usage quota (100 emails/day), I switched to the Mailgun API since it is part of the GitHub Student Developer Pack and allows 20,000 emails/month for students.
So here is the code to send an email with an image attachment using Mailgun API.
import requests

def email_results_mailgun(email, prompt):
    return requests.post(
        "https://api.mailgun.net/v3/text2art.com/messages",
        auth=("api", "YOUR_MAILGUN_API_KEY"),
        files=[("attachment", ("output.png", open("output.png", "rb").read())),
               ("attachment", ("output.mp4", open("output.mp4", "rb").read()))],
        data={"from": "Text2Art <YOUR_EMAIL>",
              "to": email,
              "subject": "Your Artwork is ready!",
              "text": f'Your generated arts using the prompt "{prompt}".',
              "html": f'Your generated arts using the prompt "{prompt}".'})
Code for sending an email with Mailgun API
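If you adapt this for your own site, you may prefer to keep the API key out of the source code. Here is a minimal sketch that reads it from an environment variable and builds the request pieces separately from sending them. The function name `build_mailgun_request` and the variable name `MAILGUN_API_KEY` are assumptions for illustration. The URL pattern and form fields are the same as in the code above.

```python
import os


def build_mailgun_request(email, prompt, domain, sender, env=os.environ):
    """Build the URL, auth tuple, and form data for a Mailgun message.

    The key is read from the (assumed) MAILGUN_API_KEY environment variable
    so that it never has to be hard-coded in the notebook.
    """
    api_key = env.get("MAILGUN_API_KEY")
    if not api_key:
        raise RuntimeError("MAILGUN_API_KEY is not set")
    message = f'Your generated arts using the prompt "{prompt}".'
    return {
        "url": f"https://api.mailgun.net/v3/{domain}/messages",
        "auth": ("api", api_key),
        "data": {
            "from": sender,
            "to": email,
            "subject": "Your Artwork is ready!",
            "text": message,
            "html": message,
        },
    }


if __name__ == "__main__":
    request = build_mailgun_request(
        "user@example.com", "underwater city",
        domain="text2art.com", sender="Text2Art <YOUR_EMAIL>",
        env={"MAILGUN_API_KEY": "key-for-demo"},
    )
    print(request["url"])
```

The result can be passed on as `requests.post(request["url"], auth=request["auth"], data=request["data"], files=...)`, with the attachments opened as in the original function.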
Next, we will modify our server code to use background tasks in FastAPI and send the result through email in the background.
#@title API Functions
import clipit
import torch
from fastapi import FastAPI, File, UploadFile, Form, BackgroundTasks
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import FileResponse

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=['*'],
    allow_credentials=True,
    allow_methods=['*'],
    allow_headers=['*'],
)

# define function to be run as background tasks
def generate(email, settings):
    clipit.do_init(settings)
    clipit.do_run(settings)

    prompt = " | ".join(settings.prompts)

    email_results_mailgun(email, prompt)

@app.get('/')
async def root():
    return {'hello': 'world'}

@app.post("/generate")
async def add_task(
        email: str,
        background_tasks: BackgroundTasks,
        seed: int = Form(None),
        iterations: int = Form(None),
        prompts: str = Form("Underwater City"),
        quality: str = Form("draft"),
        aspect: str = Form("square"),
        scale: float = Form(2.5),
        style: str = Form('image'),
        make_video: bool = Form(False),
    ):
    torch.cuda.empty_cache()
    clipit.reset_settings()

    use_pixeldraw = (style == 'Pixel Art')
    use_clipdraw = (style == 'Painting')
    clipit.add_settings(prompts=prompts,
                        seed=seed,
                        iterations=iterations,
                        aspect=aspect,
                        quality=quality,
                        scale=scale,
                        use_pixeldraw=use_pixeldraw,
                        use_clipdraw=use_clipdraw,
                        make_video=make_video)

    settings = clipit.apply_settings()

    # Run function as background task
    background_tasks.add_task(generate, email, settings)

    return {"message": "Task is processed in the background"}
With the code above, the server will quickly reply to the request with the “Task is processed in the background” message instead of waiting for the generation process to finish and replying with the image.
Once the process is finished, the server will send the result by emailing the user.
Queue System with Firebase
Now that everything seems to be working, I built the front end and shared the site with my friends. However, I found that there was a concurrency problem when testing it out with multiple users.
When a second user makes a request to the server while the first task is still processing, somehow the second task will terminate the current process instead of creating a parallel process or queueing. I was not sure what caused this, maybe it was the use of global variables in the clipit code or maybe not. I did not spend too much time debugging it as I realized that I need to implement a message queue system instead.
After a few Google searches on message queue systems, I found that most people recommend RabbitMQ or Redis. However, I was not sure whether RabbitMQ or Redis could be installed on Google Colab, as they seem to require sudo permission. In the end, I decided to use Google Firebase as a queue system instead, as I wanted to finish the project ASAP and Firebase is the one I’m most familiar with.
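For comparison, the core idea of queueing (accept tasks at any time, but run them strictly one at a time) can be sketched with nothing but the Python standard library. This is not what Text2Art uses: it is a single-process illustration with hypothetical function names, and tasks held this way would be lost whenever the Colab runtime restarts, which is one reason to keep the queue in an external service.

```python
import queue
import threading


def run_worker(tasks, handle, results):
    """Take tasks off the queue one at a time until the None sentinel arrives."""
    while True:
        task = tasks.get()
        if task is None:  # sentinel: no more work
            tasks.task_done()
            break
        try:
            results.append(handle(task))
        finally:
            tasks.task_done()


def process_in_order(prompts, handle):
    """Queue every prompt, then let a single worker process them sequentially."""
    tasks = queue.Queue()
    results = []
    worker = threading.Thread(target=run_worker, args=(tasks, handle, results))
    worker.start()
    for prompt in prompts:
        tasks.put(prompt)  # returns immediately, like an API call that only enqueues
    tasks.put(None)
    worker.join()
    return results


if __name__ == "__main__":
    # str.upper stands in for the slow image generation step.
    print(process_in_order(["underwater city", "city harbor"], str.upper))
```

Because only one worker thread ever calls `handle`, a second request can never interrupt the first.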
Basically, when the user tries to generate art in the frontend, it adds an entry to a collection named queue describing the task (prompt, image type, size, etc.). Meanwhile, a script running on Google Colab continuously listens for new entries in the queue collection and processes the tasks one by one.
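The important property of this design is that claiming a task means "take the oldest entry and delete it" as one atomic step, so no task is processed twice. The sketch below models that behavior in memory. It is a hypothetical stand-in for the Firestore collection (`InMemoryQueue` is not a Firebase API), with a lock playing the role of the transaction. The status codes 0 (task claimed) and 2 (queue empty) mirror the ones used in the worker code.

```python
import threading


class InMemoryQueue:
    """Toy model of the queue collection: documents keyed by id, claimed oldest-first."""

    def __init__(self):
        self._docs = {}
        self._lock = threading.Lock()  # plays the role of the Firestore transaction
        self._next_id = 0

    def add(self, data, created_at):
        """Add a task document and return its id."""
        with self._lock:
            doc_id = f"doc{self._next_id}"
            self._next_id += 1
            self._docs[doc_id] = {**data, "created_at": created_at}
            return doc_id

    def claim_oldest(self):
        """Atomically remove and return the oldest task, or status 2 if empty."""
        with self._lock:
            if not self._docs:
                return {"status": 2}
            doc_id = min(self._docs, key=lambda k: self._docs[k]["created_at"])
            task = self._docs.pop(doc_id)
            return {**task, "id": doc_id, "status": 0}


if __name__ == "__main__":
    q = InMemoryQueue()
    q.add({"prompt": "city harbor", "email": "b@example.com"}, created_at=2)
    q.add({"prompt": "underwater city", "email": "a@example.com"}, created_at=1)
    print(q.claim_oldest()["prompt"])  # underwater city
    print(q.claim_oldest()["prompt"])  # city harbor
    print(q.claim_oldest())            # {'status': 2}
```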
import torch
import clipit
import time
from datetime import datetime
import firebase_admin
from firebase_admin import credentials, firestore, storage

if not firebase_admin._apps:
    cred = credentials.Certificate("YOUR_CREDENTIAL_FILE")
    firebase_admin.initialize_app(cred, {
        'storageBucket': 'YOUR_BUCKET_URL'
    })

db = firestore.client()
bucket = storage.bucket()

# NOTE: the original snippet used `seed` without defining it. Here it is an
# optional argument defaulting to None, as in the API code above (assumption).
def generate(doc_id, prompt, quality, style, aspect, email, seed=None):
    torch.cuda.empty_cache()
    clipit.reset_settings()

    use_pixeldraw = (style == 'pixel art')
    use_clipdraw = (style == 'painting')
    clipit.add_settings(prompts=prompt,
                        seed=seed,
                        aspect=aspect,
                        quality=quality,
                        use_pixeldraw=use_pixeldraw,
                        use_clipdraw=use_clipdraw,
                        make_video=True)

    settings = clipit.apply_settings()
    clipit.do_init(settings)
    clipit.do_run(settings)

    # Record the finished task, then notify the user
    data = {
        "seed": seed,
        "prompt": prompt,
        "quality": quality,
        "aspect": aspect,
        "type": style,
        "user": email,
        "created_at": datetime.now()
    }
    db.collection('generated_images').document(doc_id).set(data)
    email_results_mailgun(email, prompt)

transaction = db.transaction()

@firestore.transactional
def claim_task(transaction, queue_objects_ref):
    # query firestore
    queue_objects = queue_objects_ref.stream(transaction=transaction)

    # pull the document from the iterable
    next_item = None
    for doc in queue_objects:
        next_item = doc

    # if queue is empty return status code of 2
    if not next_item:
        return {"status": 2}

    # get information from the document
    next_item_data = next_item.to_dict()
    next_item_data["status"] = 0
    next_item_data['id'] = next_item.id

    # delete the document and return the information
    transaction.delete(next_item.reference)
    return next_item_data

# initialize query
queue_objects_ref = (
    db.collection("queue")
    .order_by("created_at", direction="ASCENDING")
    .limit(1)
)

transaction_attempts = 0
while True:
    try:
        # apply transaction
        next_item_data = claim_task(transaction, queue_objects_ref)
        if next_item_data['status'] == 0:
            generate(next_item_data['id'],
                     next_item_data['prompt'],
                     next_item_data['quality'],
                     next_item_data['type'],
                     next_item_data['aspect'],
                     next_item_data['email'])
            print(f"Generated {next_item_data['prompt']} for {next_item_data['email']}")

    except Exception as e:
        print(f"Could not apply transaction. Error: {e}")
        time.sleep(5)
        transaction_attempts += 1
        if transaction_attempts > 20:
            db.collection("errors").add({
                "exception": f"Could not apply transaction. Error: {e}",
                "time": str(datetime.now())
            })
            exit()
Backend code that processes the task and listens to the queue continuously
In the front end, we only have to add a new task in the queue. But make sure you have done a proper Firebase setup on your front end.
db.collection("queue").add({
  prompt: prompt,
  email: email,
  quality: quality,
  type: type,
  aspect: aspect,
  created_at: firebase.firestore.FieldValue.serverTimestamp(),
})
And it’s done! Now, when a user tries to generate art in the frontend, it will add a new task in the queue. The worker script in the Colab server will then process the tasks in the queue one by one.
You can check out the GitHub repo to see the full code (feel free to star the repo).
This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.