This article was cross-posted on Medium: https://medium.com/@chankhavu/playing-dixit-with-chatgpt-creative-image-captioning-is-solved-599e11f2da79
Have you ever played Dixit? It’s a popular party game that features a deck of cards with surreal and imaginative illustrations. During each turn, a player takes on the role of “storyteller” and provides a brief, imaginative clue to describe the card they have selected. The other players then select a card from their hand that they believe most closely matches the description, with the aim of deceiving others into thinking it was the storyteller’s card. Finally, players must guess which card the storyteller chose.
This game requires players to use their creativity and imagination. Sounds like something that AI excels at, right? Let’s take a look at one of its plays:
During this round, a human storyteller selected a card depicting a large monster dropping a child into a maze and provided the hint “you can do it.” Upon the reveal of the cards… Wow! What just happened?! The bot accurately guessed the storyteller’s card, while the other players mistook the AI’s card (depicting an old man fishing without success, despite an abundance of fish in the pond) for the storyteller’s. This resulted in a perfect score for the round, showcasing the AI’s complete domination over the humans (for a brief moment in time)!
For those curious, I served as the AI’s human assistant, responsible for physically carrying out actions such as picking up, laying out, and capturing photos of the cards. The AI’s creator became its first subordinate! Oh, the irony…
Okay, it’s quite good at choosing a matching card for the given clue, but can it generate creative and imaginative captions for the cards? Hell yeah it can! Here are a few examples:
Just a year ago, this level of creativity and imaginative thinking was unheard of. It operates just like a genuine human! Even more impressive, you can prompt it with various personalities — a movie buff, an anime nerd, a programmer, a musician, and have it generate themed clues! The potential for enhancements and modifications is limitless.
If you want to see more examples from a real game with real people (and with real pizza 🍕🍕🍕), scroll down to the end of this blog post 😉.
In this article, you may notice that I refer to “ChatGPT” (which sometimes is also called “GPT-3.5”) and “GPT-3” interchangeably. It’s just clickbait 😛 “ChatGPT” has become a popular catchphrase recently, and using the term can attract more attention and boost the article’s ranking (oh yeah, I’ll sell my soul for likes and upvotes). All examples here were generated using the public model text-davinci-003.
Inner workings. Spelled out.
A reader in February 2023 may feel puzzled: large language models like GPT-3 cannot see images (yet), how could it possibly play Dixit? Easy! Just give it access to image captioning and VQA models! To make the whole pipeline more robust, we can even use multiple models at the same time.
Here are the models that I used to build this Dixit AI:
- text-davinci-003 — The most capable OpenAI GPT-3 model available to the public to date, serving as the primary “brain” behind our Dixit AI. The outputs of all other models will be fed into it.
- BLIP-2 — A recently published state-of-the-art image captioning and visual question-answering model.
- GIT large — Microsoft’s previous state-of-the-art image captioning model, published just a month prior.
- BLIP-1 large — Despite being published a year ago, it still produces reliable results for images with easily recognizable objects.
- CLIP — A relatively outdated visual-language embedding model; it plays only a minor role in my pipeline.
- Azure object detection service — I have included this model as it can sometimes be useful for object recognition.
All these models are seamlessly integrated by LangChain, an incredible library for prompting LLMs and augmenting them with other tools — in this case, the visual-language models that enable GPT-3 to “see” the world.
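As a rough sketch of the wiring, each vision model can be exposed to the LLM as a named “tool” that it invokes by name. The stub functions below (and their canned outputs) are purely illustrative stand-ins for the real model calls; the actual project routes these through LangChain’s tool abstractions.

```python
# Illustrative stubs standing in for the real vision models.
def blip2_answer(image, question):
    """Stub for the BLIP-2 visual question-answering call."""
    return "a monster holding a small child above a maze"

def git_caption(image):
    """Stub for the GIT-large captioning call."""
    return "a cartoon of a monster and a child"

# Each model becomes a named tool the LLM can request.
TOOLS = {
    "vqa": lambda img, q: blip2_answer(img, q),
    "caption": lambda img, q: git_caption(img),
}

def call_tool(name, image, question=""):
    """Route a tool request emitted by the LLM to the right vision model."""
    return TOOLS[name](image, question)
```

In the real pipeline, the LLM’s output is parsed for tool requests like `vqa: what is the monster doing?`, and the tool’s answer is appended back into the prompt.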
Generate detailed image description
Here is the complete process for generating detailed image descriptions using GPT-3, BLIP-2, and other Visual-Language models:
- Step 1: Generate Captions with BLIP-2, GIT-large, BLIP-1, and Azure Object Detection Service. Combine all descriptions into one, then label GIT-large and BLIP-1 as “Less Trustworthy” and BLIP-2 as “Highly Trustworthy.” This gives GPT-3 a granular differentiation between reliable and unreliable sources.
- Step 2: Prompt GPT-3 to consider what details in the image it wants to clarify while providing context about the Dixit game to prevent it from asking unhelpful questions.
- Step 3: Let GPT-3 talk with BLIP-2 😁. Observing the conversation between the blind but intelligent GPT-3 and the sighted but naive BLIP-2 as they make sense of the Dixit card is the most entertaining aspect of the pipeline.
- Step 4: Provide all chat history to GPT-3 and have it generate a detailed description of the Dixit card. It may hallucinate additional details, but in the context of the Dixit game, this is not a bug but a desirable feature.
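To make Step 1 concrete, here is a minimal sketch of how the captions might be merged into a single prompt with trust labels. The function name and wording are illustrative; the real prompts are much longer.

```python
def build_caption_prompt(blip2_caption, other_captions):
    """Merge captions from several models into one prompt for GPT-3,
    labeling BLIP-2 as the more trustworthy source."""
    lines = ["Several vision models described the same Dixit card:"]
    lines.append(f"- Highly trustworthy source: {blip2_caption}")
    for model_name, caption in other_captions.items():
        lines.append(f"- Less trustworthy source ({model_name}): {caption}")
    lines.append(
        "Be skeptical of the less trustworthy sources. "
        "Write one detailed description of the card."
    )
    return "\n".join(lines)
```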
Why not just let BLIP-2 generate all the descriptions directly, without the GPT-3 interpretation step? I tried, but the output is usually either too short (forcing a larger min_length, a larger repetition_penalty, beam search, and other generation parameters didn’t help), contains mistakes (especially on the more abstract cards), or is simply too dry (GPT-3 can hallucinate the “mood” of the image, which is quite handy in Dixit). I would love to hear your suggestions.
One might also question the use of older image captioning models when BLIP-2 is light-years ahead of everything else. It’s because I’m implicitly relying on the concept of “consistency”: by prompting GPT-3 to be skeptical of individual models, we ensure that anything all models agree on is more trustworthy. Dixit is a visually challenging game, and even BLIP-2 doesn’t always produce accurate descriptions, so relying on multiple models helps cover its weaknesses.
Generate a creative clue for a card
In the storyteller’s round, the bot has to select a card and describe it with an imaginative clue. To make the AI generate a creative description for a given image, I prompted the GPT-3 model with different personalities: I provided it with the detailed image description generated in the previous step and encouraged it to explicitly spell out what the image reminds it of.
Finally, after a long chain of prompts, I ask GPT-3 to summarize its entire inner-thoughts scratchpad with a single short clue.
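The final summarization step can be sketched as a simple prompt template that combines the persona, the card description, and the scratchpad. The function name and exact wording are mine for illustration; the real prompt is considerably longer.

```python
def clue_prompt(personality, card_description, scratchpad):
    """Assemble the summarization prompt: persona + detailed card
    description + the model's own free-form associations ("scratchpad")."""
    return (
        f"You are {personality} playing Dixit.\n"
        f"Card description: {card_description}\n"
        f"Your thoughts so far: {scratchpad}\n"
        "Summarize your thoughts as a single short, imaginative clue."
    )
```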
Guess the card from a given clue
The most complicated prompt chain of the Dixit AI is in the card-guessing stage. For each card, whether from the bot’s hand or other players’ piles, I generate a comprehensive explanation as to why the card may be related to the given clue using the following process:
- First, I create a detailed, generic image description through the pipeline outlined in the previous section. Since this description does not depend on the clue, it can be generated before the game.
- Next, I prompt GPT-3 to explain how this image may be related to the given clue and permit it to interact with BLIP-2 if it needs to clear up any details. Yes, the two AI models are talking to each other again!
- I then concatenate the detailed explanation from the previous step with the cosine similarity score of the image-clue CLIP embeddings into a tuple.
The inclusion of the cosine similarity score of the CLIP embeddings in the final explanation is grounded on the observation that players may approach the game of Dixit in different ways. Sometimes they give clues based on a logical chain of abstractions, while other times they provide purely visual descriptions. That’s why I thought the bot would need both chained reasoning and visual-language similarity.
Finally, I politely ask GPT-3 to find the most plausible explanation, taking into consideration the Softmax probabilities of CLIP similarity scores with lowered temperature, and choose the best card that fits the clue (the schema below also shows a situation from an actual game):
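The temperature-lowered Softmax mentioned above is just the standard formula applied to the CLIP similarity scores; here is a minimal sketch (the default temperature value is my illustrative choice, not the one used in the project):

```python
import math

def softmax_with_temperature(scores, temperature=0.25):
    """Turn CLIP cosine similarities into probabilities over candidate cards.
    A temperature below 1 sharpens the distribution, so small differences
    in similarity become more pronounced."""
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Sharpening matters here because CLIP cosine similarities for different cards tend to sit in a narrow band, and a plain Softmax would make all candidates look nearly equally likely to GPT-3.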
The prompts in reality are much longer than in the illustrations above. Creating an appropriate prompt that strikes the right balance between being too general and too specific involves a significant amount of experimentation and refinement.
Wrapping everything into a Telegram bot
The only reason I make small projects like this is to have fun with people. I needed some sort of front-end to smoothly interact with this AI through my phone, while the main script is hosted on an Azure instance with an Nvidia A100 GPU. Telegram was a natural choice.
Detect Dixit cards on a photo
Taking a photo of each card during a real game is too slow (I don’t want my friends to wait forever), so I wrote an OpenCV script that finds multiple Dixit cards in a single photo:
This detector was written purely with OpenCV. First, I applied Bilateral Filtering to eliminate potential background textures (e.g., a wooden table) while preserving the edges. Next, I applied Adaptive Thresholding to highlight high-contrast regions and used the Canny detector to identify edges. After several morphology operations to reduce noise and reinforce the more important regions, I employed the Probabilistic Hough Transform and extracted the Contours Hierarchy from the processed image. Finally, for each top-level contour, I fitted a polygonal approximation and checked whether it is a quadrilateral. Where necessary, I also performed perspective transformations on the detected cards. In the age of Deep Learning, traditional Computer Vision still matters!
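One small but fiddly part of the final step is ordering each quadrilateral’s corners consistently before the perspective transform. A pure-Python sketch of that sub-step (the helper name is mine; the real code then feeds the ordered corners to cv2.getPerspectiveTransform):

```python
def order_corners(pts):
    """Order four (x, y) points as top-left, top-right, bottom-right,
    bottom-left -- the order a perspective-warp call expects.
    The top-left corner minimizes x + y, the bottom-right maximizes it;
    the top-right maximizes x - y, the bottom-left minimizes it."""
    by_sum = sorted(pts, key=lambda p: p[0] + p[1])
    top_left, bottom_right = by_sum[0], by_sum[-1]
    by_diff = sorted(pts, key=lambda p: p[0] - p[1])
    bottom_left, top_right = by_diff[0], by_diff[-1]
    return [top_left, top_right, bottom_right, bottom_left]
```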
Yes, I could have made this detection pipeline simpler. But I needed this detector to be absolutely bulletproof and work under any lighting conditions, so I can gather my friends and have a smooth board game night.
Telegram bot interface
The following screenshots were taken during a real Dixit game:
The Telegram bot interface is quite minimalistic, as you can see from the screenshots above, with commands like /add (to add cards to our hand), /status (to display the images in our hand along with their short clues), /clue (to choose a card based on a given clue), and other operations.
Full source code of this Dixit AI: https://github.com/hav4ik/dixit-chatgpt.
Dixit in AI research
To my surprise, there were a few scientific articles published around Dixit:
- [2010.00048] Creative Captioning: An AI Grand Challenge Based on the Dixit Board Game (arxiv.org) — the authors proposed Dixit as a new “AI Grand Challenge”. Not so “grand” now, eh?
- An Internet-assisted Dixit-playing AI | FDG’22 (acm.org) — they built a retrieval-based bot that plays the guessing phase of the game. I doubt that it works as well as they claim though.
- [2206.08349] Know your audience: specializing grounded language models with the game of Dixit (arxiv.org) — a really cool paper by Deepmind, studying the personalized clue aspect of the game.
How the game went
Here is what our game looked like, with me acting as GPT-3’s subordinate, responsible for executing physical actions on its behalf:
AI generating clues
Here are a few more examples of the clues generated by the AI during this game that it played with my friends:
AI choosing a card that matches the clue
At the beginning of a round in which another player is the storyteller, the AI has to choose the card in its hand that best matches the given clue. The examples below were not cherry-picked.
The AI’s success rate is pretty close to human level though — in many rounds, the majority’s guess agreed with the AI’s guess. The limiting factor here is the BLIP-2 model and the whole image description pipeline.
Final thoughts
In the future, large language models are expected to become multi-modal, as evidenced by recent developments such as Flamingo and BLIP-2. The tricks outlined in this article will likely become obsolete by 2024, or even in the latter half of 2023. Nevertheless, it is fascinating to observe the capabilities of GPT-3 despite its limited ability to interact with the outside world through other modalities.
By the way, ChatGPT helped me write most of this blog post as well 😉 Oh, and GitHub Copilot helped me write most of the code.