Natural Language Queries for Magic The Gathering Cards Using BERT + FAISS

Posted on December 22, 2023

Overview

A coworker has been working on a project to assist Magic The Gathering (MTG) players with building and enhancing their decks. Full disclosure: I know nothing about MTG, but I do have a little bit of experience with Natural Language Query (NLQ) processing thanks to a capstone project I worked on in 2020 while completing my computer science degree at Kent State University. My coworker was struggling with some of the AI aspects of his project, so I decided to create a simple example demonstrating how to use the BERT model and Facebook's FAISS to build an efficient similarity search that helps MTG players quickly find cards that suit their needs.

We're using BERT for our language processing to create embeddings of our MTG card descriptions (these are referred to as oracle texts) and of our user's Natural Language Query. We'll also use FAISS (Facebook AI Similarity Search) to efficiently find the card descriptions most similar to the user's NLQ.

Prerequisites

Before diving into the code, ensure you have the following prerequisites:

  • Python 3.x
  • PyTorch
  • Hugging Face's Transformers library
  • FAISS (Facebook AI Similarity Search) library
  • Numpy

You can install the necessary libraries using pip:
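Something along these lines should work (this assumes the CPU build of FAISS):

```bash
pip install torch transformers faiss-cpu numpy
```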

Imports and Data Preparation

First, we need to import the packages we installed and set up our database of MTG card oracle texts. For this example, we're just going to use an array of description strings.
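Here's a minimal sketch of that setup. The oracle texts below are just a handful of made-up, MTG-flavored strings for illustration; swap in your real card database.

```python
import torch
import faiss
import numpy as np
from transformers import BertTokenizer, BertModel

# Our "database" of MTG card oracle texts (illustrative placeholder strings)
cards = [
    "Flying. When this creature enters the battlefield, draw a card.",
    "Destroy target creature with flying.",
    "Counter target spell unless its controller pays 2.",
    "Add two mana of any one color.",
    "Creatures you control get +1/+1 until end of turn.",
    "Deal 3 damage to any target.",
]
```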

User's Natural Language Query

Next, we need to get a natural language query from the user. In our example this is just a hardcoded string, but you could easily set this up to accept actual user input!
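For example (the query string here is just a placeholder):

```python
# Hardcoded natural language query; replace with real user input if you like
query = "a cheap creature that lets me draw a card when it enters the battlefield"
```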

Setting up the BERT Model and Tokenizer

BERT (Bidirectional Encoder Representations from Transformers) is a powerful Natural Language Processing model. We will use it to convert the card texts and the query into meaningful numerical representations called embeddings, which let us perform mathematical operations on them, like similarity searches. In the following code, we're setting up the BERT model that will do our processing, as well as the tokenizer, which splits our natural language query and card descriptions into tokens that the model requires in order to turn them into embeddings.
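A sketch of that setup, assuming the standard pre-trained bert-base-uncased checkpoint from Hugging Face:

```python
# Load the pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()  # we're only doing inference, not training
```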

Generating Embeddings

Now that we have our model and tokenizer set up, we can create our embeddings for our cards. Embeddings are vector representations of words, sentences, or documents. In NLP, they are used to capture the semantic meanings of texts. Each dimension of an embedding vector represents a latent feature of the text, often capturing aspects of its syntax and semantics. In simpler terms, they capture aspects of the text that are implied in a given context, such as positive or negative sentiment or the general tone (anger, excitement, formality, and so on).

In the code below, we're using PyTorch and its tensors, which are multi-dimensional arrays representing our embeddings, to interact with the BERT model. We'll also use NumPy to convert the embeddings from PyTorch tensors into NumPy arrays, because FAISS operates on NumPy arrays, not PyTorch tensors. The embeddings are stored as a matrix for efficient computation and manipulation.
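Here's one way this step might look. The embed helper is a name I'm introducing for this sketch, and mean-pooling BERT's token outputs is one common choice (pooling the [CLS] token is another); either way, each card ends up as a single 768-value vector, and the vectors are stacked into a matrix.

```python
def embed(text: str) -> torch.Tensor:
    # Tokenize the text and run it through BERT without tracking gradients
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the per-token embeddings into a single 768-dimensional vector
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

# One row per card description; we'll convert this matrix to NumPy for FAISS next
card_embeddings = torch.stack([embed(text) for text in cards])
```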

Building the FAISS Index

We're using Facebook AI Similarity Search (FAISS) to efficiently perform similarity searches between our user's NLQ and the embeddings generated from our MTG card descriptions. FAISS is very efficient at similarity searches on high-dimensional data like the embeddings created by the BERT model.

In the following code, we convert the embeddings from PyTorch tensors into the format FAISS can use, which is a NumPy array. Next, we grab the dimensionality. The dimensionality is simply the number of distinct values each vector holds. For base BERT models this is typically 768, meaning each word or text segment is represented by a vector of 768 distinct values. For any given natural language query, that's a lot of values! That's why BERT and similar models are considered high-dimensional models, and it's also why we need FAISS: it is specifically designed for fast, efficient similarity searches over high-dimensional vectors.

Once we have the dimensionality of our vectors, we can actually create our FAISS index. In our example we are using the L2 (Euclidean) distance index, a basic and efficient index that computes the distance between vectors as a straight line in Euclidean space. Don't worry if that sounds complicated: we're essentially applying the Euclidean distance formula (which you can look up if you're curious) to compare the values of our NLQ vector to those of each vector in our card description embeddings. The vectors with the smallest distances are considered the most similar, while those with larger distances are less similar.

Finally, now that we have created our index, we can insert our embeddings matrix into it.
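Putting those steps together, a sketch of the index-building code (using the card_embeddings matrix from the previous snippet) might look like this:

```python
# Convert the PyTorch tensor matrix into a float32 NumPy array, which FAISS expects
card_matrix = card_embeddings.numpy().astype(np.float32)

# The dimensionality of each vector (768 for bert-base models)
dimension = card_matrix.shape[1]

# Create a flat L2 (Euclidean distance) index and insert our embeddings matrix
index = faiss.IndexFlatL2(dimension)
index.add(card_matrix)
```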

Creating Embeddings for the NLQ

Next up, we need to create an embedding for the user's natural language query in the same way we made them for the card descriptions using BERT. This process is nearly identical, except we only have one string instead of many.
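Reusing the same embed helper from the sketch above:

```python
# Embed the user's query exactly the same way we embedded the card texts
query_embedding = embed(query)
```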

Performing the Similarity Search

Now that we have our embeddings created and our FAISS index set up, we can actually perform the similarity search. In the code below, k represents the number of results we want to show. We have it set to 5, which will give us the 5 descriptions most similar to our natural language query.

We will again need to convert our PyTorch tensor representation of the NLQ into a NumPy array for use with FAISS. FAISS also requires that the query array has the same shape conventions as our description embeddings. Currently our query embedding is a 1D array but our description embeddings form a 2D array, so we need to reshape the query embedding to match. We pass (1, -1) to the reshape method, which creates 1 row and as many columns as needed to make the reshape operation work. This effectively turns our 1D array into a 2D array with a single row and x columns (since we're using BERT, x will be 768).

Once we have our query embedding as a NumPy array with the correct shape, we can use it as the input to our FAISS search.
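Continuing the sketch with the index and query_embedding from above:

```python
k = 5  # number of most similar card descriptions to return

# Convert the query embedding to a float32 NumPy array and reshape the 1D
# vector of 768 values into a 2D array with 1 row and 768 columns
query_vector = query_embedding.numpy().astype(np.float32).reshape(1, -1)

# Search the index; FAISS returns the distances and indices of the k nearest cards
distances, indices = index.search(query_vector, k)
```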

Printing the Results

Our last step is to take the result of our index search, stored in the variable called indices, and use those indices to access the corresponding elements in our cards array. The indices variable is a 2D array returned by FAISS (one row per query), so to get at the individual index values we access its first element. That first element is an array of index values we can iterate over. If we use each of these with the square bracket operator on the cards array and print the result, we will see our top k most similar card descriptions.
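Continuing the sketch:

```python
# indices has one row per query, so grab the first (and only) row and
# use each index to look up the matching card description
print(f"Top {k} matches for: {query}")
for i in indices[0]:
    print("-", cards[i])
```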

Conclusion

This was a fun project to implement, and overall I'm happy with the results! I feel projects like this are a really good introduction to the world of artificial intelligence for someone with a programming background. The whole project comes in at just under 100 lines of code and produces reasonable results without a lot of complexity. Leave a comment if you found this interesting!
