
AI-Powered Emoji Search in 50+ Languages 😊🌍🚀

Develop an AI-powered semantic search for emojis using Python and open-source NLP libraries.

If you are on social media like Twitter or LinkedIn, you have probably noticed that emojis are creatively used in both informal and professional text-based communication. For example, the Rocket emoji 🚀 is often used on LinkedIn to symbolize high aspirations and ambitious goals, while the Bullseye 🎯 emoji is used in the context of achieving goals. Despite this growth in creative emoji use, most social media platforms lack a utility that helps users choose the right emoji to effectively communicate their message. I therefore decided to invest some time in a project I called Emojeez 💎, an AI-powered engine for emoji search and retrieval. You can experience Emojeez 💎 live using this fun interactive demo.

In this article, I will discuss my experience and explain how I employed advanced natural language processing (NLP) technologies to develop a semantic search engine for emojis. Concretely, I will present a case study on embedding-based semantic search with the following steps:

- How to use LLMs 🦜 to generate semantically rich emoji descriptions
- How to use Hugging Face 🤗 Transformers for multilingual embeddings
- How to integrate the Qdrant 🧑🏻‍🚀 vector database to perform efficient semantic search

I made the full code for this project available on GitHub.

Every new idea often begins with a spark of inspiration. For me, the spark came from Luciano Ramalho's book Fluent Python. It is a fantastic read that I highly recommend to anyone who likes to write truly Pythonic code. In Chapter 4 of his book, Luciano shows how to search over Unicode characters by querying their names in the Unicode standard. He created a Python utility that takes a query like "cat smiling" and retrieves all Unicode characters that have both "cat" and "smiling" in their names. Given the query "cat smiling", the utility retrieves three emojis: 😻, 😺, and 😸. Pretty cool, right?

From there, I started thinking about how modern AI technology could be used to build an even better emoji search utility. By "better," I envisioned a search engine that not only has better emoji coverage but also supports user queries in multiple languages beyond English.

If you are an emoji enthusiast, you know that 😻, 😺, and 😸 aren't the only smiley cat emojis out there. Some cat emojis are missing, notably 😼 and 😹. This is a known limitation of keyword search algorithms, which rely on string matching to retrieve relevant items. Keyword, or lexical, search algorithms are known among information retrieval practitioners to have high precision but low recall. High precision means the retrieved items usually match the user query well. On the other hand, low recall means the algorithm might not retrieve all relevant items. In many cases, the lower recall is due to string matching. For example, the emoji 😹 is named "cat with tears of joy", which does not contain the word "smiling". Therefore, it cannot be retrieved with the query "cat smiling" if we search for both terms, cat and smiling, in its name.

Another issue with lexical search is that it is usually language-specific. In Luciano's Fluent Python example, you can't find emojis using a query in another language because all Unicode characters, including emojis, have English names. To support other languages, we would need to translate each query into English first using machine translation. This would add more complexity and might not work well for all languages.

But hey, it's 2024 and AI has come a long way.
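To make the limitation concrete, here is a minimal sketch of a lexical emoji search in the spirit of Luciano's utility. It is not his exact code: it simply scans a range of code points covering the main emoji blocks using Python's built-in unicodedata module, and keeps only characters whose Unicode names contain every word in the query.

```python
import unicodedata

def keyword_search(query: str) -> list[str]:
    """Return emoji characters whose Unicode names contain every query word."""
    query_words = {word.upper() for word in query.split()}
    matches = []
    # Scan a range of code points that covers the main emoji blocks.
    for codepoint in range(0x1F300, 0x1FAFF):
        char = chr(codepoint)
        name = unicodedata.name(char, "")
        # Keep the character only if ALL query words appear in its name.
        if query_words and query_words <= set(name.split()):
            matches.append(char)
    return matches

print(keyword_search("cat smiling"))
# ['😸', '😺', '😻']  (names with "smile" instead of "smiling" are missed)
```

As the final comment shows, strict string matching silently drops relevant emojis such as 😼, whose name contains "smile" but not "smiling".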
We now have solutions to address these limitations. In the rest of this article, I will show you how.

In recent years, a new search paradigm has emerged with the popularity of deep neural networks for NLP. In this paradigm, the search algorithm does not look at the strings that make up the items in the search database or the query. Instead, it operates on numerical representations of text, known as vector embeddings. In embedding-based search algorithms, the search items, whether text documents or visual images, are first converted into data points in a vector space such that semantically relevant items end up nearby. Embeddings enable us to perform similarity search based on the meaning of the emoji description rather than the keywords in its name. Because they retrieve items based on semantic similarity rather than keyword similarity, embedding-based search algorithms are known as semantic search.

Using semantic search for emoji retrieval solves two problems:

1. We can go beyond keyword matching and use semantic similarity between emoji descriptions and user queries. This improves the coverage of the retrieved emojis, leading to higher recall.
2. If we represent emojis as data points in a multilingual embedding space, we can support user queries written in languages other than English, without needing translation into English.

That is very cool, isn't it? Let's see how 👀

If you use social media, you probably know that many emojis are almost never used literally. For example, 🍆 and 🍑 rarely denote an eggplant and a peach. Social media users are very creative in assigning meanings to emojis that go beyond their literal interpretation. This creativity limits the expressiveness of emoji names in the Unicode standard. A notable example is the 🌈 emoji, which is described in its Unicode name simply as rainbow, yet it is commonly used in contexts related to diversity, peace, and LGBTQ+ rights.

To build a useful search engine, we need a rich semantic description for each emoji that defines what the emoji represents and what it symbolizes. Given that there are more than 5,000 emojis in the current Unicode standard, doing this manually is not feasible. Luckily, we can employ Large Language Models (LLMs) to assist us in generating metadata for each emoji. Since LLMs are trained on vast amounts of web text, they have likely seen how each emoji is used in context.

For this task, I used the 🦙 Llama 3 LLM to generate metadata for each emoji. I wrote a prompt to define the task and what the LLM is expected to do. As illustrated in the figure below, the LLM generated a rich semantic description for the Bullseye 🎯 emoji. These descriptions are more suitable for semantic search than Unicode names. I released the LLM-generated descriptions as a Hugging Face dataset.

Figure: Using the Llama 3 LLM to generate enriched semantic descriptions for emojis.

Now that we have a rich semantic description for each emoji in the Unicode standard, the next step is to represent each emoji as a vector embedding. For this task, I used a multilingual transformer based on the BERT architecture, fine-tuned for sentence similarity across 50 languages. You can see the supported languages in the model card in the Hugging Face 🤗 library.

So far, I have only discussed embedding the emoji descriptions generated by the LLM, which are in English. But how can we support languages other than English? Well, here's where the magic of multilingual transformers comes in. The multilingual support is enabled through the embedding space itself.
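To illustrate what "nearby in the embedding space" means in practice, here is a small, self-contained sketch using the sentence-transformers library. The model name below is a popular multilingual encoder that supports 50+ languages; it is an assumption for illustration, not necessarily the exact model behind Emojeez 💎.

```python
from sentence_transformers import SentenceTransformer, util

# A multilingual sentence encoder (assumed here for illustration; any
# encoder trained for cross-lingual sentence similarity would work).
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

description = "an amulet believed to protect against the evil eye"
queries = [
    "protect from evil eye",          # English
    "Vor dem bösen Blick schützen",   # German
    "防止邪眼",                        # Chinese
]

# Encode the description and the queries into the same embedding space.
desc_vec = model.encode(description)
query_vecs = model.encode(queries)

# Semantically similar sentences should score high regardless of language.
for query, vec in zip(queries, query_vecs):
    score = util.cos_sim(desc_vec, vec).item()
    print(f"{score:.3f}  {query}")
```

If the encoder does its job, all three queries land close to the English description, even though they share no surface vocabulary with it.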
This means we can take user queries in any of the 50 supported languages and match them to emojis based on their English descriptions. The multilingual sentence encoder (or embedding model) maps semantically similar text phrases to nearby points in its embedding space. Let me show you what I mean with the following illustration.

Figure: A visual illustration of the multilingual embedding space, where sentences and phrases are geometrically organized based on their semantic similarity, regardless of the language.

In the figure above, we see that semantically similar phrases end up as nearby data points in the embedding space, even if they are expressed in different languages.

Once we have our emojis represented as vector embeddings, the next step is to build an index over these embeddings in a way that allows for efficient search operations. For this purpose, I chose Qdrant, an open-source vector similarity search engine that provides high-performance search capabilities.

Setting up Qdrant for this task is as simple as the code snippet below (you can also check out this Jupyter Notebook).

```python
import pickle
from typing import Any, Dict

import numpy as np
from qdrant_client import QdrantClient, models

# Load the emoji dictionary from a pickle file
# (file_path points to the pickle file with emoji metadata and embeddings)
with open(file_path, 'rb') as file:
    emoji_dict: Dict[str, Dict[str, Any]] = pickle.load(file)

# Set up the Qdrant client and populate the database
vector_DB_client = QdrantClient(":memory:")
embedding_dict = {
    emoji: np.array(metadata['embedding'])
    for emoji, metadata in emoji_dict.items()
}

# Remove the embeddings from the dictionary so it can be used
# as payload in Qdrant
for emoji in list(emoji_dict):
    del emoji_dict[emoji]['embedding']

embedding_dim: int = next(iter(embedding_dict.values())).shape[0]

# Create a new collection in Qdrant
vector_DB_client.create_collection(
    collection_name="EMOJIS",
    vectors_config=models.VectorParams(
        size=embedding_dim,
        distance=models.Distance.COSINE,
    ),
)

# Upload vectors to the collection
vector_DB_client.upload_points(
    collection_name="EMOJIS",
    points=[
        models.PointStruct(
            id=idx,
            vector=embedding_dict[emoji].tolist(),
            payload=emoji_dict[emoji],
        )
        for idx, emoji in enumerate(emoji_dict)
    ],
)
```

Now the search index vector_DB_client is ready to take queries. All we need to do is transform the incoming user query into a vector embedding using the same embedding model we used to embed the emoji descriptions. This can be done with the function below.

```python
from typing import List

from sentence_transformers import SentenceTransformer

def retrieve_relevant_emojis(
    embedding_model: SentenceTransformer,
    vector_DB_client: QdrantClient,
    query: str,
    num_to_retrieve: int,
) -> List[models.ScoredPoint]:
    """Return emojis relevant to the query using the sentence encoder and Qdrant."""
    # Embed the query
    query_vector = embedding_model.encode(query).tolist()

    hits = vector_DB_client.search(
        collection_name="EMOJIS",
        query_vector=query_vector,
        limit=num_to_retrieve,
    )
    return hits
```

To display the retrieved emojis, their similarity scores with the query, and their Unicode names, I wrote the following helper function.

```python
import emoji as em

def show_top_10(query: str) -> None:
    """Show the emojis that are most relevant to the query."""
    # sentence_encoder is the multilingual embedding model loaded earlier
    hits = retrieve_relevant_emojis(
        sentence_encoder, vector_DB_client, query, num_to_retrieve=10
    )
    for i, hit in enumerate(hits, start=1):
        emoji_char = hit.payload['Emoji']
        score = hit.score
        space = len(emoji_char) + 3
        unicode_desc = ' '.join(em.demojize(emoji_char).split('_')).upper()
        print(f"{i:<3} {emoji_char:<{space}}", end='')
        print(f"{score:<7.3f}", end=' ')
        print(f"{unicode_desc[1:-1]:<55}")
```

Now everything is set up, and we can look at a few examples. Remember the "cat smiling" query from Luciano's book?
Let's see how semantic search differs from keyword search.

```python
show_top_10('cat smiling')

>> 1   😼    0.651   CAT WITH WRY SMILE
   2   😸    0.643   GRINNING CAT WITH SMILING EYES
   3   😹    0.611   CAT WITH TEARS OF JOY
   4   😻    0.603   SMILING CAT WITH HEART-EYES
   5   😺    0.596   GRINNING CAT
   6   🐱    0.522   CAT FACE
   7   🐈    0.513   CAT
   8   🐈‍⬛    0.495   BLACK CAT
   9   😽    0.468   KISSING CAT
   10  🐆    0.452   LEOPARD
```

Awesome! Not only did we get the expected cat emojis like 😸, 😺, and 😻, which the keyword search retrieved, but also the smiley cats 😼, 😹, 🐱, and 😽. This showcases the higher recall, or higher coverage of the retrieved items, that I mentioned earlier. Indeed, more cats are always better!

The previous "cat smiling" example shows how embedding-based semantic search can retrieve a broader and more meaningful set of items, improving the overall search experience. However, I don't think this example truly shows the power of semantic search.

Imagine looking for something but not knowing its name. For example, take the 🧿 object. Do you know what it's called in English? I sure didn't. But I know a bit about it. In Middle Eastern and Central Asian cultures, the 🧿 is believed to protect against the evil eye. So, I knew what it does but not what it's called.

Let's see if we can find the 🧿 emoji with our search engine by describing it using the query "protect from evil eye".

```python
show_top_10('protect from evil eye')

>> 1   🧿    0.409   NAZAR AMULET
   2   👓    0.405   GLASSES
   3   🥽    0.387   GOGGLES
   4   👁    0.383   EYE
   5   🦹🏻    0.382   SUPERVILLAIN LIGHT SKIN TONE
   6   👀    0.374   EYES
   7   🦹🏿    0.370   SUPERVILLAIN DARK SKIN TONE
   8   🛡️    0.369   SHIELD
   9   🦹🏼    0.366   SUPERVILLAIN MEDIUM-LIGHT SKIN TONE
   10  🦹🏻‍♂    0.364   MAN SUPERVILLAIN LIGHT SKIN TONE
```

And voilà! It turns out that the 🧿 is actually called the Nazar Amulet. I learned something new 😄

One of the features I really wanted this search engine to have is support for as many languages as possible besides English. So far, we have not tested that. Let's test the multilingual capabilities using the description of the Nazar Amulet 🧿 emoji, by translating the phrase "protection from evil eyes" into other languages and using each translation as a query, one language at a time.
Here are the results for some languages.

```python
show_top_10('يحمي من العين الشريرة')  # Arabic

>> 1   🧿    0.442   NAZAR AMULET
   2   👓    0.430   GLASSES
   3   👁    0.414   EYE
   4   🥽    0.403   GOGGLES
   5   👀    0.403   EYES
   6   🦹🏻    0.398   SUPERVILLAIN LIGHT SKIN TONE
   7   🙈    0.394   SEE-NO-EVIL MONKEY
   8   🫣    0.387   FACE WITH PEEKING EYE
   9   🧛🏻    0.385   VAMPIRE LIGHT SKIN TONE
   10  🦹🏼    0.383   SUPERVILLAIN MEDIUM-LIGHT SKIN TONE
```

```python
show_top_10('Vor dem bösen Blick schützen')  # German

>> 1   😷    0.369   FACE WITH MEDICAL MASK
   2   🫣    0.364   FACE WITH PEEKING EYE
   3   🛡️    0.360   SHIELD
   4   🙈    0.359   SEE-NO-EVIL MONKEY
   5   👀    0.353   EYES
   6   🙉    0.350   HEAR-NO-EVIL MONKEY
   7   👁    0.346   EYE
   8   🧿    0.345   NAZAR AMULET
   9   💂🏿‍♀️    0.345   WOMAN GUARD DARK SKIN TONE
   10  💂🏿‍♀    0.345   WOMAN GUARD DARK SKIN TONE
```

```python
show_top_10('Προστατέψτε από το κακό μάτι')  # Greek

>> 1   👓    0.497   GLASSES
   2   🥽    0.484   GOGGLES
   3   👁    0.452   EYE
   4   🕶️    0.430   SUNGLASSES
   5   🕶    0.430   SUNGLASSES
   6   👀    0.429   EYES
   7   👁️    0.415   EYE
   8   🧿    0.411   NAZAR AMULET
   9   🫣    0.404   FACE WITH PEEKING EYE
   10  😷    0.391   FACE WITH MEDICAL MASK
```

```python
show_top_10('Защитете от лошото око')  # Bulgarian

>> 1   👓    0.475   GLASSES
   2   🥽    0.452   GOGGLES
   3   👁    0.448   EYE
   4   👀    0.418   EYES
   5   👁️    0.412   EYE
   6   🫣    0.397   FACE WITH PEEKING EYE
   7   🕶️    0.387   SUNGLASSES
   8   🕶    0.387   SUNGLASSES
   9   😝    0.375   SQUINTING FACE WITH TONGUE
   10  🧿    0.373   NAZAR AMULET
```

```python
show_top_10('防止邪眼')  # Chinese

>> 1   👓    0.425   GLASSES
   2   🥽    0.397   GOGGLES
   3   👁    0.392   EYE
   4   🧿    0.383   NAZAR AMULET
   5   👀    0.380   EYES
   6   🙈    0.370   SEE-NO-EVIL MONKEY
   7   😷    0.369   FACE WITH MEDICAL MASK
   8   🕶️    0.363   SUNGLASSES
   9   🕶    0.363   SUNGLASSES
   10  🫣    0.360   FACE WITH PEEKING EYE
```

```python
show_top_10('邪眼から守る')  # Japanese

>> 1   🙈    0.379   SEE-NO-EVIL MONKEY
   2   🧿    0.379   NAZAR AMULET
   3   🙉    0.370   HEAR-NO-EVIL MONKEY
   4   😷    0.363   FACE WITH MEDICAL MASK
   5   🙊    0.363   SPEAK-NO-EVIL MONKEY
   6   🫣    0.355   FACE WITH PEEKING EYE
   7   🛡️    0.355   SHIELD
   8   👁    0.351   EYE
   9   🦹🏼    0.350   SUPERVILLAIN MEDIUM-LIGHT SKIN TONE
   10  👓    0.350   GLASSES
```

For languages as diverse as Arabic, German, Greek, Bulgarian, Chinese, and Japanese, the 🧿 emoji always appears in the top 10! This is pretty fascinating given that these languages have different linguistic features and writing scripts, and it is thanks to the massive multilinguality of our 🤗 sentence Transformer.

The last thing I want to mention is that no technology, no matter how advanced, is perfect. Semantic search is great for improving the recall of information retrieval systems. This means we can retrieve more relevant items even when there is no keyword overlap between the query and the items in the search index. However, this comes at the expense of precision. Remember from the 🧿 emoji example that in some languages, the emoji we were looking for did not show up in the top 5 results. For this application, that is not a big problem, since it is not cognitively demanding to quickly scan through emojis to find the one we desire, even if it is ranked at the 50th position. But in other cases, such as searching through long documents, users may not have the patience or the resources to skim through dozens of documents. Developers need to keep in mind users' cognitive as well as resource constraints when building search engines. Some of the design choices I made for the Emojeez 💎 search engine may not work as well for other applications.

Another thing to mention is that AI models are known to learn socio-cultural biases from their training data. There is a large volume of documented research showing how modern language technology can amplify gender stereotypes and be unfair to minorities. So, we need to be aware of these issues and do our best to tackle them when deploying AI in the real world.
If you notice such unwanted biases or unfair behaviors in Emojeez 💎, please let me know, and I will do my best to address them.

Working on the Emojeez 💎 project was a fascinating journey that taught me a lot about how modern AI and NLP technologies can be employed to address the limitations of traditional keyword search. By harnessing the power of Large Language Models for enriching emoji metadata, multilingual transformers for creating semantic embeddings, and Qdrant for efficient vector search, I was able to create a search engine that makes emoji search more fun and accessible across 50+ languages. Although this project focuses on emoji search, the underlying technology has potential applications in multimodal search and recommendation systems.

For readers who are proficient in languages other than English, I am particularly interested in your feedback. Does Emojeez 💎 perform equally well in English and in your native language? Did you notice any differences in quality or accuracy? Please give it a try and let me know what you think. Your insights are invaluable.

Thank you for reading, and I hope you enjoy exploring Emojeez 💎 as much as I enjoyed building it.

Happy emoji search! 📆😊🌍🚀

Note: Unless otherwise noted, all images are created by the author.
