Building a Semantic Search CLI with Vector Databases, OpenAI, and Go

Introduction

Hello everyone! In this article I'll be building a CLI that lets you:

  1. Add new article headlines

  2. Search for headlines related to a certain subject (semantic search)

We'll be using OpenAI's API and a free vector database, and we'll write the code in Go. Let's start!

Semantic Search

What does it mean?

Semantic search is a way for computers to understand what you're really looking for when you type something into a search bar. Instead of just looking at the exact words you typed, it tries to understand the meaning behind your words.

For example, if you search for "apple", a regular search might just look for pages with the word "apple". But a semantic search will try to figure out if you're looking for the fruit, the tech company, or something else, based on other words you typed or things you've searched for in the past. It's like a smarter, more understanding search.

How is it achievable?

This is where vector databases come in. Vector databases are a special kind of database that can store and search for data in the form of vectors. Vectors are a mathematical representation of data. In the context of semantic search, we often convert words or phrases into vectors using techniques like word embeddings.

Word embeddings

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation. They are a distributed representation for text that is perhaps one of the key breakthroughs for the impressive performance of deep learning methods on challenging natural language processing problems.

Word embeddings map words in a vocabulary to vectors of real numbers. For example, the word "king" might be represented by the vector [0.1, 0.3, -0.2, ..., 0.8] in a 300-dimensional space. These vectors are chosen so that they resemble the words' meanings. Words that are used and occur in the same contexts tend to have similar vectors. For example, "king" and "queen" might have vectors that are very close together and far away from "carrot".
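To make "close together" a bit more concrete, here is a minimal sketch that compares vectors with cosine similarity. The 3-dimensional vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions), and cosineSimilarity is a helper name I chose for this sketch, not something from a library.

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity returns the cosine of the angle between a and b.
// It assumes both slices have the same length and are non-zero.
func cosineSimilarity(a, b []float64) float64 {
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	// Made-up toy vectors, NOT real embeddings.
	king := []float64{0.9, 0.8, 0.1}
	queen := []float64{0.85, 0.9, 0.15}
	carrot := []float64{0.1, 0.05, 0.95}

	fmt.Printf("king vs queen:  %.3f\n", cosineSimilarity(king, queen))
	fmt.Printf("king vs carrot: %.3f\n", cosineSimilarity(king, carrot))
}
```

With these toy values the king/queen score comes out close to 1 while the king/carrot score is much lower, which is the behaviour we want from embeddings.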

These word vectors are learned by training a machine learning model on a large amount of text data. The model's goal is to predict a word given its context (the words around it), and in doing so it learns these word vectors. Training such a model ourselves is out of scope here, so in this article we'll get our embeddings from OpenAI's API instead.

[Figure: "Riding the AI Wave with Vector Databases: How they work (and why VCs love them)", LunaBrain]

In the figure above, each color represents a group of semantically related words. The closer two words are to each other, the more similar they are.

What actually happens

When you type a search query, we convert your words into vectors. We then compare these vectors with the vectors we have stored in the database. The idea is that similar words or phrases will have similar vectors. So, by finding vectors in the database that are close to your search query's vector, we can find results that are semantically similar to what you're looking for.

This way, even if the exact words in your search query don't appear in a document, the document can still be a search result if it's semantically similar to your query. This is how we achieve semantic search with vector databases.
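The flow described above can be sketched without any database at all. This is only an in-memory illustration under my own assumptions: the entry type, the topK helper and the tiny 3-dimensional vectors are all hypothetical, and a real vector database does the scoring and sorting for us.

```go
package main

import (
	"fmt"
	"sort"
)

// entry pairs a stored text with its (made-up) embedding vector.
type entry struct {
	text   string
	vector []float32
}

// dot returns the dot product of two equal-length vectors.
func dot(a, b []float32) float32 {
	var sum float32
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum
}

// topK returns the texts of the k entries that score highest against the query.
func topK(query []float32, entries []entry, k int) []string {
	sorted := make([]entry, len(entries))
	copy(sorted, entries) // keep the caller's slice untouched
	sort.Slice(sorted, func(i, j int) bool {
		return dot(query, sorted[i].vector) > dot(query, sorted[j].vector)
	})
	if k > len(sorted) {
		k = len(sorted)
	}
	texts := make([]string, 0, k)
	for _, e := range sorted[:k] {
		texts = append(texts, e.text)
	}
	return texts
}

func main() {
	stored := []entry{
		{"Local team wins the championship", []float32{0.9, 0.1, 0.0}},
		{"New espresso blend hits the shelves", []float32{0.1, 0.9, 0.1}},
		{"Star striker signs record contract", []float32{0.8, 0.2, 0.1}},
	}
	query := []float32{1.0, 0.0, 0.0} // pretend this is the embedding for "sports"
	for _, text := range topK(query, stored, 2) {
		fmt.Println(text)
	}
}
```

Neither of the two printed headlines contains the word "sports", yet both rank above the coffee headline because their vectors point in a similar direction to the query.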

Walkthrough

Requirements

Before moving forward make sure you have the following tools ready:

  1. An API key from OpenAI with quota available. More info: https://platform.openai.com/api-keys

  2. An account on SingleStore, which we'll be using as our vector database. This video has a walkthrough on how to set up the database for free.

  3. Go installed on your local environment.

Make sure to read the comments in every code snippet I provide below.

Integrating OpenAI with Go

First things first, we'll create a Go struct that abstracts all the interaction with OpenAI.

Before proceeding, make sure to install the go-openai package using the following command:

go get github.com/sashabaranov/go-openai

// structs/openai.go
package structs
import (
    "context"
    "log"
    "github.com/sashabaranov/go-openai"
)

// OpenAIClient wraps the client from the go-openai package.
// The API key is stored as well, although it isn't really needed here.
type OpenAIClient struct {
    APIKey string
    client *openai.Client
}

// NewOpenAIClient instantiates a new OpenAIClient and returns it.
func NewOpenAIClient(apiKey string) *OpenAIClient {
    client := openai.NewClient(apiKey)

    return &OpenAIClient{
        APIKey: apiKey,
        client: client,
    }
}

// GetEmbeddingForText sends the text to the OpenAI API and returns its embedding.
func (c *OpenAIClient) GetEmbeddingForText(text string) openai.Embedding {
    queryReq := openai.EmbeddingRequest{
        Input: []string{text},
        Model: openai.AdaEmbeddingV2,
    }
    queryResponse, err := c.client.CreateEmbeddings(context.Background(), queryReq)
    if err != nil {
        log.Fatal("Error creating query embedding:", err)
    }
    return queryResponse.Data[0]
}

Connecting to SingleStore Database

Now that everything is set up on the OpenAI side, we'll create another struct to handle the integration with the SingleStore database.

Before proceeding, make sure to install the MySQL driver for Go, which SingleStore's documentation says to use.

go get github.com/go-sql-driver/mysql

// structs/mysql.go
package structs
import (
    "database/sql"
    "fmt"
    "log"

    _ "github.com/go-sql-driver/mysql"
)

type MySQL struct {
// struct that has the connection object
    db *sql.DB
}

func ConnectToDatabse(username string, password string, host string, port int, database string) *MySQL {
    // build the connection string (DSN) expected by the MySQL driver
    connection := fmt.Sprintf("%s:%s@tcp(%s:%d)/%s?parseTime=true", username, password, host, port, database)
    db, err := sql.Open("mysql", connection)
    if err != nil {
        log.Fatal("Error opening database:", err)
    }
    // sql.Open may only validate its arguments, so ping to check the connection really works
    if err := db.Ping(); err != nil {
        log.Fatal("Error connecting to database:", err)
    }

    return &MySQL{
        db: db,
    }
}

func (mysql *MySQL) CreateDatabase() error {
// We'll name our database embeddings, not the best but whatever :D
    _, err := mysql.db.Exec("CREATE DATABASE IF NOT EXISTS embeddings")

    return err
}

func (mysql *MySQL) CreateTable() (sql.Result, error) {
    // we'll have a table with the same name as the database lol, don't do this, change it to something else
    // it has 2 columns: text (the original text, a string) and embedding (a blob, i.e. a byte array)
    result, err := mysql.db.Exec("CREATE TABLE IF NOT EXISTS embeddings (text TEXT, embedding BLOB)")

    return result, err
}

func (mysql *MySQL) InsertEmbedding(text string, embedding []byte) (sql.Result, error) {
    // inserts embedding into db.
    res, err := mysql.db.Exec("INSERT INTO embeddings (text, embedding) VALUES (?, ?)", text, embedding)

    return res, err
}

func (mysql *MySQL) GetRelatedEmbeddings(embedding []byte) []string {
    // performs the semantic search: scores every row with dot_product
    // and returns the 3 rows with the highest similarity
    res, err := mysql.db.Query("SELECT text, dot_product(embedding, ?) as similarity FROM embeddings ORDER BY similarity DESC LIMIT 3", embedding)
    if err != nil {
        log.Fatal("Error querying database:", err)
    }
    // release the result set once we're done reading it
    defer res.Close()
    // will return only an array of the text field.
    var relatedEmbeddings []string
    for res.Next() {
        var text string
        var similarity float32
        err = res.Scan(&text, &similarity)
        if err != nil {
            log.Fatal("Error scanning row:", err)
        }
        relatedEmbeddings = append(relatedEmbeddings, text)
    }
    // Next() also stops on errors, so check for one
    if err = res.Err(); err != nil {
        log.Fatal("Error reading rows:", err)
    }

    return relatedEmbeddings
}

// always close the connection before the application exits
func (mysql *MySQL) Close() {
    mysql.db.Close()
}

The interesting part of the code snippet above is the dot_product function executed in the related embeddings query.

The dot product between two embeddings (also known as vectors) is a mathematical operation that takes two equal-length sequences of numbers, multiplies them element by element, and sums the results into a single number. The larger that number, the more the two vectors point in the same direction, so it can be used as a similarity score. That's why the query orders by similarity in descending order and keeps the top 3 rows.
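Here is a small worked example of that operation, as a sketch. dotProduct and normalize are helper names I made up for illustration. OpenAI's documentation describes its embeddings as normalized to length 1, and for such vectors the dot product is the same number as the cosine similarity, which is what the second call below imitates.

```go
package main

import (
	"fmt"
	"math"
)

// dotProduct multiplies the vectors element by element and sums the results.
// It assumes a and b have the same length.
func dotProduct(a, b []float64) float64 {
	var sum float64
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum
}

// normalize returns a copy of v scaled to length 1 (v must be non-zero).
func normalize(v []float64) []float64 {
	norm := math.Sqrt(dotProduct(v, v))
	out := make([]float64, len(v))
	for i, x := range v {
		out[i] = x / norm
	}
	return out
}

func main() {
	a := []float64{1, 2, 3}
	b := []float64{4, 5, 6}

	// 1*4 + 2*5 + 3*6 = 32
	fmt.Println(dotProduct(a, b))

	// Once both vectors have length 1, the dot product lies between -1 and 1
	// and equals the cosine of the angle between them.
	fmt.Printf("%.4f\n", dotProduct(normalize(a), normalize(b)))
}
```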

Putting things all together

Now that we're connected to both OpenAI and the database, we'll create an embeddings package that gets the embedding for a text and inserts it into the database. This makes it simple to just send in a text and have its embedding saved.

//embeddings/main.go
package embeddings

import (
    "OPENAI-GO/embeddings/structs"
    "bytes"
    "encoding/binary"
    "fmt"
    "log"
)

func CreateNewEmbedding(text string, dbclient *structs.MySQL, openAIClient *structs.OpenAIClient) {
// Gets the embedding from OpenAI and inserts into our database
    embedding := openAIClient.GetEmbeddingForText(text)
    _, err := dbclient.InsertEmbedding(text, convertFloatToByte(embedding.Embedding))
    if err != nil {
        fmt.Println("Error inserting embedding:", err)
    }
}

func GetRelatedEmbeddings(text string, dbclient *structs.MySQL, openAIClient *structs.OpenAIClient) {
// gets the embedding for the required text to search and performs DB semantic search
    embedding := openAIClient.GetEmbeddingForText(text)
    result := dbclient.GetRelatedEmbeddings(convertFloatToByte(embedding.Embedding))

    fmt.Println("Search Results:")
    for _, headline := range result {
        fmt.Println(headline)
    }
    fmt.Println()

}

func convertFloatToByte(embedding []float32) []byte {
// helper function to convert float32 array into a byte array
// this is because our database takes a bytearray (blob) as embedding.
    buf := new(bytes.Buffer)

    for _, v := range embedding {
        err := binary.Write(buf, binary.LittleEndian, v)
        if err != nil {
            log.Fatalf("binary.Write failed: %v", err)
        }
    }

    return buf.Bytes()
}
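To see what actually ends up in the blob, here is a round-trip sketch. encodeFloats mirrors what convertFloatToByte does (4 little-endian bytes per float32), while decodeFloats is a hypothetical inverse that isn't part of the project. It's handy if you ever want to read an embedding back out of the database.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"log"
)

// encodeFloats packs each float32 as 4 little-endian bytes,
// the same layout convertFloatToByte produces.
func encodeFloats(values []float32) ([]byte, error) {
	buf := new(bytes.Buffer)
	if err := binary.Write(buf, binary.LittleEndian, values); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeFloats is a hypothetical inverse helper: it turns the blob back into floats.
func decodeFloats(blob []byte) ([]float32, error) {
	if len(blob)%4 != 0 {
		return nil, fmt.Errorf("blob length %d is not a multiple of 4", len(blob))
	}
	values := make([]float32, len(blob)/4)
	if err := binary.Read(bytes.NewReader(blob), binary.LittleEndian, values); err != nil {
		return nil, err
	}
	return values, nil
}

func main() {
	blob, err := encodeFloats([]float32{0.5, -1.25, 3})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(blob)) // 12: three floats, 4 bytes each

	decoded, err := decodeFloats(blob)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(decoded) // [0.5 -1.25 3]
}
```

The byte order matters: the database has to interpret the blob with the same layout it was written in, otherwise the similarity scores would be meaningless.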

Creating our simple CLI

Now that we have all of this set up, the last step is to create a simple CLI that lets us add and search for article headlines.

Before proceeding, download the Go package below. It's a simple, wonderful package for creating simple CLIs.

go get github.com/AlecAivazis/survey/v2

package main

import (
    "OPENAI-GO/embeddings/embeddings"
    "OPENAI-GO/embeddings/structs"
    "fmt"

    "github.com/AlecAivazis/survey/v2"
)

func main() {
    // initializing everything (database, OpenAI client)
    dbClient := structs.ConnectToDatabse("username", "password", "host", 3306, "db-name")
    fmt.Println("Database Connected ✅")
    defer dbClient.Close()
    // stop early if the table could not be created
    if _, err := dbClient.CreateTable(); err != nil {
        fmt.Println("Error creating table:", err)
        return
    }
    fmt.Println("Table Created/Exists ✅")

    openAIClient := structs.NewOpenAIClient("Api-Key")

    // prompting with 2 items (see options key below)
    var qs = []*survey.Question{
        {
            Name: "name",
            Prompt: &survey.Select{
                Message: "What brings you today? 🤔",
                Options: []string{"Add new Article Headline", "Get related headlines"},
            },
        },
    }
    // empty struct that gets filled with our answer when we pick
    answers := struct {
        Name string
    }{}
    // infinite loop that exits when you enter 'c' at the final prompt
    for {
        // prompt the user
        err := survey.Ask(qs, &answers)
        if err != nil {
            fmt.Println(err.Error())
            return
        }
        // switch on the selected option and act accordingly.
        switch answers.Name {
        case "Add new Article Headline":
            // prompt to enter a headline
            var article string
            prompt := &survey.Input{
                Message: "Enter the article headline",
            }
            err := survey.AskOne(prompt, &article)

            if err != nil {
                fmt.Println(err.Error())
                return
            }
            // save to database.
            embeddings.CreateNewEmbedding(article, dbClient, openAIClient)

            fmt.Println("Article headline added ✅")

        case "Get related headlines":
            var article string
            // prompt to enter a headline to search for semantically
            prompt := &survey.Input{
                Message: "Enter the article headline you want to search for",
            }
            err := survey.AskOne(prompt, &article)
            if err != nil {
                fmt.Println(err.Error())
                return
            }
            // print the response.
            embeddings.GetRelatedEmbeddings(article, dbClient, openAIClient)
        }
        // ask the user to press enter to continue
        var cont string
        prompt := &survey.Input{
            Message: "Press enter to continue or c to exit",
        }
        err = survey.AskOne(prompt, &cont)
        if err != nil {
            fmt.Println(err.Error())
            return
        }
        if cont == "c" {
            // or 'c' to break & exit.
            break
        }
    }
}

If everything went smoothly, the final result should look like this.

On adding a new Article:

On searching for related articles:

As you can see I typed in the word "sports" which gave me the related results I inserted before (I seeded a couple of other things related to coffee & animals too).

Summary

When I first started learning about semantic search I thought it was really cool, and I've wanted to write this article for some time now. I hope this walkthrough helped clarify what it's all about without going into deep details. The more I learn about AI-related topics the more I'll post, so stay tuned 🤪

GitHub repo for the code: https://github.com/amrelhewy09/semantic-search-go.git
