As Large Language Models (LLMs) become a bigger part of our apps, making sure they work reliably comes with some challenges. Unlike traditional code with predictable outputs, LLMs are nondeterministic and can output downright weird stuff. That’s where good evaluation tools come in! In this post, I’ll show you the eval package from my newly renamed maragu.dev/gai module.
I’ll show you how Go developers can use this package to systematically evaluate LLM applications while working seamlessly with Go’s existing testing tools. No need to learn a whole new framework! (*cough* LangChain *cough*.)
Traditional software testing usually asks “Did I get exactly what I expected?”. But LLMs are not deterministic, so testing them isn’t quite the same as testing regular code. That’s why we talk about “evaluation” rather than just “testing”: evaluation lets us measure how well the model is doing without expecting a perfect match every time.
Think of it like Test-Driven Development (TDD), but with a twist. We can do what I like to call Evaluation-Driven Development.
As Chip Huyen explains in her book “AI Engineering”, this kind of systematic evaluation is essential for building reliable AI systems that can be maintained and improved over time. Evaluation isn’t just a final step — it should be integrated into your entire development process.
Alrighty, let’s look at some code.
First, grab my maragu.dev/gai module:
$ go get maragu.dev/gai
The eval package gives you a lightweight way to measure LLM performance. It has three main pieces: the eval.Run function that hooks your evals into Go tests, the Sample type that holds an input, an output, and the expected output, and Scorers that turn a sample into a score.
What makes this package really nice for Go developers is how it plugs right into Go’s testing framework. You can run your LLM evaluations alongside your regular tests, using the same commands and tools you already know.
Let’s walk through a simple evaluation. First, set up a basic test file:
package examples_test

import (
	"testing"

	"maragu.dev/gai/eval"
)

// TestEvalPrompt evaluates the Prompt method.
// All evals must be prefixed with "TestEval".
func TestEvalPrompt(t *testing.T) {
	// Evals only run if "go test" is being run with "-test.run=TestEval", e.g.: "go test -test.run=TestEval ./..."
	eval.Run(t, "answers with a pong", func(e *eval.E) {
		// Initialize our intensely powerful LLM.
		llm := &powerfulLLM{response: "plong"}

		// Send our input to the LLM and get an output back.
		input := "ping"
		output := llm.Prompt(input)

		// Create a sample to pass to the scorer.
		sample := eval.Sample{
			Input:    input,
			Output:   output,
			Expected: "pong",
		}

		// Score the sample using the Levenshtein distance scorer.
		// The scorer is created inline, but for scorers that need more setup, this can be done elsewhere.
		result := e.Score(sample, eval.LexicalSimilarityScorer(eval.LevenshteinDistance))

		// Log the sample, result, and timing information.
		e.Log(sample, result)
	})
}

type powerfulLLM struct {
	response string
}

func (l *powerfulLLM) Prompt(request string) string {
	return l.response
}
Here’s what’s going on:
- All evals are prefixed with TestEval so they’ll be recognized as both Go tests and evals.
- The eval.Run function works like t.Run, but does a bit of setup and ensures that evals are skipped during regular test runs.
- We create a Sample, score the sample with LexicalSimilarityScorer, and log the result.
- The results are logged to an evals.jsonl file that you can analyze later.

To run this evaluation, just use the Go test command with a filter:
$ go test -run TestEval ./...
This command runs only tests that start with “TestEval” and skips all other regular tests. Conversely, during normal test runs (like go test ./...), the eval.Run function automatically detects that it’s not being specifically targeted and skips the evaluation tests. This ensures your evaluations won’t slow down your normal development workflow, but you can still run them when you need to.
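For the curious, here’s a minimal sketch of how that kind of conditional skipping can work, assuming the runner inspects the test.run flag; it’s an illustration of the mechanism, not necessarily how eval.Run is implemented:

package examples_test

import (
	"flag"
	"strings"
	"testing"
)

// runEval is a hypothetical helper that skips itself unless the test binary
// is being run with a -run filter targeting evals, e.g. "-run TestEval".
func runEval(t *testing.T, name string, f func(t *testing.T)) {
	t.Helper()

	// The -run filter is exposed to the test binary as the "test.run" flag.
	runFlag := flag.Lookup("test.run")
	if runFlag == nil || !strings.Contains(runFlag.Value.String(), "TestEval") {
		t.Skip("skipping eval during regular test run")
	}

	t.Run(name, f)
}

Run this with go test ./... and the helper skips; run it with go test -run TestEval ./... and it executes.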
The eval package currently gives you a few ways to score your LLM responses.
Lexical similarity measures how closely the text matches what you expected, character by character:
result := e.Score(sample, eval.LexicalSimilarityScorer(eval.LevenshteinDistance))
Levenshtein distance counts the minimum number of edits needed to change one string into another. It’s great for when you need responses to be very close to your reference answer.
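To make that concrete, here’s a small sketch of how a Levenshtein distance can be turned into a similarity score between 0 and 1 by normalizing against the longer string. This is just one common normalization, not necessarily the exact formula the package uses:

package examples

// levenshtein computes the minimum number of single-character edits
// (insertions, deletions, substitutions) needed to turn a into b.
func levenshtein(a, b string) int {
	ar, br := []rune(a), []rune(b)
	prev := make([]int, len(br)+1)
	curr := make([]int, len(br)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ar); i++ {
		curr[0] = i
		for j := 1; j <= len(br); j++ {
			cost := 1
			if ar[i-1] == br[j-1] {
				cost = 0
			}
			// Deletion, insertion, or substitution, whichever is cheapest.
			// (Uses the min builtin from Go 1.21+.)
			curr[j] = min(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(br)]
}

// similarity normalizes the distance to a score in [0, 1],
// where 1 means the strings are identical.
func similarity(a, b string) float64 {
	ar, br := []rune(a), []rune(b)
	if len(ar) == 0 && len(br) == 0 {
		return 1
	}
	longest := max(len(ar), len(br))
	return 1 - float64(levenshtein(a, b))/float64(longest)
}

With this normalization, “plong” vs. “pong” has a distance of 1 and a score of 1 - 1/5 = 0.8.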
For strict exact matching, you can do:
result := e.Score(sample, eval.LexicalSimilarityScorer(eval.ExactMatch))
Semantic similarity goes deeper than comparing characters; it looks at the meaning of the response:
embedder := &myEmbeddingProvider{} // Implements the embeddingGetter interface
result := e.Score(sample, eval.SemanticSimilarityScorer(embedder, eval.CosineSimilarity))
This approach needs something that can convert text into vector representations (embeddings), which are then compared using cosine similarity. It’s super useful when you care more about the meaning than the exact wording.
Semantic similarity is sometimes more appropriate for evaluating LLM outputs than exact matching, since many different phrasings can express the same meaning. Embedding-based metrics make sense when you care about conceptual accuracy rather than precise wording.
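For reference, cosine similarity itself is straightforward to compute. Here’s a small sketch using the standard formula (dot product divided by the product of the vector magnitudes), independent of the package’s own implementation:

package examples

import "math"

// cosineSimilarity returns a value between -1 and 1, where 1 means the two
// vectors point in the same direction, i.e. the texts are semantically very
// similar according to the embedding model.
func cosineSimilarity(a, b []float64) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}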
Here’s a quick example of implementing the embedding interface with OpenAI:
package examples_test

import (
	"context"
	"errors"

	// Note: adjust import paths and SDK versions to match your own setup.
	"github.com/openai/openai-go"
	"github.com/openai/openai-go/shared"
	"maragu.dev/env"
	"maragu.dev/gai"
)

type embeddingGetter struct{}

func (e *embeddingGetter) GetEmbedding(v string) ([]float64, error) {
	client := gai.NewOpenAIClient(gai.NewOpenAIClientOptions{Key: env.GetStringOrDefault("OPENAI_KEY", "")})
	res, err := client.Client.Embeddings.New(context.Background(), openai.EmbeddingNewParams{
		Input:          openai.F[openai.EmbeddingNewParamsInputUnion](shared.UnionString(v)),
		Model:          openai.F(openai.EmbeddingModelTextEmbedding3Small),
		EncodingFormat: openai.F(openai.EmbeddingNewParamsEncodingFormatFloat),
		Dimensions:     openai.F(int64(128)),
	})
	if err != nil {
		return nil, err
	}
	if len(res.Data) == 0 {
		return nil, errors.New("no embeddings returned")
	}
	return res.Data[0].Embedding, nil
}
I should mention that my maragu.dev/gai module is still under heavy development, and breaking changes will likely occur.
At the moment, only these two Scorers (lexical and semantic similarity) exist, but I’m working on adding more specialized evaluation methods soon. In particular, I’m planning an LLM-as-a-judge Scorer. More on that later.
During development, you can run your evals to make sure your changes don’t break anything:
$ go test -run TestEval ./...
This gives you immediate feedback on how your changes affect your LLM’s performance.
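If you want a regression to actually fail the run, one option is to assert a minimum score inside the eval. Here’s a hypothetical variant of the earlier example, reusing the powerfulLLM type from before; I’m assuming the value returned by e.Score exposes a numeric Score field, so check the package documentation for the exact type:

// TestEvalPromptWithThreshold fails if the score drops below a threshold.
func TestEvalPromptWithThreshold(t *testing.T) {
	eval.Run(t, "answers with something close to a pong", func(e *eval.E) {
		llm := &powerfulLLM{response: "plong"}

		sample := eval.Sample{
			Input:    "ping",
			Output:   llm.Prompt("ping"),
			Expected: "pong",
		}

		result := e.Score(sample, eval.LexicalSimilarityScorer(eval.LevenshteinDistance))
		e.Log(sample, result)

		// Hypothetical threshold check: fail the eval if performance regresses.
		if result.Score < 0.8 {
			t.Errorf("score %.2f is below threshold 0.80", result.Score)
		}
	})
}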
You can also add evaluations to your CI pipeline. Here’s an outline of a GitHub Actions setup:
jobs:
  evaluate:
    name: Evaluate
    runs-on: ubuntu-latest

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version-file: go.mod

      - name: Evaluate
        run: go test -run TestEval ./...
        env:
          LLM_KEY: ${{ secrets.LLM_KEY }}

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: evals.jsonl
          path: evals.jsonl
(I’m also working on running evals in CI and presenting the results directly in the PR.)
The evals.jsonl file contains the details of each evaluation run: the sample (input, output, and expected output), the resulting score, and timing information.
This structured format makes it easy to build dashboards or reports to track how your LLM performance evolves over time.
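If you want to roll your own reporting, a small sketch like this reads the file line by line without assuming anything about its exact fields; each line is decoded into a generic map that you can feed into whatever report or dashboard you like:

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := os.Open("evals.jsonl")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Each line in a JSONL file is a standalone JSON document.
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		var record map[string]any
		if err := json.Unmarshal(scanner.Bytes(), &record); err != nil {
			log.Fatal(err)
		}
		fmt.Println(record)
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}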
(And now that I mention it (why, thank you, Markus!), I’m also building evals.fun to take eval results in and spit out some nice graphs, to track changes over time. Early days for that one.)
The eval package from my maragu.dev/gai module gives Go developers a nice little tool for systematically evaluating LLM performance, fitting right into Go’s testing infrastructure. By taking an evaluation-driven approach, you can measure how well your LLM is doing, catch regressions during development and in CI, and track how results evolve over time.
This approach brings the discipline of software testing to AI development, helping bridge the gap between traditional code and the more unpredictable world of LLMs.
While the maragu.dev/gai module is still in its early stages, the evaluation framework provides a useful starting point for LLM testing. I’m actively developing more scorers and evaluation tools. Just be aware that the API is not yet finalized, and breaking changes are likely as the module evolves. I welcome contributions and ideas from Go developers building with LLMs. Go create those issues!
Also, join the r/LLMGophers subreddit! :D
I’m Markus, an independent software consultant. 🤓✨
See my services or reach out at markus@maragu.dk.