As Large Language Models (LLMs) become a bigger part of our apps, making sure they work reliably comes with some challenges. Unlike traditional code with predictable outputs, LLMs are nondeterministic and can output downright weird stuff. That’s where good evaluation tools come in! In this post, I’ll show you the eval package from my newly renamed maragu.dev/gai module.
I’ll show you how Go developers can use this package to systematically evaluate LLM applications while working seamlessly with Go’s existing testing tools. No need to learn a whole new framework! (*cough* LangChain *cough*.)
Traditional software testing usually asks “Did I get exactly what I expected?”. But LLMs are not deterministic, so testing them isn’t quite the same as testing regular code. That’s why we talk about “evaluation” rather than just “testing”: evaluation lets us measure how well the model is doing without expecting a perfect match every time.
Think of it like Test-Driven Development (TDD), but with a twist. We can do what I like to call Evaluation-Driven Development.
As Chip Huyen explains in her book “AI Engineering”, this kind of systematic evaluation is essential for building reliable AI systems that can be maintained and improved over time. Evaluation isn’t just a final step — it should be integrated into your entire development process.
Alrighty, let’s look at some code.
First, grab my maragu.dev/gai module:
$ go get maragu.dev/gai
The eval package gives you a lightweight way to measure LLM performance. It has three main pieces: the eval.Run function that hooks your evals into Go tests, the Sample type that holds an input, an output, and the expected output, and Scorers that turn a sample into a score.
What makes this package really nice for Go developers is how it plugs right into Go’s testing framework. You can run your LLM evaluations alongside your regular tests, using the same commands and tools you already know.
Let’s walk through a simple evaluation. First, set up a basic test file:
package examples_test

import (
	"testing"

	"maragu.dev/gai/eval"
)

// TestEvalPrompt evaluates the Prompt method.
// All evals must be prefixed with "TestEval".
func TestEvalPrompt(t *testing.T) {
	// Evals only run if "go test" is being run with "-test.run=TestEval", e.g.: "go test -test.run=TestEval ./..."
	eval.Run(t, "answers with a pong", func(e *eval.E) {
		// Initialize our intensely powerful LLM.
		llm := &powerfulLLM{response: "plong"}

		// Send our input to the LLM and get an output back.
		input := "ping"
		output := llm.Prompt(input)

		// Create a sample to pass to the scorer.
		sample := eval.Sample{
			Input:    input,
			Output:   output,
			Expected: "pong",
		}

		// Score the sample using the Levenshtein distance scorer.
		// The scorer is created inline, but for scorers that need more setup, this can be done elsewhere.
		result := e.Score(sample, eval.LexicalSimilarityScorer(eval.LevenshteinDistance))

		// Log the sample, result, and timing information.
		e.Log(sample, result)
	})
}

type powerfulLLM struct {
	response string
}

func (l *powerfulLLM) Prompt(request string) string {
	return l.response
}
Here’s what’s going on:
- All evals are prefixed with TestEval so they’ll be recognized as both Go tests and evals.
- The eval.Run function works like t.Run, but does a bit of setup and ensures that evals are skipped during regular test runs.
- We create a Sample, score the sample with LexicalSimilarityScorer, and log the result.
- The results are logged to an evals.jsonl file that you can analyze later.

To run this evaluation, just use the Go test command with a filter:
$ go test -run TestEval ./...
This command runs only tests that start with “TestEval” and skips all other regular tests. Conversely, during normal test runs (like go test ./...), the eval.Run function automatically detects that it’s not being specifically targeted and skips the evaluation tests. This ensures your evaluations won’t slow down your normal development workflow, but you can still run them when you need to.
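For the curious, here’s a minimal sketch of how that kind of conditional skipping can work, assuming the runner inspects the test.run flag; it’s an illustration of the mechanism, not necessarily how eval.Run is implemented:

package examples_test

import (
	"flag"
	"strings"
	"testing"
)

// runEval is a hypothetical helper that skips itself unless the test binary
// is being run with a -run filter targeting evals, e.g. "-run TestEval".
func runEval(t *testing.T, name string, f func(t *testing.T)) {
	t.Helper()

	// The -run filter is exposed to the test binary as the "test.run" flag.
	runFlag := flag.Lookup("test.run")
	if runFlag == nil || !strings.Contains(runFlag.Value.String(), "TestEval") {
		t.Skip("skipping eval during regular test run")
	}

	t.Run(name, f)
}

Run this with go test ./... and the helper skips; run it with go test -run TestEval ./... and it executes.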
The eval package currently gives you a few ways to score your LLM responses.
Lexical similarity measures how closely the text matches what you expected, character by character:
result := e.Score(sample, eval.LexicalSimilarityScorer(eval.LevenshteinDistance))
Levenshtein distance counts the minimum number of edits needed to change one string into another. It’s great for when you need responses to be very close to your reference answer.
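To make that concrete, here’s a small sketch of how a Levenshtein distance can be turned into a similarity score between 0 and 1 by normalizing against the longer string. This is just one common normalization, not necessarily the exact formula the package uses:

package examples

// levenshtein computes the minimum number of single-character edits
// (insertions, deletions, substitutions) needed to turn a into b.
func levenshtein(a, b string) int {
	ar, br := []rune(a), []rune(b)
	prev := make([]int, len(br)+1)
	curr := make([]int, len(br)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ar); i++ {
		curr[0] = i
		for j := 1; j <= len(br); j++ {
			cost := 1
			if ar[i-1] == br[j-1] {
				cost = 0
			}
			// Deletion, insertion, or substitution, whichever is cheapest.
			// (Uses the min builtin from Go 1.21+.)
			curr[j] = min(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(br)]
}

// similarity normalizes the distance to a score in [0, 1],
// where 1 means the strings are identical.
func similarity(a, b string) float64 {
	ar, br := []rune(a), []rune(b)
	if len(ar) == 0 && len(br) == 0 {
		return 1
	}
	longest := max(len(ar), len(br))
	return 1 - float64(levenshtein(a, b))/float64(longest)
}

With this normalization, “plong” vs. “pong” has a distance of 1 and a score of 1 - 1/5 = 0.8.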
For strict exact matching, you can do:
result := e.Score(sample, eval.LexicalSimilarityScorer(eval.ExactMatch))
Semantic similarity goes deeper than comparing characters; it looks at the meaning of the response:
embedder := &myEmbeddingProvider{} // Implements the embeddingGetter interface
result := e.Score(sample, eval.SemanticSimilarityScorer(embedder, eval.CosineSimilarity))
This approach needs something that can convert text into vector representations (embeddings), which are then compared using cosine similarity. It’s super useful when you care more about the meaning than the exact wording.
Semantic similarity is sometimes more appropriate for evaluating LLM outputs than exact matching, since many different phrasings can express the same meaning. Embedding-based metrics make sense when you care about conceptual accuracy rather than precise wording.
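For reference, cosine similarity itself is straightforward to compute. Here’s a small sketch using the standard formula (dot product divided by the product of the vector magnitudes), independent of the package’s own implementation:

package examples

import "math"

// cosineSimilarity returns a value between -1 and 1, where 1 means the two
// vectors point in the same direction, i.e. the texts are semantically very
// similar according to the embedding model.
func cosineSimilarity(a, b []float64) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}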
Here’s a quick example of implementing the embedding interface with OpenAI:
package examples_test

import (
	"context"
	"errors"

	// Note: adjust import paths and SDK versions to match your own setup.
	"github.com/openai/openai-go"
	"github.com/openai/openai-go/shared"
	"maragu.dev/env"
	"maragu.dev/gai"
)

type embeddingGetter struct{}

func (e *embeddingGetter) GetEmbedding(v string) ([]float64, error) {
	client := gai.NewOpenAIClient(gai.NewOpenAIClientOptions{Key: env.GetStringOrDefault("OPENAI_KEY", "")})
	res, err := client.Client.Embeddings.New(context.Background(), openai.EmbeddingNewParams{
		Input:          openai.F[openai.EmbeddingNewParamsInputUnion](shared.UnionString(v)),
		Model:          openai.F(openai.EmbeddingModelTextEmbedding3Small),
		EncodingFormat: openai.F(openai.EmbeddingNewParamsEncodingFormatFloat),
		Dimensions:     openai.F(int64(128)),
	})
	if err != nil {
		return nil, err
	}
	if len(res.Data) == 0 {
		return nil, errors.New("no embeddings returned")
	}
	return res.Data[0].Embedding, nil
}
I should mention that my maragu.dev/gai module is still under heavy development, and breaking changes will likely occur.
At the moment, only these two Scorers (lexical and semantic similarity) exist, but I’m working on adding more specialized evaluation methods soon. In particular, I’m planning an LLM-as-a-judge Scorer. More on that later.
During development, you can run your evals to make sure your changes don’t break anything:
$ go test -run TestEval ./...
This gives you immediate feedback on how your changes affect your LLM’s performance.
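If you want a regression to actually fail the run, one option is to assert a minimum score inside the eval. Here’s a hypothetical variant of the earlier example, reusing the powerfulLLM type from before; I’m assuming the value returned by e.Score exposes a numeric Score field, so check the package documentation for the exact type:

// TestEvalPromptWithThreshold fails if the score drops below a threshold.
func TestEvalPromptWithThreshold(t *testing.T) {
	eval.Run(t, "answers with something close to a pong", func(e *eval.E) {
		llm := &powerfulLLM{response: "plong"}

		sample := eval.Sample{
			Input:    "ping",
			Output:   llm.Prompt("ping"),
			Expected: "pong",
		}

		result := e.Score(sample, eval.LexicalSimilarityScorer(eval.LevenshteinDistance))
		e.Log(sample, result)

		// Hypothetical threshold check: fail the eval if performance regresses.
		if result.Score < 0.8 {
			t.Errorf("score %.2f is below threshold 0.80", result.Score)
		}
	})
}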
You can also add evaluations to your CI pipeline. Here’s an outline of a GitHub Actions setup:
jobs:
  evaluate:
    name: Evaluate
    runs-on: ubuntu-latest

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version-file: go.mod

      - name: Evaluate
        run: go test -run TestEval ./...
        env:
          LLM_KEY: ${{ secrets.LLM_KEY }}

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: evals.jsonl
          path: evals.jsonl
(I’m also working on running evals in CI and presenting the results directly in the PR.)
The evals.jsonl file contains the details of each evaluation run: the sample (input, output, and expected output), the resulting score, and timing information.
This structured format makes it easy to build dashboards or reports to track how your LLM performance evolves over time.
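If you want to roll your own reporting, a small sketch like this reads the file line by line without assuming anything about its exact fields; each line is decoded into a generic map that you can feed into whatever report or dashboard you like:

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := os.Open("evals.jsonl")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Each line in a JSONL file is a standalone JSON document.
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		var record map[string]any
		if err := json.Unmarshal(scanner.Bytes(), &record); err != nil {
			log.Fatal(err)
		}
		fmt.Println(record)
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}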
(And now that I mention it (why, thank you, Markus!), I’m also building evals.fun to take eval results in and spit out some nice graphs, to track changes over time. Early days for that one.)
The eval package from my maragu.dev/gai module gives Go developers a nice little tool for systematically evaluating LLM performance, fitting right into Go’s testing infrastructure. By taking an evaluation-driven approach, you can measure how well your LLM is doing, catch regressions during development and in CI, and track how results evolve over time.
This approach brings the discipline of software testing to AI development, helping bridge the gap between traditional code and the more unpredictable world of LLMs.
While the maragu.dev/gai module is still in its early stages, the evaluation framework provides a useful starting point for LLM testing. I’m actively developing more scorers and evaluation tools. Just be aware that the API is not yet finalized, and breaking changes are likely as the module evolves. I welcome contributions and ideas from Go developers building with LLMs. Go create those issues!
Also, join the r/LLMGophers subreddit! :D
I’m Markus, an independent software consultant. 🤓✨
See my services or reach out at markus@maragu.dk.