pgvector Without Embeddings: When a Feature Vector Beats Semantic Search

Almost every pgvector tutorial starts the same way. Take some text, run it through an embedding model, store the resulting vector, and search it with natural language. That is a real and useful pattern, and it is also why most engineers walk away thinking pgvector is a tool for one job: semantic search over text.

It is not. pgvector is a tool for finding the nearest vectors to a target vector. Where those vectors come from is entirely up to you. An embedding model is one source. A column of numbers you computed yourself is another, and for a large class of problems it is the better one.

We learned this building a "find similar players" feature for a baseball side project, and the lesson generalizes to almost any similarity problem over structured data. Here is a distinction that is rarely drawn clearly, and how to know which side of it you are on.

What pgvector actually does

Strip away the AI framing and pgvector adds three things to Postgres. A column type, vector(N), that stores an array of N floats. A set of distance operators that compare two vectors: <-> for L2 (Euclidean) distance, <=> for cosine distance, and <#> for negative inner product. And approximate-nearest-neighbor indexes (HNSW and IVFFlat) so that finding the closest vectors to a target stays fast as the table grows.

That is the whole job. Given a target vector, return the rows whose vectors are closest to it, ranked, quickly. pgvector has no opinion about whether your 384 numbers came out of a transformer or out of a SELECT and some arithmetic. The math is identical.
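To make "the math is identical" concrete, here is a minimal Python sketch of what the three operators compute for a pair of vectors. The function names are ours, chosen for illustration; pgvector does this work inside Postgres.

```python
import math


def l2_distance(a, b):
    """Euclidean distance: the quantity pgvector's <-> operator returns."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 minus cosine similarity: the quantity <=> returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def negative_inner_product(a, b):
    """Negated dot product: the quantity <#> returns."""
    return -sum(x * y for x, y in zip(a, b))


print(l2_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

Nothing in these functions cares where the numbers came from, which is the whole point.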

This matters because the moment you stop thinking "embedding column" and start thinking "any vector I can construct," a different design space opens up.

Embeddings versus feature vectors

An embedding is a vector produced by a model. You hand a chunk of text or an image to something like a sentence transformer or CLIP, and it returns a few hundred floats that encode the meaning of the input in a way you did not design and cannot fully interpret. Embeddings shine when the input is unstructured and you do not know in advance which features matter. You could never write down by hand what makes two paragraphs "similar in meaning." The model learned that for you.

A feature vector is a vector you build deliberately. You decide what the dimensions are, you compute each one from your own data, and you normalize them so they live on a comparable scale. Every dimension means something you can name. Feature vectors shine when the input is structured and you already know what makes two things similar.

The trap is treating these as the same thing because they both end up in a vector column. They are not. One is a learned, opaque representation of unstructured data. The other is an explicit, interpretable description of structured data. Calling a hand-built feature vector an "embedding" because it sounds more sophisticated is the kind of overstatement that falls apart the moment someone reads your code and finds a SELECT AVG(...) where they expected a model call. Name it for what it is.

The worked example: comparable players

We wanted to answer a coach's question: "show me pitchers similar to this one." Similarity here is not one number. A pitcher is a bundle of tendencies. What he throws and how often. Where he locates each pitch. How hard he throws it. How his behavior changes with the count. No single SQL WHERE clause captures "pitches like this guy," which is the signal that you are in vector territory rather than filter territory.

So we built a feature vector per pitcher straight from the pitch-by-pitch data:

Arsenal mix: the percentage of each pitch type (fastball, slider, changeup, and so on). One dimension per type.
Location: the mean and spread of the horizontal and vertical location for each pitch type, computed from normalized zone coordinates.
Velocity: average and range, where it exists.
Count tendencies: how the pitch mix shifts in hitter-friendly versus pitcher-friendly counts.

Each of those is a plain aggregate over the pitcher's rows. Concatenate them, normalize, and you have a vector that describes how a pitcher actually operates. Two pitchers who attack the zone the same way land near each other in that space, and pgvector's nearest-neighbor search finds them.
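As a rough sketch of "a plain aggregate over the pitcher's rows," the Python below builds the arsenal-mix and mean-location dimensions for one pitcher. The row shape (dicts with pitch_type, x, and z keys), the PITCH_TYPES ordering, and the function name are assumptions made for illustration, not our actual schema. Spread, velocity, and count tendencies follow the same pattern.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical fixed ordering so every pitcher's vector has the same layout.
PITCH_TYPES = ["fastball", "slider", "changeup"]


def build_raw_features(pitches):
    """Return [mix per type..., mean x and mean z per type...] for one pitcher.

    Location dimensions are None when the pitcher never threw that type;
    they must be handled (for example, imputed) before the vector is stored.
    """
    by_type = defaultdict(list)
    for pitch in pitches:
        by_type[pitch["pitch_type"]].append(pitch)

    total = len(pitches)
    features = []

    # Arsenal mix: share of each pitch type, one dimension per type.
    for pitch_type in PITCH_TYPES:
        rows = by_type.get(pitch_type, [])
        features.append(len(rows) / total if total else 0.0)

    # Location: mean horizontal (x) and vertical (z) position per type.
    for pitch_type in PITCH_TYPES:
        rows = by_type.get(pitch_type, [])
        features.append(mean(r["x"] for r in rows) if rows else None)
        features.append(mean(r["z"] for r in rows) if rows else None)

    return features
```

Fixing the dimension order up front matters: position 0 has to mean "fastball share" for every pitcher, or the distances are meaningless.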

The schema is unremarkable:

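CREATE EXTENSION IF NOT EXISTS vector;

-- 32 is the length of the feature vector; it must match what the application writes.
ALTER TABLE pitcher_profiles
  ADD COLUMN feature_vec vector(32);

-- vector_cosine_ops matches the <=> (cosine distance) operator used at query time.
CREATE INDEX ON pitcher_profiles
  USING hnsw (feature_vec vector_cosine_ops);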

And the query is the part people expect to be hard and is not:

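-- @target_id and @target_vec are parameters supplied by the application.
SELECT id, name
FROM pitcher_profiles
WHERE id <> @target_id                  -- exclude the pitcher we are comparing against
ORDER BY feature_vec <=> @target_vec    -- cosine distance; smaller means more similar
LIMIT 10;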

In a .NET app with EF Core and the Pgvector provider, building the target vector and running that order-by is a few lines. The intelligence is not in the query. It is in the feature design.
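To show how little the query itself is doing, here is a brute-force Python equivalent over an in-memory dictionary. The profiles structure and the similar_pitchers name are hypothetical. Note one difference: this sketch is exact, while an HNSW index is approximate and may occasionally return a slightly different top ten.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity, the same quantity as pgvector's <=>."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def similar_pitchers(profiles, target_id, limit=10):
    """Exact nearest neighbors: skip the target, sort by distance, truncate."""
    target_vec = profiles[target_id]["feature_vec"]
    scored = [
        (cosine_distance(profile["feature_vec"], target_vec), pid, profile["name"])
        for pid, profile in profiles.items()
        if pid != target_id  # WHERE id <> @target_id
    ]
    scored.sort()  # ORDER BY feature_vec <=> @target_vec
    return [(pid, name) for _, pid, name in scored[:limit]]  # LIMIT
```

Every line maps onto a clause of the SQL, and none of them knows anything about baseball. The domain knowledge lives entirely in how feature_vec was built.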

Two things that decide whether it works

Normalization. Your dimensions start on wildly different scales. A pitch-mix percentage lives between 0 and 1. A velocity average might be 88 mph. If you drop both into a vector raw, velocity dominates the distance and the mix percentages become noise. Standardize every dimension (z-score across the population, or min-max into a fixed range) before you store it, or your "similarity" just measures whichever feature happens to have the largest raw scale.
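A minimal sketch of the z-score option, assuming the raw vectors for the whole population are available in memory as a list of equal-length lists. The function name is ours, and the choice of the population standard deviation (pstdev) is an assumption; the sample version would work just as well.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each dimension to mean 0 and standard deviation 1.

    rows: one raw feature vector per pitcher, all the same length.
    A dimension with no variation carries no information, so it maps to 0.0.
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Column 0 is a pitch-mix share, column 1 is velocity in mph.
raw = [[0.2, 86.0], [0.4, 90.0], [0.6, 94.0]]
standardized = zscore_columns(raw)
```

After standardizing, the two columns in this example end up on the same scale, so a one-standard-deviation difference in mix counts as much as a one-standard-deviation difference in velocity. Keep the means and standard deviations around: the target vector at query time has to be transformed with the same values.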

Weighting, which is the feature vector's superpower. Because you built the dimensions, you can decide that arsenal mix matters twice as much as location by scaling those dimensions before the distance is computed. You cannot do that with a black-box embedding. You get whatever the model decided was important. When you know your domain, that control is worth more than any learned representation, and it is the single best argument for the hand-built approach.
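One detail is easy to get wrong. For squared Euclidean distance, multiplying a dimension by a factor k multiplies its contribution by k squared, so "matters twice as much" means scaling by the square root of 2, not by 2. The sketch below assumes already-standardized vectors and uses hypothetical function names. With cosine distance the same scaling still shifts emphasis toward the weighted dimensions, but the relationship is not as clean.

```python
import math


def apply_weights(vec, weights):
    """Scale each dimension by sqrt(weight) so that its contribution to
    squared Euclidean distance is multiplied by exactly `weight`."""
    return [v * math.sqrt(w) for v, w in zip(vec, weights)]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Dimension 0 (say, an arsenal-mix feature) counts double; dimension 1 is unchanged.
weights = [2.0, 1.0]
a = apply_weights([1.0, 1.0], weights)
b = apply_weights([0.0, 0.0], weights)
```

Here the unweighted squared distance between the two vectors is 2 (1 from each dimension), and the weighted one is approximately 3 (2 from the first dimension, 1 from the second). The stored vectors and the query vector must be scaled with the same weights, or the comparison is meaningless.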

One honest caveat from our own data: velocity was frequently missing, because the entry UI had no radar input. A vector with a sparse dimension behaves badly, since "missing" is not the same as "zero." Design the vector so it degrades gracefully when a feature is absent, either by imputing the population mean or by dropping the dimension and renormalizing. Real structured data is messier than the tutorial version, and the feature approach at least lets you see and handle that, because you can read every dimension.
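A sketch of the mean-imputation option, assuming missing values are represented as None in the raw vectors. The function name and the 0.0 fallback for a column with no observations are our choices for illustration.

```python
from statistics import mean


def impute_column_means(rows):
    """Replace None in each dimension with the mean of the observed values.

    A column with no observed values at all falls back to 0.0.
    """
    columns = list(zip(*rows))
    col_means = []
    for col in columns:
        observed = [v for v in col if v is not None]
        col_means.append(mean(observed) if observed else 0.0)
    return [
        [m if v is None else v for v, m in zip(row, col_means)]
        for row in rows
    ]


# Column 1 is velocity in mph; the second pitcher has no radar readings.
raw = [[0.5, 88.0], [0.25, None], [0.75, 92.0]]
filled = impute_column_means(raw)
print(filled[1])  # [0.25, 90.0]
```

Imputing before a z-score step has a convenient side effect: the population mean standardizes to 0, so a missing velocity contributes nothing that separates this pitcher from the average one. That is what "degrades gracefully" means in practice.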

When embeddings still win, and when to use neither

Feature vectors are the right tool when the input is structured and you can name what matters. Flip either of those and you want something else.

If the input is unstructured, embeddings win. Free-text scouting notes, a coach's written report, an image: you cannot hand-engineer features for those, so let a model do it. The strongest systems often run both. Use a feature vector for the structured tendencies and an embedding for the prose notes, then combine the two scores. pgvector holds both columns happily.
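"Combine the two scores" can mean several things. One simple option, sketched here as an assumption rather than a recommendation, is a weighted sum of the two distances. The alpha parameter and both function names are hypothetical, and the approach only makes sense when the two distances are on comparable scales (for example, both cosine distances).

```python
def combined_distance(feature_dist, text_dist, alpha=0.7):
    """Blend two distances; alpha is the share given to the feature vector."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * feature_dist + (1.0 - alpha) * text_dist


def rank_candidates(candidates, alpha=0.7):
    """candidates: (id, feature_dist, text_dist) tuples.

    Returns the ids ordered from most to least similar.
    """
    scored = sorted(
        (combined_distance(f, t, alpha), cid) for cid, f, t in candidates
    )
    return [cid for _, cid in scored]
```

With alpha at 1.0 this reduces to the pure feature-vector ranking, and at 0.0 to the pure embedding ranking, which makes it easy to check how much the prose notes are actually changing the results.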

And sometimes the answer is no vector at all. If "similar" really comes down to one or two numeric columns, you do not need pgvector, you need a WHERE clause and an ORDER BY. Reaching for vector search there is theater. It looks advanced and adds a dependency to do what an index on a column already does. The gut check we use: vectors earn their place only when similarity spans many dimensions at once and no single filter captures it. Below that bar, plain SQL is not just adequate, it is better.

The takeaway

pgvector is a nearest-neighbor engine, not a semantic-search feature. Once you internalize that, "do I need an embedding model here" becomes a real question with a real answer instead of a reflex. For unstructured data where the important features are unknowable, embed. For structured data where you know exactly what makes two things alike, build the feature vector yourself, normalize it, weight it, and let pgvector do the search. You will ship something simpler, faster, and fully explainable, and you will be able to answer the only question that matters in a code review: why are these two results considered similar.

The honest version of that answer is worth more than the impressive-sounding one.

Agave Information Solutions builds custom software, data architecture, and on-premises AI systems out of Scottsdale, Arizona. If you have a similarity, search, or retrieval problem and are not sure whether you need a model or just better-shaped data, get in touch.
