
Sparse Versus Dense Vectors

July 29, 2024

What Are Sparse Vectors

Sparse vectors are vectors that contain mostly zeros, as opposed to dense vectors, which store a value for every element. In a dense vector, every element is stored explicitly; in a sparse vector, only the non-zero elements are stored, and the zeros are not stored at all. This makes sparse vectors more efficient in storage and computation than dense vectors, though they can lose some information if the data is too sparse.
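To make the difference concrete, here is a minimal, framework-free sketch: the same vector stored densely as a list of every position, and sparsely as a mapping of only the non-zero positions.

```python
# Same data, two representations: a dense list stores every position,
# while a sparse mapping stores only the non-zero entries.

def to_sparse(dense):
    """Keep just the non-zero entries, keyed by position."""
    return {i: v for i, v in enumerate(dense) if v != 0}

def to_dense(sparse, size):
    """Rebuild the full vector, filling the missing positions with zeros."""
    return [sparse.get(i, 0) for i in range(size)]

dense = [0, 0, 3.0, 0, 0, 0, 1.5, 0]
sparse = to_sparse(dense)

print(sparse)                        # {2: 3.0, 6: 1.5} -- 2 entries instead of 8
print(to_dense(sparse, 8) == dense)  # True -- nothing is lost for exact zeros
```

Real vector stores use the same idea at scale (index/value pairs), just with compressed index structures rather than a Python dict.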

TF-IDF and BM25 produce sparse vectors. Their basic method is to transform a count matrix into a set of numerical features that can be used as input to a machine learning algorithm or ranking function. The resulting vector contains non-zero values only for the terms that are present in the document, and zero values for all other terms.

Sparse Vector Example
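As a toy illustration (a simplified formula, not the exact scikit-learn or Lucene variant), a TF-IDF vector can be computed like this. Note that the resulting vector only contains entries for terms that actually occur in the document:

```python
import math
from collections import Counter

# Toy corpus; each document is a whitespace-tokenized string.
docs = [
    "pluto is a dwarf planet",
    "mars is a planet",
    "the cat sat on the mat",
]

def tfidf_vector(doc, corpus):
    """Toy TF-IDF: term frequency times inverse document frequency.
    Real implementations (Lucene's BM25, scikit-learn) use smoothed variants."""
    tokens = doc.split()
    counts = Counter(tokens)
    vec = {}
    for term, count in counts.items():
        df = sum(1 for d in corpus if term in d.split())
        vec[term] = (count / len(tokens)) * math.log(len(corpus) / df)
    return vec

vec = tfidf_vector(docs[0], docs)
# Only terms present in the document get an entry -- everything else
# (the implicit zeros) is simply not stored.
print(sorted(vec, key=vec.get, reverse=True))
```

Terms that are rare in the corpus ("pluto", "dwarf") get higher weights than terms that appear everywhere ("is", "a").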

Combining a method like this with search is computationally efficient for information retrieval. These methods aren't very 'forgiving', though: they don't handle missing or synonymous terms, and they don't capture word relationships or order. This makes them less useful for semantic retrieval.
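The 'unforgiving' part is easy to demonstrate: two texts that share no exact terms score zero against each other under a sparse lexical model, even when they mean the same thing.

```python
# Sparse lexical vectors keyed by term. With no shared terms, the score
# is zero regardless of meaning -- the synonym problem described above.

def sparse_dot(a, b):
    """Dot product over the terms two sparse vectors have in common."""
    return sum(weight * b[term] for term, weight in a.items() if term in b)

query = {"car": 1.2, "price": 0.8}
doc = {"automobile": 1.1, "cost": 0.9}

print(sparse_dot(query, doc))  # 0 -- no shared terms, so no match at all
```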

How Does This Relate to AI and ELSER?

ELSER (Elastic’s custom model), SPLADE, and other learned sparse models perform term weighting and expansion, as shown on Elastic’s Search Labs blog. If you request an inference for “Is Pluto a Planet?”, the inference results are:

```json
{
  "inference_results": [
    {
      "predicted_value": {
        "pluto": 3.014208,
        "planet": 2.6253395,
        "planets": 1.7399588,
        "alien": 1.1358738,
        "mars": 0.8806293,
        "genus": 0.8014013,
        "europa": 0.6215426,
        "a": 0.5890018,
        "asteroid": 0.5530223,
        "neptune": 0.5525891,
        "universe": 0.5023148,
        "venus": 0.47205976,
        "god": 0.37106854,
        // Shortened for brevity
      }
    }
  ]
}
```

These can be stored in the index in a similar manner to normal tokens. When you combine the expansion on the query side and the index side you can see significantly improved recall for your queries.
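A sketch of why expansion on both sides helps recall, using invented weights (not real ELSER output): the expanded terms are stored with the document, the query is expanded the same way, and scoring is a dot product over the weighted terms they share.

```python
def score(query_terms, doc_terms):
    """Dot product over the weighted terms the query and document share."""
    return sum(w * doc_terms[t] for t, w in query_terms.items() if t in doc_terms)

# Without expansion, a query about a "car" never matches a document that
# only says "automobile".
print(score({"car": 1.0}, {"automobile": 1.0, "sale": 0.7}))  # 0

# With expansion on both sides (weights invented for illustration),
# related terms overlap and the document is recalled.
query_expanded = {"car": 1.0, "automobile": 0.8, "vehicle": 0.6}
doc_expanded = {"automobile": 1.0, "car": 0.9, "vehicle": 0.7, "sale": 0.7}
print(score(query_expanded, doc_expanded) > 0)  # True -- matched via expansion
```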

The Advantages

All of this matters because you can use these techniques to improve your search results: you get some of the semantic capabilities of dense vector search with the performance characteristics of traditional lexical search.

Memory usage

Qdrant published an article on the benefits of sparse vectors and noted the size improvements. This table is copied from that article:

| Vector Type       | Memory (GB) |
|-------------------|-------------|
| Dense BERT Vector | 6.144       |
| OpenAI Embeddings | 12.288      |
| Sparse Vector     | 1.12        |

As you can see, sparse vectors are much smaller than the other two, which means you can store far more data in a single index. Sparse vectors are also CPU-friendly.
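As a back-of-the-envelope check on the table above, the dense figures are consistent with one million vectors stored as 8-byte floats (an assumption for illustration; the article does not spell out its parameters here):

```python
# Estimate dense vector storage: every dimension costs bytes whether or
# not it is zero, so size is fixed by dimensionality alone.

def dense_size_gb(num_vectors, dims, bytes_per_value=8):
    """Storage for a dense matrix of num_vectors x dims values."""
    return num_vectors * dims * bytes_per_value / 1e9

print(dense_size_gb(1_000_000, 768))   # 6.144 -- BERT-sized vectors
print(dense_size_gb(1_000_000, 1536))  # 12.288 -- OpenAI ada-002-sized

# A sparse vector pays only for its non-zero entries (index + value),
# so its footprint tracks the number of terms per document rather than
# a fixed dimensionality.
```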

MC+A also published an article detailing the differences in vector size. The key point is that dense vectors in Elasticsearch are stored in memory outside of the JVM heap, which adds significant resource requirements.

Relevance

Both the SPLADE authors and Elastic have published results demonstrating the benefits of using sparse vectors. You can see Elastic’s article here.

It’s important to test with your own queries and data before deciding which approach works for you. At MC+A we did not see the improvement that others have claimed, so results will vary case by case.

Takeaway

Sparse vector models like ELSER integrate into Elasticsearch to perform text expansion. This gives you a ‘semantic light’ approach that can provide significant improvements. You still have to test against your use case and the typical queries your users run, but it is worth looking into.
