Chunk duplicates

One of the stand-out features of Trieve is how we detect and handle duplicates. In this guide, we’ll look at how Trieve handles duplicates to achieve maximal-marginal-relevance (MMR).

Our primary competitors like Vectara and Pinecone are flat out missing this feature with no intent to implement it. This is a huge advantage for Trieve, as it allows us to provide a much better user experience.

What is MMR?

Maximal-marginal-relevance (MMR) is a technique used in information retrieval to select a diverse set of results. It is used to prevent redundancy in search results, and is used in Trieve to prevent duplicate chunks from being returned in search results on a top-level.

Why do we call duplicates "collisions" in Trieve?

When you vector'ize data with an embedding model, you are converting it into a vector that exists in some n dimensional space.

The distance between two vectors is then the the "similarity" between the two pieces of data. Vectors can be close enough that you can imagine them actually colliding in the n dimensional space as a "semantic collision".

In Trieve, you can set an environment variable called DUPLICATE_DISTANCE_THRESHOLD to determine how similar two vectors have to be to be considered a duplicate. The default value is 0.95, which means that if two vectors have a cosine similarity of 0.95 or above, they are considered duplicates.

How Trieve's business logic for handling duplicates works

1. The ChunkCollisions model

We create the chunk_collisions SQL table in this migration. There are a couple of sub-sequent migrations to edit the table, but the core of the table is created in that migration.

The ChunkCollisions model is then defined in this file as shown below:

#[derive(Debug, Serialize, Deserialize, Queryable, Selectable, Insertable, Clone)]
#[diesel(table_name = chunk_collisions)]
pub struct ChunkCollisions {
    pub id: uuid::Uuid,
    pub chunk_id: uuid::Uuid,
    pub collision_qdrant_id: Option<uuid::Uuid>,
    pub created_at: chrono::NaiveDateTime,
    pub updated_at: chrono::NaiveDateTime,
}

Each ChunkCollisions row will relate to a ChunkMetadata row in the chunk_metadata table via the chunk_id column. The collision_qdrant_id column is used to relate the ChunkCollisions row to a Qdrant point in the qdrant db.

2. When a chunk is created, we check for collisions

When a chunk is created, we find the nearest point using a function called first_semantic_result as follows:

let first_semantic_result = global_unfiltered_top_match_query(embedding_vector.clone())
  .await
  .map_err(|err| {
      ServiceError::BadRequest(format!(
          "Could not get semantic similarity for collision check: {}",
          err.message
      ))
  })?;

3. If the first_semantic_result is within the DUPLICATE_DISTANCE_THRESHOLD of 0.95, we create a ChunkCollisions row and not a Qdrant point

If the first_semantic_result is within the DUPLICATE_DISTANCE_THRESHOLD of 0.95, we then only create a ChunkCollisions row and not a Qdrant point. This is because we don't want to create a Qdrant point for a duplicate chunk. It will worsen search time performance and prevent MMR from working.

The implementation of this logic can be found in chunk_handler.rs here.

The result in our default search UI

If you search for "Enron’s liquidity is very solid and there is no need for any panic or alarm" on our Enron demo then you will see arrows on the sides of resulting chunks as shown below:

Arrows on the sides of resulting chunks

These arrows allow you to navigate between chunks that are duplicates of each other.

Thanks for reading! If you have any questions, please reach out to us on Twitter, Matrix, or Discord.