Chunk duplicates
One of the stand-out features of Trieve is how we detect and handle duplicates. In this guide, we’ll look at how Trieve handles duplicates to achieve maximal-marginal-relevance (MMR).
Our primary competitors like Vectara and Pinecone are flat out missing this feature with no intent to implement it. This is a huge advantage for Trieve, as it allows us to provide a much better user experience.
What is MMR?
Maximal-marginal-relevance (MMR) is a technique used in information retrieval to select a diverse set of results. It is used to prevent redundancy in search results, and is used in Trieve to prevent duplicate chunks from being returned in search results on a top-level.
Why do we call duplicates "collisions" in Trieve?
When you vector'ize data with an embedding model, you are converting it into a vector that exists in some n
dimensional space.
The distance between two vectors is then the the "similarity" between the two pieces of data. Vectors can be close enough that you can imagine them actually colliding in the n
dimensional space as a "semantic collision".
In Trieve, you can set an environment variable called DUPLICATE_DISTANCE_THRESHOLD
to determine how similar two vectors have to be to be considered a duplicate. The default value is 0.95
, which means that if two vectors have a cosine similarity of 0.95 or above, they are considered duplicates.
How Trieve's business logic for handling duplicates works
1. The ChunkCollisions model
We create the chunk_collisions
SQL table in this migration. There are a couple of sub-sequent migrations to edit the table, but the core of the table is created in that migration.
The ChunkCollisions model is then defined in this file as shown below:
#[derive(Debug, Serialize, Deserialize, Queryable, Selectable, Insertable, Clone)]
#[diesel(table_name = chunk_collisions)]
pub struct ChunkCollisions {
pub id: uuid::Uuid,
pub chunk_id: uuid::Uuid,
pub collision_qdrant_id: Option<uuid::Uuid>,
pub created_at: chrono::NaiveDateTime,
pub updated_at: chrono::NaiveDateTime,
}
Each ChunkCollisions
row will relate to a ChunkMetadata
row in the chunk_metadata
table via the chunk_id
column. The collision_qdrant_id
column is used to relate the ChunkCollisions
row to a Qdrant
point in the qdrant
db.
2. When a chunk is created, we check for collisions
When a chunk is created, we find the nearest point using a function called first_semantic_result
as follows:
let first_semantic_result = global_unfiltered_top_match_query(embedding_vector.clone())
.await
.map_err(|err| {
ServiceError::BadRequest(format!(
"Could not get semantic similarity for collision check: {}",
err.message
))
})?;
3. If the first_semantic_result
is within the DUPLICATE_DISTANCE_THRESHOLD
of 0.95
, we create a ChunkCollisions
row and not a Qdrant
point
If the first_semantic_result
is within the DUPLICATE_DISTANCE_THRESHOLD
of 0.95
, we then only create a ChunkCollisions
row and not a Qdrant
point. This is because we don't want to create a Qdrant
point for a duplicate chunk. It will worsen search time performance and prevent MMR from working.
The implementation of this logic can be found in chunk_handler.rs here.
The result in our default search UI
If you search for "Enron’s liquidity is very solid and there is no need for any panic or alarm" on our Enron demo then you will see arrows on the sides of resulting chunks as shown below:
These arrows allow you to navigate between chunks that are duplicates of each other.
Thanks for reading! If you have any questions, please reach out to us on Twitter, Matrix, or Discord.