Chunk Upload and Search Guide

This guide walks you through uploading your first chunks into a Dataset and searching for them. There are several ways to migrate your data to Trieve. This is the simplest one: you chunk your data manually and upload it directly to Trieve.

Prerequisites

  • You have a Trieve account and are logged in.
  • You have a Dataset created.
  • You have an API Key from the Trieve dashboard.

Step 1: Chunk your data

The first step is to chunk your data into smaller pieces. Chunking manually lets you control how your data is split and indexed within Trieve, which leads to better search results. You can chunk your data in multiple ways, such as by lines, by words, or by a custom delimiter. For this guide, we will chunk a text file into 10-sentence chunks.

import re

def chunk_text_file(file_path, chunk_size=10):
    with open(file_path, 'r') as file:
        text = file.read()
    sentences = re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', text) # Split the text into sentences
    # Group the sentences into chunks of `chunk_size` and join each group into a single string,
    # because each chunk is uploaded as one piece of text
    chunks = [' '.join(sentences[i:i + chunk_size]) for i in range(0, len(sentences), chunk_size)]
    return chunks
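
The other strategies mentioned above (by words, or by a custom delimiter) follow the same pattern. The following is a minimal sketch rather than anything Trieve prescribes: `chunk_by_words` and `chunk_by_delimiter` are hypothetical helper names, and the chunk size and delimiter are illustrative assumptions you would tune for your own data.

```python
def chunk_by_words(text, words_per_chunk=100):
    """Split text into chunks of at most `words_per_chunk` whitespace-separated words."""
    words = text.split()
    return [
        ' '.join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def chunk_by_delimiter(text, delimiter='\n\n'):
    """Split text on a custom delimiter, dropping empty or whitespace-only pieces."""
    return [piece.strip() for piece in text.split(delimiter) if piece.strip()]
```

Both helpers return a list of strings, so each element can be uploaded as the text of one chunk.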

Step 2: Upload your chunks

Now that you have your chunks, you can upload them to your Dataset using the Trieve API. You will need your API Key and Dataset ID to do this.

import requests

def upload_chunks(api_key, dataset_id, chunks):
    url = 'https://api.trieve.ai/api/chunk'
    headers = {
        'Authorization': api_key, # Your Trieve API Key from the dashboard
        'TR-Dataset': dataset_id, # The ID of the Dataset you created in the dashboard
        'Content-Type': 'application/json'
    }
    for index, chunk in enumerate(chunks):
        data = {
            'chunk_html': chunk, # The text content of the chunk you want indexed
            'link': "https://example.com", # The link to the chunk if it exists on the web
            'tag_set': ["tag1", "tag2", "tag3"], # Tags to associate with the chunk. You can use these tags to filter your search results
            'timestamp': "2024-03-02T03:02:44Z", # Can be in almost any format that contains a date and time
            'tracking_id': f"chunk{index}", # A unique identifier that helps you correlate the chunk with your own data
            'metadata': {"key": "value"}, # Any additional metadata you want to associate with the chunk
            'weight': 1 # Optional weight for the chunk: > 1 pushes it up in the results, < 1 pushes it down
        }
        response = requests.post(url, headers=headers, json=data)
        print(response.json())
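
One way to keep tracking IDs unique and to inspect payloads before sending them is to build them separately from the HTTP call. The sketch below is only an illustration: `build_chunk_payloads` and its `id_prefix` parameter are hypothetical names, and it assumes the optional fields (`link`, `tag_set`) can be left out of the request when you have no value for them.

```python
def build_chunk_payloads(chunks, link=None, tag_set=None, id_prefix='chunk'):
    """Return one JSON-serializable payload dict per chunk of text."""
    payloads = []
    for index, chunk in enumerate(chunks):
        payload = {
            'chunk_html': chunk,
            # The index makes each tracking ID unique within this batch
            'tracking_id': f'{id_prefix}-{index}',
        }
        # Optional fields are only included when a value was provided (assumption)
        if link is not None:
            payload['link'] = link
        if tag_set:
            payload['tag_set'] = list(tag_set)
        payloads.append(payload)
    return payloads
```

Each returned dict can then be passed as the `json=` argument of `requests.post`, in the same way as the `data` dict in the upload function.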

Step 3: Search for your chunks

Now that you have uploaded your chunks, you can search for them using the Trieve API.

def search_chunks(api_key, dataset_id, query):
    url = 'https://api.trieve.ai/api/chunk/search'
    headers = {
        'Authorization': api_key,
        'TR-Dataset': dataset_id,
        'Content-Type': 'application/json'
    }
    data = {
        'query': query, # The text you want to search for
        'search_type': 'hybrid', # The kind of search to run: 'semantic', 'fulltext', or 'hybrid'
        'page_size': 10, # The number of results you want to get back
        'filters': {
            "must": [ # All the must filters are combined using AND and have to match for the result to be returned
                {"field": "tag_set", "match": ["Tag you want to filter by"]} # use match for text metadata fields
            ],
            "must_not": [ # All the must_not filters are combined using AND and have to NOT match for the result to be returned
                {"field": "timestamp", "range": {
                    "gte": 1000043043044, # use range for numerical metadata fields (timestamp is stored as a unix timestamp)
                    "lte": 1200030030000
                }}
            ],
            "should": [ # All the should filters are combined using OR and at least one of them has to match for the result to be returned
                {"field": "metadata.key", "match": ["value"]} # use metadata.field to filter based on fields within your metadata JSON object
            ]
        }
    }
    response = requests.post(url, headers=headers, json=data)
    print(response.json())
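
Because the `filters` object has a fixed shape (`must`, `must_not`, `should`), it can be convenient to assemble it with a small helper. This is a minimal sketch under stated assumptions: `build_filters` and its parameter names are hypothetical, it only covers the three kinds of condition shown above, and it passes `match` values as a list, mirroring the `tag_set` filter.

```python
def build_filters(required_tags=None, excluded_timestamp_range=None, optional_metadata=None):
    """Assemble a filters dict with the must / must_not / should structure."""
    filters = {}
    if required_tags:
        # must: every condition has to match (AND)
        filters['must'] = [{'field': 'tag_set', 'match': list(required_tags)}]
    if excluded_timestamp_range is not None:
        # must_not: results inside this unix-timestamp range are excluded
        gte, lte = excluded_timestamp_range
        filters['must_not'] = [
            {'field': 'timestamp', 'range': {'gte': gte, 'lte': lte}}
        ]
    if optional_metadata:
        # should: at least one condition has to match (OR)
        filters['should'] = [
            {'field': f'metadata.{key}', 'match': [value]}
            for key, value in optional_metadata.items()
        ]
    return filters
```

The result can be used as the value of the `'filters'` key in the request body. Sections you do not pass are left out of the dict entirely.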

Conclusion

You have successfully uploaded your first chunks into a Dataset and searched for them using the Trieve API. You can now use the same process to upload and search your own data.