Overview

We provide the ability for users to upload their files to Trieve and use our automatic large language vision model or Apache Tika chunking. When uploading a file to Trieve, we automatically group the chunks together to link them. This is done through our upload file route.

Uploading a File to Trieve

Trieve supports various file types (e.g., HTML, DOCX, PDF). The file is uploaded to a S3 bucket associated with your dataset. As a user, you can use Apache Tika to convert these files to HTML or use a vision LLM to convert the files to markdown. To understand the difference between the two, refer to the section on Apache Tika vs vision model chunking.

Important Parameters

  • base64_file: To allow users to pass metadata with their file uploads, we require files to be base64 URL encoded. Convert + to -, / to _, and remove the ending = if present.
  • file_name: The name of the file being uploaded, including the extension. This will become the name of the resulting group.
  • group_tracking_id: This field allows you to assign an arbitrary ID to the group, aiding in coordination with your database system. You can search for this group using this ID.
  • link, tag_set, and time_stamp: These fields are indexed to enable fast filtering of groups based on these attributes.
  • target_splits_per_chunk: This is an optional field to specify number of splits you want per chunk. If not specified, the default 20 is used.
  • metadata: This field allows you to include any arbitrary metadata in the form of a JSON object with the group.
  • pdf2md_options: This allows you to use vision LLM to convert the files to markdown.
    • use_pdf2md_ocr: If true, the file will be converted to markdown using vision LLM. You can test pdf2md performance at pdf2md.trieve.ai.

For the best filtering performance, we recommend using the link, tag_set, and time_stamp fields, as there are dedicated indexes for these. The metadata field has an index built for match queries but is not optimized for range queries.

Example Upload File Request

Whenever you make a request to the Trieve API, you need to include the TR-Dataset header with your dataset ID and the Authorization header with your API key.

Example Response

File chunking with vision LLM

When uploading a file to Trieve, you can use vision LLMs through our pdf2md service for intelligent document parsing to easily convert PDF content to LLM-ready, structured Markdown. You can test pdf2md performance at pdf2md.trieve.ai.

This allows for better preserving document context, readability, and structure and is especially useful when working with documents containing complex layouts (tables, lists, code blocks, etc).

After the uploaded file is converted into structured Markdown, chunks are created based on the semantic structure of the document allowing for better semantic coherence.

Apache Tika vs Vision LLM chunking

Apache Tika is a more traditional approach that extracts raw text from a document and converts it into HTML. The chunks are subsequently created by splitting at fixed lengths or pre-defined delimeters.

Recommended for:

  • Simple, unstructured documents (e.g. plain text files)

Vision LLM chunking creates chunks based on the semantic structure of the document, allowing for sections (e.g. headers, lists, tables, etc) to stay grouped.

Recommended for:

  • Complex document structures (e.g. technical documentation and reports)
  • Documents where context must be preserved

Heading Based Chunking

Heading based chunking allows you to use split contents based on the headings and body content of the document.

  • Each chunk will contain a heading and its corresponding content.
  • All related content (paragraphs, lists, tables) under the same heading are grouped together.

For example, if you upload the following HTML:

<h1>Introduction</h1>
<p>Hey, this is Trieve!</p>
<h2>Background</h2>
<p>Trieve brings AI search to the modern world.</p>

The response will be:

{
  "chunks": [
    {
      "headings": ["Introduction"],
      "body": "<h1>Introduction</h1><p>Hey, this is Trieve!</p>"
    },
    {
      "headings": ["Background"],
      "body": "<h2>Background</h2><p>Trieve brings AI search to the modern world.</p>"
    }
  ]
}

Grouping with File Upload

When uploading a file to Trieve, all chunks created from the file will be grouped together. The file_name field is used to specify the name of the resulting groups. Once uploaded, documents can be queried using the file_name field, allowing you to retrieve and perform operations on all chunks created from the file.