Upload File
Upload a file to the S3 bucket attached to your dataset. You can choose between a naive chunking strategy, where the text is extracted with Apache Tika and split into segments with a target number of segments per chunk, or a vision LLM that converts the file to markdown and creates chunks per page. The file must be base64url encoded (not plain base64). The authenticated user must be an admin or owner of the dataset's organization to upload a file.
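Because the endpoint expects base64url rather than standard base64, here is a minimal shell sketch for producing the payload value (assumes GNU coreutils base64; the tr calls swap the URL-unsafe characters and strip padding, and whether padding may be kept is worth verifying against the API):

# Encode example.pdf as base64url: standard base64 with '+' -> '-' and '/' -> '_',
# padding stripped. Assumes GNU coreutils; on macOS use `base64 -i example.pdf`.
base64_file=$(base64 -w 0 example.pdf | tr '+/' '-_' | tr -d '=')

The resulting value goes into the base64_file field of the request below.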
curl --request POST \
  --url https://api.trieve.ai/api/file \
  --header 'Authorization: <api-key>' \
  --header 'Content-Type: application/json' \
  --header 'TR-Dataset: <tr-dataset>' \
  --data '{
    "base64_file": "<base64_encoded_file>",
    "create_chunks": true,
    "description": "This is an example file",
    "file_name": "example.pdf",
    "link": "https://example.com",
    "metadata": {
      "key1": "value1",
      "key2": "value2"
    },
    "split_delimiters": [
      ",",
      ".",
      "\n"
    ],
    "tag_set": [
      "tag1",
      "tag2"
    ],
    "target_splits_per_chunk": 20,
    "time_stamp": "2021-01-01 00:00:00.000Z",
    "use_pdf2md_ocr": false
  }'
{
  "file_metadata": {
    "created_at": "2021-01-01 00:00:00.000",
    "dataset_id": "e3e3e3e3-e3e3-e3e3-e3e3-e3e3e3e3e3e3",
    "file_name": "file.txt",
    "id": "e3e3e3e3-e3e3-e3e3-e3e3-e3e3e3e3e3e3",
    "link": "https://trieve.ai",
    "metadata": {
      "key": "value"
    },
    "size": 1000,
    "tag_set": "tag1,tag2",
    "time_stamp": "2021-01-01 00:00:00.000",
    "updated_at": "2021-01-01 00:00:00.000"
  }
}
Authorizations
Headers
TR-Dataset: The dataset id or tracking_id to use for the request. We assume you intend to use an id if the value is a valid uuid.
Body
base64_file: Base64 encoded file. Must be base64url encoded, as noted above.
file_name: Name of the file being uploaded, including the extension.
Will use chunkr.ai to process the file when this object is defined. See docs.chunkr.ai/api-references/task/create-task for detailed information about what each field on this request payload does.
Controls the settings for the chunking and post-processing of each chunk.
Whether to ignore headers and footers in the chunking process. This is recommended as headers and footers break reading order across pages.
The target number of words in each chunk. If 0, each chunk will contain a single segment.
x >= 0
Specifies which tokenizer to use for the chunking process. This type supports two ways of specifying a tokenizer:
- Using a predefined tokenizer from the Tokenizer enum
- Using any Hugging Face tokenizer by providing its model ID as a string (e.g. "facebook/bart-large", "Qwen/Qwen-tokenizer", etc.)
When using a string, any valid Hugging Face tokenizer ID can be specified, which will be loaded using the Hugging Face tokenizers library.
Common tokenizers used for text processing. These values represent standard tokenization approaches and popular pre-trained tokenizers from the Hugging Face ecosystem.
Available options: Word, Cl100kBase, XlmRobertaBase, BertBaseUncased
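Both forms appear as plain strings in the chunking settings object. An illustrative sketch follows; the wrapper field's name is omitted from this page's schema dump, and the key names here are inferred from the descriptions above, so verify the exact nesting against docs.chunkr.ai:

{
  "ignore_headers_and_footers": true,
  "target_length": 512,
  "tokenizer": "Cl100kBase"
}

or, with a Hugging Face model ID:

{
  "tokenizer": "Qwen/Qwen-tokenizer"
}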
The number of seconds until the task is deleted. Expired tasks cannot be updated, polled, or accessed via the web interface.
Whether to use high-resolution images for cropping and post-processing. (Latency penalty: ~7 seconds per page)
Controls the Optical Character Recognition (OCR) strategy.
- All: Processes all pages with OCR. (Latency penalty: ~0.5 seconds per page)
- Auto: Selectively applies OCR only to pages with missing or low-quality text. When a text layer is present, the bounding boxes from the text layer are used.
Available options: All, Auto
Available options: Azure, Chunkr
Controls the post-processing of each segment type. This allows you to generate HTML and Markdown from chunkr models for each segment type. By default, the HTML and Markdown are generated directly from the segmentation information, except for Table, Formula, and Picture. You can optionally configure custom LLM prompts and models to generate an additional llm field with LLM-processed content for each segment type. Which content sources (HTML, Markdown, LLM, Content) of the segment are included in the chunk's embed field and counted towards the chunk length is controlled through the embed_sources setting.
Controls the processing and generation for each segment type:
- crop_image (All | Auto): whether to crop the file's images to the segment's bounding box. The cropped image will be stored in the segment's image field. Use All to always crop, or Auto to only crop when needed for post-processing. The cropping strategy applies to an item (e.g. segment, chunk, etc.): All crops all images in the item; Auto crops images only if required for post-processing.
- html (LLM | Auto): the HTML output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
- llm: prompt for the LLM model, which uses off-the-shelf models to generate a custom output for the segment.
- markdown (LLM | Auto): the Markdown output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
- embed_sources (HTML, Markdown, LLM, Content): defines which content sources will be included in the chunk's embed field and counted towards the chunk length. The array's order determines the sequence in which content appears in the embed field (e.g., [Markdown, LLM] means Markdown content is followed by LLM content). This directly affects what content is available for embedding and retrieval.
The same set of options can be configured independently for each segment type (e.g., Table, Formula, Picture).
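As a concrete illustration of these options for a single segment type, a hedged sketch (key names and casing per docs.chunkr.ai; treat as illustrative rather than authoritative):

"segment_processing": {
  "Table": {
    "crop_image": "Auto",
    "html": "LLM",
    "llm": "Summarize the key figures in this table.",
    "markdown": "LLM",
    "embed_sources": ["Markdown", "LLM"]
  }
}

With embed_sources set to [Markdown, LLM], the chunk's embed field contains the Markdown output first, followed by the LLM output.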
Controls the segmentation strategy:
- LayoutAnalysis: Analyzes pages for layout elements (e.g., Table, Picture, Formula, etc.) using bounding boxes. Provides fine-grained segmentation and better chunking. (Latency penalty: ~TBD seconds per page)
- Page: Treats each page as a single segment. Faster processing, but without layout element detection and only simple chunking.
Available options: LayoutAnalysis, Page
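Tying the strategy options together, an illustrative fragment of the chunkr task object (option names inferred from the descriptions above, not taken from a verified schema; check them against docs.chunkr.ai):

{
  "expires_in": 86400,
  "high_resolution": false,
  "ocr_strategy": "Auto",
  "segmentation_strategy": "LayoutAnalysis"
}

Here Auto OCR skips the per-page OCR cost on pages that already have a usable text layer, while LayoutAnalysis accepts some extra latency in exchange for element-level segmentation.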
Create chunks is a boolean which determines whether or not to create chunks from the file. If false, you can manually chunk the file and send the chunks to the create_chunk endpoint with the file_id to associate chunks with the file. Meant mostly for advanced users.
Description is an optional convenience field so you do not have to remember what the file contains or is about. It will be included on the group resulting from the file, which will hold its chunks.
Group tracking id is an optional field which allows you to specify the tracking id of the group that is created from the file. Chunks created from the file will receive tracking ids of the form group_tracking_id|<index of chunk>; for example, a group_tracking_id of "report-2021" yields chunk tracking ids "report-2021|0", "report-2021|1", and so on.
Link to the file. This can also be any string. This can be used to filter when searching for the file's resulting chunks. The link value will not affect embedding creation.
Metadata is a JSON object which can be used to filter chunks. This is useful for when you want to filter chunks by arbitrary metadata. Unlike with tag filtering, there is a performance hit for filtering on metadata. Will be passed down to the file's chunks.
We plan to deprecate pdf2md in favor of chunkr.ai. This is a legacy option for using a vision LLM to convert a given file into markdown and then ingest it.
Parameter to use pdf2md_ocr. If true, the file will be converted to markdown using gpt-4o. Default is false.
Split headings is an optional field which allows you to specify whether or not to split headings into separate chunks. Default is false.
Prompt to use for the gpt-4o model. Default is None.
Rebalance chunks is an optional field which allows you to specify whether or not to rebalance the chunks created from the file. If not specified, the default true is used. If true, Trieve will evenly distribute remainder splits across chunks such that 66 splits with a target_splits_per_chunk of 20 will result in 3 chunks with 22 splits each.
Split average will automatically split your file into multiple chunks and average all of the resulting vectors into a single output chunk. Default is false. Explicitly enabling this will cause each file to only produce a single chunk.
Split delimiters is an optional field which allows you to specify the delimiters to use when splitting the file before chunking the text. If not specified, the default delimiters [.!?\n] are used to split into sentences. However, you may want to use spaces or other delimiters.
Tag set is a comma separated list of tags which will be passed down to the chunks made from the file. Tags are used to filter chunks when searching. HNSW indices are created for each tag such that there is no performance loss when filtering on them.
Target splits per chunk. This is an optional field which allows you to specify the number of splits you want per chunk. If not specified, the default 20 is used. However, you may want to use a different number.
x >= 0
Time stamp should be an ISO 8601 combined date and time without timezone. The time stamp is used for time window filtering and recency-biasing search results, and will be passed down to the file's chunks.
Response
{
  "created_at": "2021-01-01 00:00:00.000",
  "dataset_id": "e3e3e3e3-e3e3-e3e3-e3e3-e3e3e3e3e3e3",
  "file_name": "file.txt",
  "id": "e3e3e3e3-e3e3-e3e3-e3e3-e3e3e3e3e3e3",
  "link": "https://trieve.ai",
  "metadata": { "key": "value" },
  "size": 1000,
  "tag_set": "tag1,tag2",
  "time_stamp": "2021-01-01 00:00:00.000",
  "updated_at": "2021-01-01 00:00:00.000"
}