POST /api/chunk

Authorizations

Authorization
string
header
required

Headers

TR-Dataset
string
required

The dataset id or tracking_id to use for the request. We assume you intend to use an id if the value is a valid uuid.

Body

application/json

Request payload for creating a new chunk

chunk_html
string | null

HTML content of the chunk. This can also be plaintext. The innerText of the HTML will be used to create the embedding vector. HTML is supported as a convenience, since some users have applications where users submit HTML content.
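A minimal request sketch in TypeScript, assuming the https://api.trieve.ai base URL and placeholder credentials; it sends only chunk_html and lets every other field fall back to its default:

```typescript
// Minimal create-chunk request. The API key and dataset id below are placeholders.
const response = await fetch("https://api.trieve.ai/api/chunk", {
  method: "POST",
  headers: {
    Authorization: "<api-key>",
    "TR-Dataset": "<dataset-id-or-tracking-id>",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    chunk_html: "<p>Trieve is infrastructure for search and RAG.</p>",
  }),
});
```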

convert_html_to_text
boolean | null

Convert HTML to raw text before processing to avoid adding noise to the vector embeddings. By default this is true. If you are using HTML content that you want to be included in the vector embeddings, set this to false.

fulltext_boost
object

Boost the presence of certain tokens for fulltext (SPLADE) and keyword (BM25) search. For example, boost title phrases to prioritize title matches, or make sure the listing for AirBNB itself ranks higher than companies who make software for AirBNB hosts by boosting the in-document frequency of the AirBNB token (i.e. word) for its official listing. Conceptually, for every token present in the boost phrase, the in-document-importance (the second value in the tuples of the SPLADE or BM25 sparse vector of the chunk_html innerText) is multiplied by the boost factor like so: (token, in-document-importance) -> (token, in-document-importance*boost_factor).
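As a conceptual sketch only (not the server's actual implementation), the transform above can be pictured like this, with a hypothetical boostTokens helper operating on (token, importance) pairs:

```typescript
// Conceptual sketch: multiply the in-document importance of every token that
// appears in the boost phrase by the boost factor; all other tokens are untouched.
type SparseEntry = [token: string, importance: number];

function boostTokens(
  sparse: SparseEntry[],
  boostPhrase: string,
  boostFactor: number,
): SparseEntry[] {
  const boosted = new Set(boostPhrase.toLowerCase().split(/\s+/));
  return sparse.map(([token, importance]): SparseEntry =>
    boosted.has(token.toLowerCase())
      ? [token, importance * boostFactor]
      : [token, importance],
  );
}
```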

group_ids
string[] | null

Group ids are the Trieve generated ids of the groups that the chunk should be placed into. This is useful for when you want to create a chunk and add it to a group or multiple groups in one request. Groups with these Trieve generated ids must be created first; they cannot be created arbitrarily through this route.

group_tracking_ids
string[] | null

Group tracking_ids are the user-assigned tracking_ids of the groups that the chunk should be placed into. This is useful for when you want to create a chunk and add it to a group or multiple groups in one request. If a group with the tracking_id does not exist, it will be created.
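A payload sketch combining both fields (all ids below are placeholders): group_ids must point at groups that already exist, while missing group_tracking_ids are created automatically.

```typescript
const payload = {
  chunk_html: "<p>House rules and refund policy for hosts.</p>",
  // Must already exist (Trieve-generated id, placeholder uuid):
  group_ids: ["d290f1ee-6c54-4b01-90e6-d701748f0851"],
  // Created on the fly if they do not exist yet (placeholder tracking_ids):
  group_tracking_ids: ["policies", "host-docs"],
};
```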

image_urls
string[] | null

Image urls are a list of urls to images that are associated with the chunk. This is useful for when you want to associate images with a chunk.

link
string | null

Link to the chunk. This can also be any string. Frequently, this is a link to the source of the chunk. The link value will not affect the embedding creation.

location
object

Location that you want to use as the center of the search.

metadata
any | null

Metadata is a JSON object which can be used to filter chunks. This is useful for when you want to filter chunks by arbitrary metadata. Unlike with tag filtering, there is a performance hit for filtering on metadata.

num_value
number | null

Num value is an arbitrary numerical value that can be used to filter chunks. This is useful for when you want to filter chunks by numerical value. There is no performance hit for filtering on num_value.
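A payload sketch pairing an arbitrary metadata object with num_value; the keys and numbers are made up for illustration:

```typescript
const payload = {
  chunk_html: "<p>Standard room, queen bed, city view.</p>",
  // Arbitrary JSON; filtering on it works but carries a performance hit.
  metadata: { category: "room", refundable: true },
  // Single numeric value (e.g. nightly price); filtering on it has no performance hit.
  num_value: 129.99,
};
```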

semantic_boost
object

Semantic boosting moves the dense vector of the chunk in the direction of the distance phrase for semantic search. For example, you can force a cluster by moving every chunk for a PDF closer to its title, or push a chunk with a chunk_html of "iphone" 25% closer to the term "flagship" by using the distance phrase "flagship" and a distance factor of 0.25. Conceptually, a line (euclidean/L2 distance) is drawn between the vector for the innerText of the chunk_html and the distance_phrase vector, and the chunk_html vector is then moved distance_factor*L2Distance closer to or farther from the distance_phrase point along that line.
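A conceptual sketch of that move (not the server's implementation): the chunk's dense vector is shifted distance_factor of the way along the straight line toward the phrase's vector.

```typescript
// Conceptual sketch: move chunkVec a fraction `distanceFactor` of the way along
// the straight (L2) line toward phraseVec. Vectors are assumed to share a dimension.
function applySemanticBoost(
  chunkVec: number[],
  phraseVec: number[],
  distanceFactor: number,
): number[] {
  return chunkVec.map((value, i) => value + distanceFactor * (phraseVec[i] - value));
}
```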

semantic_content
string | null

If semantic_content is present, it will be used for creating semantic embeddings instead of the innerText of chunk_html. chunk_html will still be the only thing stored and will always be used for fulltext functionality.

split_avg
boolean | null

Split avg is a boolean which tells the server to split the text in the chunk_html into smaller chunks and average their resulting vectors. This is useful for when you want to create a chunk from a large piece of text and want to split it into smaller chunks to create a more fuzzy average dense vector. The sparse vector will be generated normally with no averaging. By default this is false.
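The averaging step can be pictured with a sketch like the one below: each split piece is embedded separately (embedding itself is out of scope here), and the resulting dense vectors are averaged component-wise; the sparse vector is not averaged.

```typescript
// Conceptual sketch: component-wise average of the dense vectors produced for
// each split piece of chunk_html.
function averageVectors(vectors: number[][]): number[] {
  const dims = vectors[0].length;
  const sum = new Array<number>(dims).fill(0);
  for (const vector of vectors) {
    for (let i = 0; i < dims; i++) sum[i] += vector[i];
  }
  return sum.map((total) => total / vectors.length);
}
```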

tag_set
string[] | null

Tag set is a list of tags. This can be used to filter chunks by tag. Unlike with metadata filtering, HNSW indices will exist for each tag such that there is not a performance hit for filtering on them.

time_stamp
string | null

Time_stamp should be an ISO 8601 combined date and time without timezone. It is used for time window filtering and recency-biasing search results.
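For example, a compliant value can be produced by trimming the timezone suffix and milliseconds from a standard ISO string (a sketch; the result is in UTC):

```typescript
// "2024-05-01T13:45:30" — ISO 8601 combined date and time, no timezone designator.
const timeStamp = new Date().toISOString().slice(0, 19);
```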

tracking_id
string | null

Tracking_id is a string which can be used to identify a chunk. This is useful for when you are coordinating with an external system and want to use the tracking_id to identify the chunk.

upsert_by_tracking_id
boolean | null

Upsert when a chunk with the same tracking_id exists. By default this is false, and chunks will be ignored if another with the same tracking_id exists. If this is true, the chunk will be updated if a chunk with the same tracking_id exists.
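A payload sketch for upserting by an external identifier (values are placeholders): if a chunk with this tracking_id already exists, it is updated instead of being ignored.

```typescript
const payload = {
  chunk_html: "<p>Updated description for product 42.</p>",
  tracking_id: "product-42", // identifier from your external system (placeholder)
  upsert_by_tracking_id: true, // update the existing chunk rather than ignoring the request
};
```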

weight
number | null

Weight is a float which can be used to bias search results. This is useful for when you want to bias search results for a chunk. The magnitude only matters relative to other chunks in the chunk's dataset.

Response

200 - application/json

chunk_metadata
object
required

pos_in_queue
integer
required

The current position of the last accessed item in the queue.
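A sketch of handling the 200 response, reusing the response value from the request example near the top of this page:

```typescript
if (response.ok) {
  // chunk_metadata and pos_in_queue are the two required fields of the 200 body.
  const { chunk_metadata, pos_in_queue } = await response.json();
  console.log("queued at position", pos_in_queue, chunk_metadata);
} else {
  console.error("create chunk failed:", response.status, await response.text());
}
```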