
# Create or Upsert Chunk or Chunks

> Create new chunk(s). If a chunk has the same tracking_id as an existing chunk, the request will fail unless `upsert_by_tracking_id` is set to true, in which case the existing chunk is updated. Once a chunk is created, it can be searched for using the search endpoint.
> If uploading in bulk, the maximum number of chunks that can be uploaded at once is 120. The authenticated user or API key must have an admin or owner role for the specified dataset's organization.
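
As a rough illustration of the limits above, the following Python sketch splits a list of chunks into batches of at most 120 and builds one `POST /api/chunk` request per batch. It only constructs the requests and does not send them. The API key and dataset id are placeholders, the helper names `batch_chunks` and `build_create_chunk_request` are hypothetical, and passing the raw key as the `Authorization` header value is an assumption based on the `ApiKey` security scheme in the spec below.

```python
import json
import urllib.request

MAX_BULK_CHUNKS = 120  # bulk limit stated in the endpoint description


def batch_chunks(chunks, size=MAX_BULK_CHUNKS):
    """Split a list of chunk payloads into lists of at most `size` items."""
    return [chunks[i:i + size] for i in range(0, len(chunks), size)]


def build_create_chunk_request(payload, api_key, dataset_id,
                               base_url="https://api.trieve.ai"):
    """Build (but do not send) a POST /api/chunk request.

    `payload` is either one chunk object or a list of chunk objects.
    """
    return urllib.request.Request(
        url=f"{base_url}/api/chunk",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": api_key,  # assumed: raw key, per the ApiKey scheme
            "TR-Dataset": dataset_id,  # dataset id or tracking_id
        },
    )


# Hypothetical input: 250 small chunks, each with its own tracking_id.
chunks = [
    {"chunk_html": f"<p>Document {i}</p>", "tracking_id": f"doc-{i}"}
    for i in range(250)
]

requests_to_send = [
    build_create_chunk_request(batch, api_key="YOUR_API_KEY",
                               dataset_id="YOUR_DATASET_ID")
    for batch in batch_chunks(chunks)
]

print([len(batch) for batch in batch_chunks(chunks)])  # [120, 120, 10]
# To actually send one: urllib.request.urlopen(requests_to_send[0])
```

Batching on the client side keeps each request under the documented limit, so the 413 response listed in the spec should not be triggered by batch size alone.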


## OpenAPI

````yaml post /api/chunk
openapi: 3.0.3
info:
  title: Trieve API
  description: >-
    Trieve OpenAPI Specification. This document describes all of the operations
    available through the Trieve API.
  contact:
    name: Trieve Team
    url: https://trieve.ai
    email: developers@trieve.ai
  license:
    name: BSL
    url: https://github.com/devflowinc/trieve/blob/main/LICENSE.txt
  version: 0.13.0
servers:
  - url: https://api.trieve.ai
    description: Production server
  - url: http://localhost:8090
    description: Local development server
security: []
tags:
  - name: Invitation
    description: Invitation endpoint. Exists to invite users to an organization.
  - name: Auth
    description: Authentication endpoint. Serves to register and authenticate users.
  - name: User
    description: User endpoint. Enables you to modify user roles and information.
  - name: Organization
    description: >-
      Organization endpoint. Enables you to modify organization roles and
      information.
  - name: Dataset
    description: >-
      Dataset endpoint. Datasets belong to organizations and hold configuration
      information for both client and server. Datasets contain chunks and chunk
      groups.
  - name: Chunk
    description: >-
      Chunk endpoint. Think of chunks as individual searchable units of
      information. The majority of your integration will likely be with the
      Chunk endpoint.
  - name: Chunk Group
    description: >-
      Chunk groups endpoint. Think of a chunk_group as a bookmark folder within
      the dataset.
  - name: Crawl
    description: Crawl endpoint. Used to create and manage crawls for datasets.
  - name: File
    description: >-
      File endpoint. When files are uploaded, they are stored in S3 and broken
      up into chunks with text extraction from Apache Tika. You can upload files
      of pretty much any type up to 1GB in size. See chunking algorithm details
      at `docs.trieve.ai` for more information on how chunking works. Improved
      default chunking is on our roadmap.
  - name: Events
    description: >-
      Notifications endpoint. Files are uploaded asynchronously and events are
      sent to the user when the upload is complete.
  - name: Topic
    description: >-
      Topic chat endpoint. Think of topics as the storage system for gen AI chat
      memory. Gen AI messages belong to topics.
  - name: Message
    description: >-
      Message chat endpoint. Messages are units belonging to a topic in the
      context of a chat with an LLM. There are system, user, and assistant
      messages.
  - name: Stripe
    description: >-
      Stripe endpoint. Used for the managed SaaS version of this app. Eventually
      this will become a micro-service. Reach out to the team using contact info
      found at `docs.trieve.ai` for more information.
  - name: Health
    description: Health check endpoint. Used to check if the server is up and running.
  - name: Metrics
    description: Metrics endpoint. Used to get information for monitoring.
  - name: Analytics
    description: Analytics endpoint. Used to get information for search and RAG analytics.
  - name: Experiment
    description: Experiment endpoint. Used to create and manage experiments.
paths:
  /api/chunk:
    post:
      tags:
        - Chunk
      summary: Create or Upsert Chunk or Chunks
      description: >-
        Create new chunk(s). If a chunk has the same tracking_id as an
        existing chunk, the request will fail unless `upsert_by_tracking_id`
        is set to true, in which case the existing chunk is updated. Once a
        chunk is created, it can be searched for using the search endpoint.

        If uploading in bulk, the maximum number of chunks that can be
        uploaded at once is 120. The authenticated user or API key must have
        an admin or owner role for the specified dataset's organization.
      operationId: create_chunk
      parameters:
        - name: TR-Dataset
          in: header
          description: >-
            The dataset id or tracking_id to use for the request. We assume you
            intend to use an id if the value is a valid uuid.
          required: true
          schema:
            type: string
            format: uuid
      requestBody:
        description: JSON request payload to create a new chunk or a batch of chunks
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/CreateChunkReqPayloadEnum'
        required: true
      responses:
        '200':
          description: JSON response payload containing the created chunk
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ReturnQueuedChunk'
        '400':
          description: Error typically due to deserialization issues
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponseBody'
        '413':
          description: Error when more than 120 chunks are provided in bulk
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponseBody'
        '426':
          description: Error when upgrade is needed to process more chunks
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponseBody'
      security:
        - ApiKey:
            - admin
components:
  schemas:
    CreateChunkReqPayloadEnum:
      oneOf:
        - $ref: '#/components/schemas/CreateSingleChunkReqPayload'
        - $ref: '#/components/schemas/CreateBatchChunkReqPayload'
    ReturnQueuedChunk:
      oneOf:
        - $ref: '#/components/schemas/SingleQueuedChunkResponse'
        - $ref: '#/components/schemas/BatchQueuedChunkResponse'
    ErrorResponseBody:
      type: object
      required:
        - message
      properties:
        message:
          type: string
      example:
        message: Bad Request
    CreateSingleChunkReqPayload:
      $ref: '#/components/schemas/ChunkReqPayload'
    CreateBatchChunkReqPayload:
      type: array
      items:
        $ref: '#/components/schemas/ChunkReqPayload'
      example:
        - chunk_html: <p>Some HTML content</p>
          group_ids:
            - d290f1ee-6c54-4b01-90e6-d701748f0851
          group_tracking_ids:
            - group_tracking_id
          image_urls:
            - https://example.com/red
            - https://example.com/blue
          link: https://example.com
          location:
            lat: -34
            lon: 151
          metadata:
            key1: value1
            key2: value2
          tag_set:
            - tag1
            - tag2
          time_stamp: '2021-01-01 00:00:00.000'
          tracking_id: tracking_id
          upsert_by_tracking_id: true
        - chunk_html: <p>Some more HTML content</p>
          group_ids:
            - d290f1ee-6c54-4b01-90e6-d701748f0851
          group_tracking_ids:
            - group_tracking_id
          image_urls: []
          link: https://explain.com
          location:
            lat: -34
            lon: 151
          metadata:
            key1: value1
            key2: value2
          tag_set:
            - tag3
            - tag4
          time_stamp: '2021-01-01 00:00:00.000'
          tracking_id: tracking_id
          upsert_by_tracking_id: true
          weight: 0.5
    SingleQueuedChunkResponse:
      type: object
      required:
        - chunk_metadata
      properties:
        chunk_metadata:
          $ref: '#/components/schemas/ChunkMetadata'
      example:
        chunk_metadata:
          content: Some content
          link: https://example.com
          metadata:
            key1: value1
            key2: value2
          tag_set:
            - tag1
            - tag2
          time_stamp: '2021-01-01 00:00:00.000'
          tracking_id: tracking_id
          weight: 0.5
        pos_in_queue: 1
    BatchQueuedChunkResponse:
      type: object
      title: batch
      required:
        - chunk_metadata
      properties:
        chunk_metadata:
          type: array
          items:
            $ref: '#/components/schemas/ChunkMetadata'
      example:
        chunk_metadata:
          - content: Some content
            file_id: d290f1ee-6c54-4b01-90e6-d701748f0851
            link: https://example.com
            metadata:
              key1: value1
              key2: value2
            tag_set:
              - tag1
              - tag2
            time_stamp: '2021-01-01 00:00:00.000'
            tracking_id: tracking_id
            weight: 0.5
          - content: Some content
            file_id: d290f1ee-6c54-4b01-90e6-d701748f0851
            link: https://example.com
            metadata:
              key1: value1
              key2: value2
            tag_set:
              - tag1
              - tag2
            time_stamp: '2021-01-01 00:00:00.000'
            tracking_id: tracking_id
            weight: 0.5
        pos_in_queue: 2
    ChunkReqPayload:
      type: object
      title: single
      description: Request payload for creating a new chunk
      properties:
        chunk_html:
          type: string
          description: >-
            HTML content of the chunk. This can also be plaintext. The innerText
            of the HTML will be used to create the embedding vector. HTML is
            accepted for convenience, as some users have applications where
            users submit HTML content.
          nullable: true
        convert_html_to_text:
          type: boolean
          description: >-
            Convert HTML to raw text before processing to avoid adding noise to
            the vector embeddings. By default this is true. If you are using
            HTML content that you want to be included in the vector embeddings,
            set this to false.
          nullable: true
        fulltext_boost:
          allOf:
            - $ref: '#/components/schemas/FullTextBoost'
          nullable: true
        fulltext_content:
          type: string
          description: >-
            If fulltext_content is present, it will be used for creating the
            fulltext and bm25 sparse vectors instead of the innerText of
            `chunk_html`. `chunk_html` will still be the only thing stored and
            used for semantic functionality unless the corresponding
            `semantic_content` field is defined. `chunk_html` must still be
            present for the chunk to be created properly.
          nullable: true
        group_ids:
          type: array
          items:
            type: string
            format: uuid
          description: >-
            Group ids are the Trieve-generated ids of the groups that the chunk
            should be placed into. This is useful for when you want to create a
            chunk and add it to a group or multiple groups in one request.
            Groups with these Trieve-generated ids must be created first; they
            cannot be created implicitly through this route.
          nullable: true
        group_tracking_ids:
          type: array
          items:
            type: string
          description: >-
            Group tracking_ids are the user-assigned tracking_ids of the groups
            that the chunk should be placed into. This is useful for when you
            want to create a chunk and add it to a group or multiple groups in
            one request. If a group with the tracking_id does not exist, it will
            be created.
          nullable: true
        high_priority:
          type: boolean
          description: >-
            High Priority allows you to place this chunk into a priority queue
            with its own ingestion workers. Can only be used by users with a
            Custom Pro plan.
          nullable: true
        image_urls:
          type: array
          items:
            type: string
          description: >-
            Image urls are a list of urls to images that are associated with the
            chunk. This is useful for when you want to associate images with a
            chunk.
          nullable: true
        link:
          type: string
          description: >-
            Link to the chunk. This can also be any string. Frequently, this is
            a link to the source of the chunk. The link value will not affect
            the embedding creation.
          nullable: true
        location:
          allOf:
            - $ref: '#/components/schemas/GeoInfo'
          nullable: true
        metadata:
          description: >-
            Metadata is a JSON object which can be used to filter chunks. This
            is useful for when you want to filter chunks by arbitrary metadata.
            Unlike with tag filtering, there is a performance hit for filtering
            on metadata.
          nullable: true
        num_value:
          type: number
          format: double
          description: >-
            Num value is an arbitrary numerical value that can be used to filter
            chunks. This is useful for when you want to filter chunks by
            numerical value. There is no performance hit for filtering on
            num_value.
          nullable: true
        semantic_boost:
          allOf:
            - $ref: '#/components/schemas/SemanticBoost'
          nullable: true
        semantic_content:
          type: string
          description: >-
            If semantic_content is present, it will be used for creating
            semantic embeddings instead of the innerText of `chunk_html`.
            `chunk_html` will still be the only thing stored and used for
            fulltext functionality unless the corresponding `fulltext_content`
            field is defined. `chunk_html` must still be present for the chunk
            to be created properly.
          nullable: true
        split_avg:
          type: boolean
          description: >-
            Split avg is a boolean which tells the server to split the text in
            the chunk_html into smaller chunks and average their resulting
            vectors. This is useful for when you want to create a chunk from a
            large piece of text and want to split it into smaller chunks to
            create a fuzzier average dense vector. The sparse vector will be
            generated normally with no averaging. By default this is false.
          nullable: true
        tag_set:
          type: array
          items:
            type: string
          description: >-
            Tag set is a list of tags. This can be used to filter chunks by tag.
            Unlike with metadata filtering, HNSW indices will exist for each tag
            such that there is not a performance hit for filtering on them.
          nullable: true
        time_stamp:
          type: string
          description: >-
            Time_stamp should be an ISO 8601 combined date and time without
            timezone. It is used for time window filtering and recency-biasing
            search results.
          nullable: true
        tracking_id:
          type: string
          description: >-
            Tracking_id is a string which can be used to identify a chunk. This
            is useful for when you are coordinating with an external system and
            want to use the tracking_id to identify the chunk.
          nullable: true
        upsert_by_tracking_id:
          type: boolean
          description: >-
            Upsert when a chunk with the same tracking_id exists. By default
            this is false, and chunks will be ignored if another with the same
            tracking_id exists. If this is true, the chunk will be updated if a
            chunk with the same tracking_id exists.
          nullable: true
        weight:
          type: number
          format: double
          description: >-
            Weight is a float which can be used to bias search results. This is
            useful for when you want to bias search results for a chunk. The
            magnitude only matters relative to other chunks in the chunk's
            dataset.
          nullable: true
      example:
        chunk_html: <p>Some HTML content</p>
        fulltext_boost:
          boost_factor: 5
          phrase: foo
        group_ids:
          - d290f1ee-6c54-4b01-90e6-d701748f0851
        group_tracking_ids:
          - group_tracking_id
        image_urls:
          - https://example.com/red
          - https://example.com/blue
        link: https://example.com
        location:
          lat: -34
          lon: 151
        metadata:
          key1: value1
          key2: value2
        semantic_boost:
          distance_factor: 0.5
          phrase: flagship
        tag_set:
          - tag1
          - tag2
        time_stamp: '2021-01-01 00:00:00.000'
        tracking_id: tracking_id
    ChunkMetadata:
      type: object
      title: V2
      required:
        - id
        - created_at
        - updated_at
        - dataset_id
        - weight
      properties:
        chunk_html:
          type: string
          description: >-
            HTML content of the chunk, can also be an arbitrary string which is
            not HTML
          nullable: true
        created_at:
          type: string
          format: date-time
          description: Timestamp of the creation of the chunk
        dataset_id:
          type: string
          format: uuid
          description: ID of the dataset which the chunk belongs to
        id:
          type: string
          format: uuid
          description: >-
            Unique identifier of the chunk, auto-generated uuid created by
            Trieve
        image_urls:
          type: array
          items:
            type: string
            nullable: true
          description: >-
            Image URLs of the chunk, can be any list of strings. Used for image
            search and RAG.
          nullable: true
        link:
          type: string
          description: Link to the chunk, should be a URL
          nullable: true
        location:
          allOf:
            - $ref: '#/components/schemas/GeoInfo'
          nullable: true
        metadata:
          description: Metadata of the chunk, can be any JSON object
          nullable: true
        num_value:
          type: number
          format: double
          description: >-
            Numeric value of the chunk, can be any float. Can represent the most
            relevant numeric value of the chunk, such as a price, quantity in
            stock, rating, etc.
          nullable: true
        tag_set:
          type: array
          items:
            type: string
            nullable: true
          description: >-
            Tag set of the chunk, can be any list of strings. Used for
            tag-filtered searches.
          nullable: true
        time_stamp:
          type: string
          format: date-time
          description: Timestamp of the chunk, can be any timestamp. Specified by the user.
          nullable: true
        tracking_id:
          type: string
          description: >-
            Tracking ID of the chunk, can be any string, determined by the user.
            Tracking IDs are unique identifiers for chunks within a dataset.
            They are designed to match the unique identifier of the chunk in the
            user's system.
          nullable: true
        updated_at:
          type: string
          format: date-time
          description: Timestamp of the last update of the chunk
        weight:
          type: number
          format: double
          description: >-
            Weight of the chunk, can be any float. Used as a multiplier on a
            chunk's relevance score for ranking purposes.
      example:
        chunk_html: <p>Hello, world!</p>
        created_at: '2021-01-01 00:00:00.000'
        dataset_id: e3e3e3e3-e3e3-e3e3-e3e3-e3e3e3e3e3e3
        id: e3e3e3e3-e3e3-e3e3-e3e3-e3e3e3e3e3e3
        link: https://trieve.ai
        metadata:
          key: value
        tag_set:
          - tag1
          - tag2
        time_stamp: '2021-01-01 00:00:00.000'
        tracking_id: e3e3e3e3-e3e3-e3e3-e3e3-e3e3e3e3e3e3
        updated_at: '2021-01-01 00:00:00.000'
        weight: 0.5
    FullTextBoost:
      type: object
      description: >-
        Boost the presence of certain tokens for fulltext (SPLADE) and keyword
        (BM25) search. For example, boost title phrases to prioritize title
        matches, or make sure that the listing for AirBNB itself ranks higher
        than companies who make software for AirBNB hosts by boosting the
        in-document-frequency of the AirBNB token (AKA word) for its official
        listing. Conceptually, it multiplies the in-document-importance (the
        second value in each tuple of the SPLADE or BM25 sparse vector of the
        chunk_html innerText) by the boost factor for all tokens present in the
        boost phrase, like so: (token, in-document-importance) -> (token,
        in-document-importance*boost_factor).
      required:
        - phrase
        - boost_factor
      properties:
        boost_factor:
          type: number
          format: double
          description: >-
            Amount to multiplicatively increase the frequency of the tokens in
            the phrase by
        phrase:
          type: string
          description: The phrase to boost in the fulltext document frequency index
    GeoInfo:
      type: object
      description: Location that you want to use as the center of the search.
      required:
        - lat
        - lon
      properties:
        lat:
          $ref: '#/components/schemas/GeoTypes'
        lon:
          $ref: '#/components/schemas/GeoTypes'
    SemanticBoost:
      type: object
      description: >-
        Semantic boosting moves the dense vector of the chunk in the direction
        of the distance phrase for semantic search. For example, you can force
        a cluster by moving every chunk for a PDF closer to its title, or push
        a chunk with a chunk_html of "iphone" 25% closer to the term "flagship"
        by using the distance phrase "flagship" and a distance factor of 0.25.
        Conceptually, it draws a line (Euclidean/L2 distance) between the
        vector for the innerText of the chunk_html and the vector for the
        distance phrase, then moves the chunk_html vector
        distance_factor*L2Distance closer to or away from the distance phrase
        point along the line between the two points.
      required:
        - phrase
        - distance_factor
      properties:
        distance_factor:
          type: number
          format: float
          description: >-
            Arbitrary float (positive or negative) specifying the multiplicative
            factor to apply before summing the phrase vector with the chunk_html
            embedding vector
        phrase:
          type: string
          description: >-
            Terms to embed in order to create the vector which is combined in a
            weighted sum with the chunk_html embedding vector
    GeoTypes:
      oneOf:
        - type: integer
          format: int64
        - type: number
          format: double
  securitySchemes:
    ApiKey:
      type: apiKey
      in: header
      name: Authorization

````
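
The `FullTextBoost` and `SemanticBoost` descriptions in the spec above are conceptual, so a small numeric sketch may help. The Python below is only one reading of those descriptions, not Trieve's implementation: the function names are hypothetical, the two-dimensional vectors and word tokens are toy stand-ins for real embeddings and tokenizer output, and the semantic formula assumes the chunk vector moves `distance_factor` of the way along the line toward the phrase vector.

```python
def semantic_boost(chunk_vec, phrase_vec, distance_factor):
    """Move chunk_vec along the line toward phrase_vec.

    Assumed reading of the description: the new point is
    chunk + distance_factor * (phrase - chunk), so 0.25 closes 25% of the
    L2 distance and a negative factor moves the chunk away.
    """
    return [c + distance_factor * (p - c) for c, p in zip(chunk_vec, phrase_vec)]


def fulltext_boost(sparse_vec, phrase_tokens, boost_factor):
    """Multiply the importance of every boosted token found in sparse_vec.

    sparse_vec is a list of (token, in_document_importance) tuples.
    """
    boosted = set(phrase_tokens)
    return [
        (token, weight * boost_factor if token in boosted else weight)
        for token, weight in sparse_vec
    ]


# Toy dense vectors: the chunk ends up 25% closer to the phrase.
print(semantic_boost([1.0, 0.0], [0.0, 1.0], 0.25))
# [0.75, 0.25]

# Toy sparse vector: only the "airbnb" token is scaled by the boost factor.
print(fulltext_boost([("airbnb", 2.0), ("host", 1.0)], ["airbnb"], 5))
# [('airbnb', 10.0), ('host', 1.0)]
```

Note that the `distance_factor` property description speaks of summing a weighted phrase vector with the chunk vector, which could also be read as `chunk + distance_factor * phrase`; the sketch follows the line-interpolation wording of the schema description instead.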