# Upload File

> Upload a file to the S3 bucket attached to your dataset. You can choose between a naive chunking strategy, where the text is extracted with Apache Tika and split into segments with a target number of segments per chunk, or a vision LLM that converts the file to markdown and creates chunks per page. The file must be encoded as base64url. The authenticated user must be an admin or owner of the dataset's organization to upload a file.
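
The request described above can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the helper name `build_upload_request`, the placeholder dataset id, and the placeholder API key are all assumptions made for the example. The endpoint path, the `TR-Dataset` and `Authorization` headers, and the `base64_file` and `file_name` fields come from the specification below. Whether the server expects `=` padding on the base64url string is not stated here, so the sketch keeps Python's default padded output.

```python
import base64
import json
import urllib.request


def build_upload_request(file_bytes, file_name, dataset_id, api_key,
                         base_url="https://api.trieve.ai", **optional_fields):
    """Build (but do not send) a POST /api/file request. Hypothetical helper."""
    payload = {
        # base64url uses '-' and '_' in place of '+' and '/'.
        "base64_file": base64.urlsafe_b64encode(file_bytes).decode("ascii"),
        "file_name": file_name,
    }
    # Optional UploadFileReqPayload fields, e.g. tag_set or create_chunks.
    payload.update(optional_fields)
    return urllib.request.Request(
        url=f"{base_url}/api/file",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "TR-Dataset": dataset_id,
            "Authorization": api_key,
        },
        method="POST",
    )


# Placeholder values; replace with a real dataset id and API key.
request = build_upload_request(
    b"\xfb\xff example bytes",
    "example.txt",
    dataset_id="00000000-0000-0000-0000-000000000000",
    api_key="tr-placeholder",
    tag_set=["tag1", "tag2"],
)
body = json.loads(request.data)
print(body["file_name"], body["base64_file"])
# To actually send it: urllib.request.urlopen(request)
```

The request is built separately from sending so the payload can be inspected first; the encoded string should contain no `+` or `/` characters.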


## OpenAPI

````yaml post /api/file
openapi: 3.0.3
info:
  title: Trieve API
  description: >-
    Trieve OpenAPI Specification. This document describes all of the operations
    available through the Trieve API.
  contact:
    name: Trieve Team
    url: https://trieve.ai
    email: developers@trieve.ai
  license:
    name: BSL
    url: https://github.com/devflowinc/trieve/blob/main/LICENSE.txt
  version: 0.13.0
servers:
  - url: https://api.trieve.ai
    description: Production server
  - url: http://localhost:8090
    description: Local development server
security: []
tags:
  - name: Invitation
    description: Invitation endpoint. Exists to invite users to an organization.
  - name: Auth
    description: Authentication endpoint. Serves to register and authenticate users.
  - name: User
    description: User endpoint. Enables you to modify user roles and information.
  - name: Organization
    description: >-
      Organization endpoint. Enables you to modify organization roles and
      information.
  - name: Dataset
    description: >-
      Dataset endpoint. Datasets belong to organizations and hold configuration
      information for both client and server. Datasets contain chunks and chunk
      groups.
  - name: Chunk
    description: >-
      Chunk endpoint. Think of chunks as individual searchable units of
      information. The majority of your integration will likely be with the
      Chunk endpoint.
  - name: Chunk Group
    description: >-
      Chunk groups endpoint. Think of a chunk_group as a bookmark folder within
      the dataset.
  - name: Crawl
    description: Crawl endpoint. Used to create and manage crawls for datasets.
  - name: File
    description: >-
      File endpoint. When files are uploaded, they are stored in S3 and broken
      up into chunks with text extraction from Apache Tika. You can upload files
      of nearly any type up to 1GB in size. See chunking algorithm details
      at `docs.trieve.ai` for more information on how chunking works. Improved
      default chunking is on our roadmap.
  - name: Events
    description: >-
      Notifications endpoint. Files are uploaded asynchronously and events are
      sent to the user when the upload is complete.
  - name: Topic
    description: >-
      Topic chat endpoint. Think of topics as the storage system for gen-ai chat
      memory. Gen AI messages belong to topics.
  - name: Message
    description: >-
      Message chat endpoint. Messages are units belonging to a topic in the
      context of a chat with a LLM. There are system, user, and assistant
      messages.
  - name: Stripe
    description: >-
      Stripe endpoint. Used for the managed SaaS version of this app. Eventually
      this will become a micro-service. Reach out to the team using contact info
      found at `docs.trieve.ai` for more information.
  - name: Health
    description: Health check endpoint. Used to check if the server is up and running.
  - name: Metrics
    description: Metrics endpoint. Used to get information for monitoring
  - name: Analytics
    description: Analytics endpoint. Used to get information for search and RAG analytics
  - name: Experiment
    description: Experiment endpoint. Used to create and manage experiments
paths:
  /api/file:
    post:
      tags:
        - File
      summary: Upload File
      description: >-
        Upload a file to the S3 bucket attached to your dataset. You can choose
        between a naive chunking strategy, where the text is extracted with
        Apache Tika and split into segments with a target number of segments per
        chunk, or a vision LLM that converts the file to markdown and creates
        chunks per page. The file must be encoded as base64url. The
        authenticated user must be an admin or owner of the dataset's
        organization to upload a file.
      operationId: upload_file_handler
      parameters:
        - name: TR-Dataset
          in: header
          description: >-
            The dataset id or tracking_id to use for the request. We assume you
            intend to use an id if the value is a valid uuid.
          required: true
          schema:
            type: string
            format: uuid
      requestBody:
        description: JSON request payload to upload a file
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/UploadFileReqPayload'
        required: true
      responses:
        '200':
          description: Confirmation that the file is uploading
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/UploadFileResponseBody'
        '400':
          description: Service error relating to uploading the file
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponseBody'
      security:
        - ApiKey:
            - admin
components:
  schemas:
    UploadFileReqPayload:
      type: object
      required:
        - base64_file
        - file_name
      properties:
        base64_file:
          type: string
          description: Base64url encoded file.
        chunkr_create_task_req_payload:
          allOf:
            - $ref: '#/components/schemas/CreateFormWithoutFile'
          nullable: true
        create_chunks:
          type: boolean
          description: >-
            Create chunks is a boolean which determines whether or not to create
            chunks from the file. If false, you can manually chunk the file and
            send the chunks to the create_chunk endpoint with the file_id to
            associate chunks with the file. Meant mostly for advanced users.
          nullable: true
        description:
          type: string
          description: >-
            Description is an optional convenience field so you do not have to
            remember what the file contains or is about. It will be included on
            the group resulting from the file, which will hold its chunks.
          nullable: true
        file_name:
          type: string
          description: Name of the file being uploaded, including the extension.
        group_tracking_id:
          type: string
          description: >-
            Group tracking id is an optional field which allows you to specify
            the tracking id of the group that is created from the file. Chunks
            created will be created with the tracking id of
            `group_tracking_id|<index of chunk>`
          nullable: true
        link:
          type: string
          description: >-
            Link to the file. This can also be any string. This can be used to
            filter when searching for the file's resulting chunks. The link
            value will not affect embedding creation.
          nullable: true
        metadata:
          description: >-
            Metadata is a JSON object which can be used to filter chunks. This
            is useful for when you want to filter chunks by arbitrary metadata.
            Unlike with tag filtering, there is a performance hit for filtering
            on metadata. Will be passed down to the file's chunks.
          nullable: true
        pdf2md_options:
          allOf:
            - $ref: '#/components/schemas/Pdf2MdOptions'
          nullable: true
        rebalance_chunks:
          type: boolean
          description: >-
            Rebalance chunks is an optional field which allows you to specify
            whether or not to rebalance the chunks created from the file. If not
            specified, the default true is used. If true, Trieve will evenly
            distribute remainder splits across chunks such that 66 splits with a
            `target_splits_per_chunk` of 20 will result in 3 chunks with 22
            splits each.
          nullable: true
        split_avg:
          type: boolean
          description: >-
            Split average will automatically split your file into multiple
            chunks and average all of the resulting vectors into a single output
            chunk. Default is false. Explicitly enabling this will cause each
            file to only produce a single chunk.
          nullable: true
        split_delimiters:
          type: array
          items:
            type: string
          description: >-
            Split delimiters is an optional field which allows you to specify
            the delimiters to use when splitting the file before chunking the
            text. If not specified, the default [.!?\n] are used to split into
            sentences. However, you may want to use spaces or other delimiters.
          nullable: true
        tag_set:
          type: array
          items:
            type: string
          description: >-
            Tag set is a list of tags which will be passed down
            to the chunks made from the file. Tags are used to filter chunks
            when searching. HNSW indices are created for each tag such that
            there is no performance loss when filtering on them.
          nullable: true
        target_splits_per_chunk:
          type: integer
          description: >-
            Target splits per chunk. This is an optional field which allows you
            to specify the number of splits you want per chunk. If not
            specified, the default 20 is used. However, you may want to use a
            different number.
          nullable: true
          minimum: 0
        time_stamp:
          type: string
          description: >-
            Time stamp should be an ISO 8601 combined date and time without
            timezone. Time_stamp is used for time window filtering and
            recency-biasing search results. Will be passed down to the file's
            chunks.
          nullable: true
        webhook_url:
          type: string
          description: >-
            Optional webhook URL to receive notifications for each page
            processed.
          nullable: true
      example:
        base64_file: <base64_encoded_file>
        create_chunks: true
        description: This is an example file
        file_name: example.pdf
        link: https://example.com
        metadata:
          key1: value1
          key2: value2
        split_delimiters:
          - ','
          - .
          - |+

        tag_set:
          - tag1
          - tag2
        target_splits_per_chunk: 20
        time_stamp: '2021-01-01 00:00:00.000Z'
        pdf2md_options:
          use_pdf2md_ocr: false
    UploadFileResponseBody:
      type: object
      required:
        - file_metadata
      properties:
        file_metadata:
          $ref: '#/components/schemas/File'
    ErrorResponseBody:
      type: object
      required:
        - message
      properties:
        message:
          type: string
      example:
        message: Bad Request
    CreateFormWithoutFile:
      type: object
      description: >-
        Will use [chunkr.ai](https://chunkr.ai) to process the file when this
        object is defined. See
        [docs.chunkr.ai/api-references/task/create-task](https://docs.chunkr.ai/api-references/task/create-task)
        for detailed information about what each field on this request payload
        does.
      properties:
        chunk_processing:
          allOf:
            - $ref: '#/components/schemas/ChunkProcessing'
          nullable: true
        error_handling:
          allOf:
            - $ref: '#/components/schemas/ErrorHandlingStrategy'
          nullable: true
        expires_in:
          type: integer
          format: int32
          description: >-
            The number of seconds until task is deleted.

            Expired tasks can **not** be updated, polled or accessed via web
            interface.
          nullable: true
        high_resolution:
          type: boolean
          description: >-
            Whether to use high-resolution images for cropping and
            post-processing. (Latency penalty: ~7 seconds per page)
          default: false
          nullable: true
        llm_processing:
          allOf:
            - $ref: '#/components/schemas/LlmProcessing'
          nullable: true
        ocr_strategy:
          allOf:
            - $ref: '#/components/schemas/OcrStrategy'
          default: All
          nullable: true
        pipeline:
          allOf:
            - $ref: '#/components/schemas/PipelineType'
          default: Azure
          nullable: true
        segment_processing:
          allOf:
            - $ref: '#/components/schemas/SegmentProcessing'
          nullable: true
        segmentation_strategy:
          allOf:
            - $ref: '#/components/schemas/SegmentationStrategy'
          default: LayoutAnalysis
          nullable: true
    Pdf2MdOptions:
      type: object
      description: >-
        We plan to deprecate pdf2md in favor of chunkr.ai. This is a legacy
        option for using a vision LLM to convert a given file into markdown and
        then ingest it.
      required:
        - use_pdf2md_ocr
      properties:
        split_headings:
          type: boolean
          description: >-
            Split headings is an optional field which allows you to specify
            whether or not to split headings into separate chunks. Default is
            false.
          nullable: true
        system_prompt:
          type: string
          description: Prompt to use for the gpt-4o model. Default is None.
          nullable: true
        use_pdf2md_ocr:
          type: boolean
          description: >-
            Parameter to use pdf2md_ocr. If true, the file will be converted to
            markdown using gpt-4o. Default is false.
    File:
      type: object
      required:
        - id
        - file_name
        - created_at
        - updated_at
        - dataset_id
        - size
      properties:
        created_at:
          type: string
          format: date-time
        dataset_id:
          type: string
          format: uuid
        file_name:
          type: string
        id:
          type: string
          format: uuid
        link:
          type: string
          nullable: true
        metadata:
          nullable: true
        size:
          type: integer
          format: int64
        tag_set:
          type: array
          items:
            type: string
            nullable: true
          nullable: true
        time_stamp:
          type: string
          format: date-time
          nullable: true
        updated_at:
          type: string
          format: date-time
      example:
        created_at: '2021-01-01 00:00:00.000'
        dataset_id: e3e3e3e3-e3e3-e3e3-e3e3-e3e3e3e3e3e3
        file_name: file.txt
        id: e3e3e3e3-e3e3-e3e3-e3e3-e3e3e3e3e3e3
        link: https://trieve.ai
        metadata:
          key: value
        size: 1000
        tag_set:
          - tag1
          - tag2
        time_stamp: '2021-01-01 00:00:00.000'
        updated_at: '2021-01-01 00:00:00.000'
    ChunkProcessing:
      type: object
      description: Controls the settings for the chunking and post-processing of each chunk.
      properties:
        ignore_headers_and_footers:
          type: boolean
          description: >-
            Whether to ignore headers and footers in the chunking process.

            This is recommended as headers and footers break reading order
            across pages.
          default: true
        target_length:
          type: integer
          format: int32
          description: >-
            The target number of words in each chunk. If 0, each chunk will
            contain a single segment.
          default: 512
          minimum: 0
        tokenizer:
          allOf:
            - $ref: '#/components/schemas/TokenizerType'
          default: Word
    ErrorHandlingStrategy:
      type: string
      description: >-
        Controls how errors are handled during processing:

        - `Fail`: Stops processing and fails the task when any error occurs

        - `Continue`: Attempts to continue processing despite non-critical
        errors (e.g. LLM refusals)
      enum:
        - Fail
        - Continue
    LlmProcessing:
      type: object
      description: Controls the LLM used for the task.
      properties:
        fallback_strategy:
          $ref: '#/components/schemas/FallbackStrategy'
        max_completion_tokens:
          type: integer
          format: int32
          description: The maximum number of tokens to generate.
          nullable: true
          minimum: 0
        model_id:
          type: string
          description: >-
            The ID of the model to use for the task. If not provided, the
            default model will be used.

            Please check the documentation for the model you want to use.
          nullable: true
        temperature:
          type: number
          format: float
          description: The temperature to use for the LLM.
    OcrStrategy:
      type: string
      description: >-
        Controls the Optical Character Recognition (OCR) strategy.

        - `All`: Processes all pages with OCR. (Latency penalty: ~0.5 seconds
        per page)

        - `Auto`: Selectively applies OCR only to pages with missing or
        low-quality text. When text layer is present the bounding boxes from the
        text layer are used.
      enum:
        - All
        - Auto
    PipelineType:
      type: string
      enum:
        - Azure
        - Chunkr
    SegmentProcessing:
      type: object
      description: >-
        Controls the post-processing of each segment type.


        Allows you to generate HTML and Markdown from chunkr models for each
        segment type.

        By default, the HTML and Markdown are generated manually using the
        segmentation information except for `Table`, `Formula` and `Picture`.

        You can optionally configure custom LLM prompts and models to generate
        an additional `llm` field with LLM-processed content for each segment
        type.


        The configuration of which content sources (HTML, Markdown, LLM,
        Content) of the segment

        should be included in the chunk's `embed` field and counted towards the
        chunk length can be configured through the `embed_sources` setting.
      properties:
        Caption:
          allOf:
            - $ref: '#/components/schemas/AutoGenerationConfig'
          nullable: true
        Footnote:
          allOf:
            - $ref: '#/components/schemas/AutoGenerationConfig'
          nullable: true
        Formula:
          allOf:
            - $ref: '#/components/schemas/LlmGenerationConfig'
          nullable: true
        ListItem:
          allOf:
            - $ref: '#/components/schemas/AutoGenerationConfig'
          nullable: true
        Page:
          allOf:
            - $ref: '#/components/schemas/LlmGenerationConfig'
          nullable: true
        PageFooter:
          allOf:
            - $ref: '#/components/schemas/AutoGenerationConfig'
          nullable: true
        PageHeader:
          allOf:
            - $ref: '#/components/schemas/AutoGenerationConfig'
          nullable: true
        Picture:
          allOf:
            - $ref: '#/components/schemas/PictureGenerationConfig'
          nullable: true
        SectionHeader:
          allOf:
            - $ref: '#/components/schemas/AutoGenerationConfig'
          nullable: true
        Table:
          allOf:
            - $ref: '#/components/schemas/LlmGenerationConfig'
          nullable: true
        Text:
          allOf:
            - $ref: '#/components/schemas/AutoGenerationConfig'
          nullable: true
        Title:
          allOf:
            - $ref: '#/components/schemas/AutoGenerationConfig'
          nullable: true
    SegmentationStrategy:
      type: string
      description: >-
        Controls the segmentation strategy:

        - `LayoutAnalysis`: Analyzes pages for layout elements (e.g., `Table`,
        `Picture`, `Formula`, etc.) using bounding boxes. Provides fine-grained
        segmentation and better chunking. (Latency penalty: ~TBD seconds per
        page).

        - `Page`: Treats each page as a single segment. Faster processing, but
        without layout element detection and only simple chunking.
      enum:
        - LayoutAnalysis
        - Page
    TokenizerType:
      oneOf:
        - type: object
          required:
            - Enum
          properties:
            Enum:
              $ref: '#/components/schemas/Tokenizer'
        - type: object
          required:
            - String
          properties:
            String:
              type: string
              description: |-
                Use any Hugging Face tokenizer by specifying its model ID
                Examples: "Qwen/Qwen-tokenizer", "facebook/bart-large"
      description: >-
        Specifies which tokenizer to use for the chunking process.


        This type supports two ways of specifying a tokenizer:

        1. Using a predefined tokenizer from the `Tokenizer` enum

        2. Using any Hugging Face tokenizer by providing its model ID as a
        string

        (e.g. "facebook/bart-large", "Qwen/Qwen-tokenizer", etc.)


        When using a string, any valid Hugging Face tokenizer ID can be
        specified,

        which will be loaded using the Hugging Face tokenizers library.
    FallbackStrategy:
      oneOf:
        - type: string
          description: No fallback will be used
          enum:
            - None
        - type: string
          description: Use the system default fallback model
          enum:
            - Default
        - type: object
          required:
            - Model
          properties:
            Model:
              type: string
              description: Use a specific model as fallback
      description: >-
        Specifies the fallback strategy for LLM processing


        This can be:

        1. None - No fallback will be used

        2. Default - The system default fallback model will be used

        3. Model - A specific model ID will be used as fallback (check the
        documentation for the models.)
    AutoGenerationConfig:
      type: object
      description: >-
        Controls the processing and generation for the segment.

        - `crop_image` controls whether to crop the file's images to the
        segment's bounding box.

        The cropped image will be stored in the segment's `image` field. Use
        `All` to always crop,

        or `Auto` to only crop when needed for post-processing.

        - `html` is the HTML output for the segment, generated either through
        heuristics (`Auto`) or using Chunkr fine-tuned models (`LLM`)

        - `llm` is the LLM-generated output for the segment, this uses
        off-the-shelf models to generate a custom output for the segment

        - `markdown` is the Markdown output for the segment, generated either
        through heuristics (`Auto`) or using Chunkr fine-tuned models (`LLM`)

        - `embed_sources` defines which content sources will be included in the
        chunk's embed field and counted towards the chunk length.

        The array's order determines the sequence in which content appears in
        the embed field (e.g., [Markdown, LLM] means Markdown content

        is followed by LLM content). This directly affects what content is
        available for embedding and retrieval.
      properties:
        crop_image:
          allOf:
            - $ref: '#/components/schemas/CroppingStrategy'
          default: Auto
        embed_sources:
          type: array
          items:
            $ref: '#/components/schemas/EmbedSource'
          default: '[Markdown]'
        html:
          allOf:
            - $ref: '#/components/schemas/GenerationStrategy'
          default: Auto
        llm:
          type: string
          description: Prompt for the LLM model
          nullable: true
        markdown:
          allOf:
            - $ref: '#/components/schemas/GenerationStrategy'
          default: Auto
    LlmGenerationConfig:
      type: object
      description: >-
        Controls the processing and generation for the segment.

        - `crop_image` controls whether to crop the file's images to the
        segment's bounding box.

        The cropped image will be stored in the segment's `image` field. Use
        `All` to always crop,

        or `Auto` to only crop when needed for post-processing.

        - `html` is the HTML output for the segment, generated either through
        heuristics (`Auto`) or using Chunkr fine-tuned models (`LLM`)

        - `llm` is the LLM-generated output for the segment, this uses
        off-the-shelf models to generate a custom output for the segment

        - `markdown` is the Markdown output for the segment, generated either
        through heuristics (`Auto`) or using Chunkr fine-tuned models (`LLM`)

        - `embed_sources` defines which content sources will be included in the
        chunk's embed field and counted towards the chunk length.

        The array's order determines the sequence in which content appears in
        the embed field (e.g., [Markdown, LLM] means Markdown content

        is followed by LLM content). This directly affects what content is
        available for embedding and retrieval.
      properties:
        crop_image:
          allOf:
            - $ref: '#/components/schemas/CroppingStrategy'
          default: Auto
        embed_sources:
          type: array
          items:
            $ref: '#/components/schemas/EmbedSource'
          default: '[Markdown]'
        html:
          allOf:
            - $ref: '#/components/schemas/GenerationStrategy'
          default: LLM
        llm:
          type: string
          description: Prompt for the LLM model
          nullable: true
        markdown:
          allOf:
            - $ref: '#/components/schemas/GenerationStrategy'
          default: LLM
    PictureGenerationConfig:
      type: object
      description: >-
        Controls the processing and generation for the segment.

        - `crop_image` controls whether to crop the file's images to the
        segment's bounding box.

        The cropped image will be stored in the segment's `image` field. Use
        `All` to always crop,

        or `Auto` to only crop when needed for post-processing.

        - `html` is the HTML output for the segment, generated either through
        heuristics (`Auto`) or using Chunkr fine-tuned models (`LLM`)

        - `llm` is the LLM-generated output for the segment, this uses
        off-the-shelf models to generate a custom output for the segment

        - `markdown` is the Markdown output for the segment, generated either
        through heuristics (`Auto`) or using Chunkr fine-tuned models (`LLM`)

        - `embed_sources` defines which content sources will be included in the
        chunk's embed field and counted towards the chunk length.

        The array's order determines the sequence in which content appears in
        the embed field (e.g., [Markdown, LLM] means Markdown content

        is followed by LLM content). This directly affects what content is
        available for embedding and retrieval.
      properties:
        crop_image:
          allOf:
            - $ref: '#/components/schemas/PictureCroppingStrategy'
          default: All
        embed_sources:
          type: array
          items:
            $ref: '#/components/schemas/EmbedSource'
          default: '[Markdown]'
        html:
          allOf:
            - $ref: '#/components/schemas/GenerationStrategy'
          default: LLM
        llm:
          type: string
          description: Prompt for the LLM model
          nullable: true
        markdown:
          allOf:
            - $ref: '#/components/schemas/GenerationStrategy'
          default: LLM
    Tokenizer:
      type: string
      description: >-
        Common tokenizers used for text processing.


        These values represent standard tokenization approaches and popular
        pre-trained

        tokenizers from the Hugging Face ecosystem.
      enum:
        - Word
        - Cl100kBase
        - XlmRobertaBase
        - BertBaseUncased
    CroppingStrategy:
      type: string
      description: |-
        Controls the cropping strategy for an item (e.g. segment, chunk, etc.)
        - `All` crops all images in the item
        - `Auto` crops images only if required for post-processing
      enum:
        - All
        - Auto
    EmbedSource:
      type: string
      enum:
        - HTML
        - Markdown
        - LLM
        - Content
    GenerationStrategy:
      type: string
      enum:
        - LLM
        - Auto
    PictureCroppingStrategy:
      type: string
      description: |-
        Controls the cropping strategy for an item (e.g. segment, chunk, etc.)
        - `All` crops all images in the item
        - `Auto` crops images only if required for post-processing
      enum:
        - All
        - Auto
  securitySchemes:
    ApiKey:
      type: apiKey
      in: header
      name: Authorization

````
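
The `rebalance_chunks` description gives one worked case: 66 splits with a `target_splits_per_chunk` of 20 become 3 chunks of 22 splits each. The sketch below shows one way such an even distribution could be computed and contrasts it with a plain cut every 20 splits. It is an illustration consistent with that single example, not Trieve's actual implementation; both function names are hypothetical, and the behavior for other inputs is an assumption.

```python
def naive_chunk_sizes(total_splits, target_splits_per_chunk=20):
    """Cut every `target` splits; the remainder becomes a short final chunk."""
    if target_splits_per_chunk <= 0:
        raise ValueError("target_splits_per_chunk must be positive")
    full, remainder = divmod(total_splits, target_splits_per_chunk)
    return [target_splits_per_chunk] * full + ([remainder] if remainder else [])


def rebalanced_chunk_sizes(total_splits, target_splits_per_chunk=20):
    """Spread the remainder evenly across chunks (assumed strategy)."""
    if target_splits_per_chunk <= 0:
        raise ValueError("target_splits_per_chunk must be positive")
    if total_splits <= 0:
        return []
    # Keep the number of full-size chunks, but at least one chunk.
    num_chunks = max(1, total_splits // target_splits_per_chunk)
    base, extra = divmod(total_splits, num_chunks)
    # The first `extra` chunks take one additional split each.
    return [base + 1 if i < extra else base for i in range(num_chunks)]


print(naive_chunk_sizes(66, 20))       # [20, 20, 20, 6]
print(rebalanced_chunk_sizes(66, 20))  # [22, 22, 22]
```

Under this assumed strategy, rebalancing avoids a small trailing chunk (6 splits in the naive case) at the cost of chunks slightly larger than the target.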