# Create a new crawl request

> Creates a new crawl request for a dataset. The request payload must contain the crawl options to use for the crawl.
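
As a rough illustration of the request this endpoint expects, the sketch below assembles a `POST /api/crawl` call with Python's standard library. The path, the `TR-Dataset` and `Authorization` headers, and the `crawl_options` wrapper come from the spec on this page; the helper names `build_crawl_request` and `send_crawl_request`, the placeholder key and dataset id, and the choice to pass the API key without any prefix are assumptions made for the example.

```python
import json
import urllib.request

# Production server listed in the spec's `servers` section.
API_BASE = "https://api.trieve.ai"


def build_crawl_request(api_key: str, dataset_id: str, crawl_options: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /api/crawl request.

    Hypothetical helper: the key is placed in the Authorization header as-is,
    which is an assumption; the spec only names the header.
    """
    # CreateCrawlReqPayload requires a top-level `crawl_options` object.
    body = json.dumps({"crawl_options": crawl_options}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/api/crawl",
        data=body,
        method="POST",
        headers={
            "Authorization": api_key,
            "TR-Dataset": dataset_id,
            "Content-Type": "application/json",
        },
    )


def send_crawl_request(req: urllib.request.Request) -> dict:
    """Send the request and decode the JSON body (a CrawlRequest on success)."""
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A call such as `send_crawl_request(build_crawl_request(key, dataset_id, {"site_url": "https://example.com", "limit": 50}))` would then return the created crawl as a dictionary, assuming the key carries the `admin` scope that the spec requires.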


## OpenAPI

````yaml post /api/crawl
openapi: 3.0.3
info:
  title: Trieve API
  description: >-
    Trieve OpenAPI Specification. This document describes all of the operations
    available through the Trieve API.
  contact:
    name: Trieve Team
    url: https://trieve.ai
    email: developers@trieve.ai
  license:
    name: BSL
    url: https://github.com/devflowinc/trieve/blob/main/LICENSE.txt
  version: 0.13.0
servers:
  - url: https://api.trieve.ai
    description: Production server
  - url: http://localhost:8090
    description: Local development server
security: []
tags:
  - name: Invitation
    description: Invitation endpoint. Exists to invite users to an organization.
  - name: Auth
    description: Authentication endpoint. Serves to register and authenticate users.
  - name: User
    description: User endpoint. Enables you to modify user roles and information.
  - name: Organization
    description: >-
      Organization endpoint. Enables you to modify organization roles and
      information.
  - name: Dataset
    description: >-
      Dataset endpoint. Datasets belong to organizations and hold configuration
      information for both client and server. Datasets contain chunks and chunk
      groups.
  - name: Chunk
    description: >-
      Chunk endpoint. Think of chunks as individual searchable units of
      information. The majority of your integration will likely be with the
      Chunk endpoint.
  - name: Chunk Group
    description: >-
      Chunk groups endpoint. Think of a chunk_group as a bookmark folder within
      the dataset.
  - name: Crawl
    description: Crawl endpoint. Used to create and manage crawls for datasets.
  - name: File
    description: >-
      File endpoint. When files are uploaded, they are stored in S3 and broken
      up into chunks with text extraction from Apache Tika. You can upload files
      of pretty much any type up to 1GB in size. See chunking algorithm details
      at `docs.trieve.ai` for more information on how chunking works. Improved
      default chunking is on our roadmap.
  - name: Events
    description: >-
      Notifications endpoint. Files are uploaded asynchronously and events are
      sent to the user when the upload is complete.
  - name: Topic
    description: >-
      Topic chat endpoint. Think of topics as the storage system for gen-ai chat
      memory. Gen AI messages belong to topics.
  - name: Message
    description: >-
      Message chat endpoint. Messages are units belonging to a topic in the
      context of a chat with an LLM. There are system, user, and assistant
      messages.
  - name: Stripe
    description: >-
      Stripe endpoint. Used for the managed SaaS version of this app. Eventually
      this will become a micro-service. Reach out to the team using contact info
      found at `docs.trieve.ai` for more information.
  - name: Health
    description: Health check endpoint. Used to check if the server is up and running.
  - name: Metrics
    description: Metrics endpoint. Used to get information for monitoring.
  - name: Analytics
    description: Analytics endpoint. Used to get information for search and RAG analytics.
  - name: Experiment
    description: Experiment endpoint. Used to create and manage experiments.
paths:
  /api/crawl:
    post:
      tags:
        - Crawl
      summary: Create a new crawl request
      description: >-
        Creates a new crawl request for a dataset. The request payload must
        contain the crawl options to use for the crawl.
      operationId: create_crawl
      parameters:
        - name: TR-Dataset
          in: header
          description: The dataset id to use for the request
          required: true
          schema:
            type: string
            format: uuid
      requestBody:
        description: JSON request payload to create a new crawl
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/CreateCrawlReqPayload'
        required: true
      responses:
        '200':
          description: Crawl created successfully
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/CrawlRequest'
        '400':
          description: Service error relating to creating the crawl
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponseBody'
      security:
        - ApiKey:
            - admin
components:
  schemas:
    CreateCrawlReqPayload:
      type: object
      required:
        - crawl_options
      properties:
        crawl_options:
          $ref: '#/components/schemas/CrawlOptions'
    CrawlRequest:
      type: object
      required:
        - id
        - url
        - status
        - crawl_type
        - crawl_options
        - scrape_id
        - dataset_id
        - created_at
        - attempt_number
      properties:
        attempt_number:
          type: integer
          format: int32
        crawl_options:
          $ref: '#/components/schemas/CrawlOptions'
        crawl_type:
          $ref: '#/components/schemas/CrawlType'
        created_at:
          type: string
          format: date-time
        dataset_id:
          type: string
          format: uuid
        id:
          type: string
          format: uuid
        interval:
          type: string
          nullable: true
        next_crawl_at:
          type: string
          format: date-time
          nullable: true
        scrape_id:
          type: string
          format: uuid
        status:
          $ref: '#/components/schemas/CrawlStatus'
        url:
          type: string
    ErrorResponseBody:
      type: object
      required:
        - message
      properties:
        message:
          type: string
      example:
        message: Bad Request
    CrawlOptions:
      type: object
      description: Options for setting up the crawl which will populate the dataset.
      properties:
        add_chunks_to_dataset:
          type: boolean
          description: >-
            Add chunks to the dataset that the crawl is created for. Defaults to
            true.
          nullable: true
        allow_external_links:
          type: boolean
          description: Option for allowing the crawl to follow links to external websites.
          nullable: true
        body_remove_strings:
          type: array
          items:
            type: string
          description: Text strings to remove from body when creating chunks for each page
          nullable: true
        boost_titles:
          type: boolean
          description: >-
            Boost titles such that keyword matches in titles are prioritized in
            search results. Strongly recommended to leave this on. Defaults to
            true.
          nullable: true
        exclude_paths:
          type: array
          items:
            type: string
          description: URL Patterns to exclude from the crawl
          nullable: true
        exclude_tags:
          type: array
          items:
            type: string
          description: Specify the HTML tags, classes and ids to exclude from the response.
          nullable: true
        heading_remove_strings:
          type: array
          items:
            type: string
          description: >-
            Text strings to remove from headings when creating chunks for each
            page
          nullable: true
        ignore_sitemap:
          type: boolean
          description: Ignore the website sitemap when crawling, defaults to true.
          nullable: true
        include_paths:
          type: array
          items:
            type: string
          description: URL Patterns to include in the crawl
          nullable: true
        include_tags:
          type: array
          items:
            type: string
          description: Specify the HTML tags, classes and ids to include in the response.
          nullable: true
        interval:
          allOf:
            - $ref: '#/components/schemas/CrawlInterval'
          nullable: true
        limit:
          type: integer
          format: int32
          description: Maximum number of pages to crawl. Defaults to 1000.
          nullable: true
        scrape_options:
          allOf:
            - $ref: '#/components/schemas/ScrapeOptions'
          nullable: true
        site_url:
          type: string
          description: The URL to crawl
          nullable: true
        tags:
          type: array
          items:
            type: string
          description: Tags to add to the crawl
          nullable: true
        webhook_metadata:
          description: >-
            Metadata to send back with the webhook call for each successful page
            scrape
          nullable: true
        webhook_urls:
          type: array
          items:
            type: string
          description: URLs to call back via webhook for each successful page scrape
          nullable: true
      example:
        allow_external_links: false
        boost_titles: true
        exclude_tags:
          - '#ad'
          - '#footer'
          - header
          - head
          - navbar
          - footer
          - aside
          - nav
          - form
        heading_remove_strings:
          - Advertisement
          - Sponsored
        ignore_sitemap: true
        include_tags: []
        interval: daily
        limit: 50
        site_url: nedzo.ai
    CrawlType:
      type: string
      enum:
        - firecrawl
        - openapi
        - shopify
        - youtube
    CrawlStatus:
      oneOf:
        - type: string
          enum:
            - Pending
        - type: object
          required:
            - Processing
          properties:
            Processing:
              type: integer
              format: int32
              minimum: 0
        - type: string
          enum:
            - Completed
        - type: string
          enum:
            - Failed
    CrawlInterval:
      type: string
      description: Interval at which specified site should be re-scraped
      enum:
        - daily
        - weekly
        - monthly
    ScrapeOptions:
      oneOf:
        - allOf:
            - $ref: '#/components/schemas/CrawlOpenAPIOptions'
            - type: object
              required:
                - type
              properties:
                type:
                  type: string
                  enum:
                    - openapi
        - allOf:
            - $ref: '#/components/schemas/CrawlShopifyOptions'
            - type: object
              required:
                - type
              properties:
                type:
                  type: string
                  enum:
                    - shopify
        - allOf:
            - $ref: '#/components/schemas/CrawlYoutubeOptions'
            - type: object
              required:
                - type
              properties:
                type:
                  type: string
                  enum:
                    - youtube
      description: Options for including an OpenAPI spec, Shopify settings, or YouTube settings
      discriminator:
        propertyName: type
    CrawlOpenAPIOptions:
      type: object
      title: CrawlOpenAPIOptions
      description: Options for including an openapi spec in the crawl
      required:
        - openapi_schema_url
        - openapi_tag
      properties:
        openapi_schema_url:
          type: string
          description: URL of the OpenAPI JSON schema to be processed alongside the site crawl
        openapi_tag:
          type: string
          description: >-
            Tag to look for to determine if a page should create an openapi
            route chunk instead of chunks from heading-split of the HTML
    CrawlShopifyOptions:
      type: object
      title: CrawlShopifyOptions
      description: Options for crawling Shopify
      properties:
        group_variants:
          type: boolean
          description: >-
            This option will ingest all variants as individual chunks and place
            them in groups by product id. Turning this off will only scrape 1
            variant per product. Defaults to true.
          nullable: true
        tag_regexes:
          type: array
          items:
            type: string
          nullable: true
    CrawlYoutubeOptions:
      type: object
      title: CrawlYoutubeOptions
      description: Options for crawling YouTube
  securitySchemes:
    ApiKey:
      type: apiKey
      in: header
      name: Authorization

````
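
The `CrawlStatus` schema above is a `oneOf` that mixes plain strings (`Pending`, `Completed`, `Failed`) with an object of the form `{"Processing": <non-negative integer>}`, so client code has to handle both shapes. The following is a minimal sketch of one way to normalize the `status` field of a `CrawlRequest` response; the function name `parse_crawl_status` and the `(name, value)` return shape are assumptions for illustration, and the spec does not say what the integer attached to `Processing` measures.

```python
from typing import Optional, Tuple

# String-only variants of CrawlStatus, as listed in the spec.
_SIMPLE_STATUSES = ("Pending", "Completed", "Failed")


def parse_crawl_status(status) -> Tuple[str, Optional[int]]:
    """Normalize a decoded CrawlStatus value into (name, value).

    Hypothetical helper: `value` is None for the string variants and the
    integer carried by the Processing variant otherwise.
    """
    if isinstance(status, str):
        if status in _SIMPLE_STATUSES:
            return status, None
        raise ValueError(f"unknown crawl status: {status!r}")

    if isinstance(status, dict) and set(status) == {"Processing"}:
        value = status["Processing"]
        # The schema requires an integer with minimum 0; bool is excluded
        # explicitly because it is a subclass of int in Python.
        if isinstance(value, int) and not isinstance(value, bool) and value >= 0:
            return "Processing", value

    raise ValueError(f"malformed crawl status: {status!r}")
```

Given a decoded response body `crawl`, `parse_crawl_status(crawl["status"])` yields a pair such as `("Pending", None)` or `("Processing", 3)`, which is easier to branch on than the raw union.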