This guide takes ~30 minutes to complete, ~20 minutes of which is waiting for EKS to spin up.

Installation Requirements

Getting your license

Contact us at humans@trieve.ai for a license; pricing details are available on our pricing page.

Check AWS Quotas

Ensure you have sufficient quota for both GPUs and load balancers in your chosen region (a CLI sketch for checking both follows this list):

  1. At least 4 vCPUs for On-Demand G and VT instances in your region of choice.

  2. One load balancer for each model you want to deploy.
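
A minimal way to check both quotas from the CLI, assuming the standard AWS quota codes L-DB2E81BA (Running On-Demand G and VT instances) and L-53DA6B97 (Application Load Balancers per Region):

# vCPU quota for On-Demand G and VT instances (quota code assumed: L-DB2E81BA)
aws service-quotas get-service-quota \
    --service-code ec2 \
    --quota-code L-DB2E81BA \
    --query "Quota.Value"

# Application Load Balancers per Region (quota code assumed: L-53DA6B97)
aws service-quotas get-service-quota \
    --service-code elasticloadbalancing \
    --quota-code L-53DA6B97 \
    --query "Quota.Value"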

Deploying the Cluster

Setting up environment variables

Your AWS Account ID:

export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query "Account" --output text)"

Your AWS Region:

export AWS_REGION=us-east-2

Your Kubernetes cluster name:

export CLUSTER_NAME=trieve-gpu

Your machine types. We recommend g4dn.xlarge, as it is the cheapest GPU instance on AWS. A single small CPU node is also needed for utility workloads:

export CPU_INSTANCE_TYPE=t3.small
export GPU_INSTANCE_TYPE=g4dn.xlarge
export GPU_COUNT=1

Disable AWS CLI pagination (optional):

export AWS_PAGER=""

To use our recommended defaults:

export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query "Account" --output text)"
export AWS_REGION=us-east-2
export CLUSTER_NAME=trieve-gpu
export CPU_INSTANCE_TYPE=t3.small
export GPU_INSTANCE_TYPE=g4dn.xlarge
export GPU_COUNT=1
export AWS_PAGER=""

Trieve Vector Inference (TVI) supports any region that offers the GPU instance type you choose.
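
A quick way to confirm that your chosen region offers the instance type (standard AWS CLI call; it prints the instance type if it is available):

aws ec2 describe-instance-type-offerings \
    --location-type region \
    --filters "Name=instance-type,Values=${GPU_INSTANCE_TYPE}" \
    --region "${AWS_REGION}" \
    --query "InstanceTypeOfferings[].InstanceType" \
    --output text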

Create your cluster

Create EKS cluster and install needed plugins

The bootstrap-eks.sh script creates the EKS cluster, installs the AWS Load Balancer Controller, and installs the NVIDIA Device Plugin. It also sets up any IAM permissions the plugins need.

Download the bootstrap-eks.sh script

wget https://cdn.trieve.ai/bootstrap-eks.sh

Run bootstrap-eks.sh with bash

bash bootstrap-eks.sh

This will take ~25 minutes to complete.
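
Once it finishes, a quick sanity check (assuming your kubeconfig now points at the new cluster) is to verify that the nodes are Ready and that the NVIDIA device plugin is advertising GPU capacity:

kubectl get nodes
# GPU nodes should report a non-zero nvidia.com/gpu capacity
kubectl describe nodes | grep -i "nvidia.com/gpu"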

Install Trieve Vector Inference

Configure embedding_models.yaml

First, download the example configuration file:

wget https://cdn.trieve.ai/embedding_models.yaml

Now modify your embedding_models.yaml. This file defines all the models you want to serve:

embedding_models.yaml
models:
  # ...
  # myEmbeddingModel:
  #   # The number of replicas you want
  #   replicas: 1
  #   # The huggingface revision
  #   revision: main
  #   # Your huggingface token if you have a private repo
  #   hfToken:
  #   # The end of the URL https://huggingface.co/BAAI/bge-m3 
  #   modelName: BAAI/bge-m3
  bgeM3:
    replicas: 2
    revision: main
    # The end of the URL https://huggingface.co/BAAI/bge-m3
    modelName: BAAI/bge-m3 
    # If you have a private hugging face repo
    hfToken: "" 
  spladeDoc:
    replicas: 2
    # The end of the URL https://huggingface.co/naver/efficient-splade-VI-BT-large-doc
    modelName: naver/efficient-splade-VI-BT-large-doc 
    isSplade: true
  spladeQuery:
    replicas: 2
    # The end of the URL https://huggingface.co/naver/efficient-splade-VI-BT-large-query
    modelName: naver/efficient-splade-VI-BT-large-query
    isSplade: true
  bge-reranker:
    replicas: 2
    modelName: BAAI/bge-reranker-large
    isSplade: false
  # ...

Install the helm chart

This helm chart will only work if you subscribe to the AWS Marketplace Listing.

Contact us at humans@trieve.ai if you do not have access to the AWS Marketplace or cannot use AWS marketplace.

1. Log in to the AWS Marketplace ECR repository:

aws ecr get-login-password \
    --region us-east-1 | helm registry login \
    --username AWS \
    --password-stdin 709825985650.dkr.ecr.us-east-1.amazonaws.com
2. Install the helm chart from the Marketplace ECR repository:

helm upgrade -i vector-inference \
    oci://709825985650.dkr.ecr.us-east-1.amazonaws.com/trieve/trieve-embeddings \
    -f embedding_models.yaml
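
You can verify that the release installed and watch the model pods come up (standard helm/kubectl commands; pod names will vary with your model configuration):

helm list
kubectl get pods -w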

Get your model endpoints

kubectl get ingress

The output looks something like this:

NAME                                              CLASS   HOSTS   ADDRESS                                                                  PORTS   AGE
vector-inference-embedding-bge-reranker-ingress   alb     *       k8s-default-vectorin-18b7ade77a-2040086997.us-east-2.elb.amazonaws.com   80      73s
vector-inference-embedding-bgem3-ingress          alb     *       k8s-default-vectorin-25e84e25f0-1362792264.us-east-2.elb.amazonaws.com   80      73s
vector-inference-embedding-spladedoc-ingress      alb     *       k8s-default-vectorin-8af81ad2bd-192706382.us-east-2.elb.amazonaws.com    80      72s
vector-inference-embedding-spladequery-ingress    alb     *       k8s-default-vectorin-10404abaee-1617952667.us-east-2.elb.amazonaws.com   80      3m20s

The ADDRESS field is the endpoint where you can make dense embedding, sparse embedding, or reranker calls, depending on the models you chose.

To ensure everything is working, make a request to the model endpoint provided.

# Replace the endpoint with the one you got from the previous step
export ENDPOINT=k8s-default-vectorin-18b7ade77a-2040086997.us-east-2.elb.amazonaws.com

curl -X POST \
     -H "Content-Type: application/json" \
     -d '{"inputs": "test input"}' \
     --url "http://$ENDPOINT/embed" \
     -w "\n\nInference took %{time_total} seconds!\n"

The output should look something like this:

# The vector
[[ 0.038483415, -0.00076982786, -0.020039458 ... ], [ 0.04496114, -0.039057795, -0.022400795, ... ]]
Inference took 0.067066 seconds!

Using Trieve Vector Inference

Each ingress uses its own Application Load Balancer within AWS. The address provided is the model's endpoint, where you can make dense embedding, sparse embedding, or reranker calls depending on the models you chose.
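
For example, a reranker model can be queried at its own endpoint. This is a sketch assuming the model servers expose a TEI-style /rerank route; the hostname below reuses the bge-reranker address from the example output above:

export RERANK_ENDPOINT=k8s-default-vectorin-18b7ade77a-2040086997.us-east-2.elb.amazonaws.com

curl -X POST \
     -H "Content-Type: application/json" \
     -d '{"query": "What is vector search?", "texts": ["Vector search finds similar items.", "Bananas are yellow."]}' \
     --url "http://$RERANK_ENDPOINT/rerank"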

Check out the guides for more information on configuration.

Optional: Delete the cluster

CLUSTER_NAME=trieve-gpu
REGION=us-east-2

aws eks update-kubeconfig --region ${REGION} --name ${CLUSTER_NAME}

helm uninstall vector-inference
helm uninstall nvdp -n kube-system
helm uninstall aws-load-balancer-controller -n kube-system
eksctl delete cluster --region=${REGION} --name=${CLUSTER_NAME}