You need an IAM policy that allows you to use the eksctl CLI. The most up-to-date guide is located here. You can use the root account, but AWS does not recommend it.
Create EKS cluster and install needed plugins
The bootstrap-eks.sh script creates the EKS cluster, installs the AWS Load Balancer Controller, and installs the NVIDIA Device Plugin. It also manages any IAM permissions that the plugins need to work.
Download the bootstrap-eks.sh script
Now you can modify your embedding_models.yaml. This defines all the models that you will want to use:
embedding_models.yaml
models:
  # ...
  # myEmbeddingModel:
  #   # The number of replicas you want
  #   replicas: 1
  #   # The huggingface revision
  #   revision: main
  #   # Your huggingface token if you have a private repo
  #   hfToken:
  #   # The end of the URL https://huggingface.co/BAAI/bge-m3
  #   modelName: BAAI/bge-m3
  bgeM3:
    replicas: 2
    revision: main
    # The end of the URL https://huggingface.co/BAAI/bge-m3
    modelName: BAAI/bge-m3
    # If you have a private hugging face repo
    hfToken: ""
  spladeDoc:
    replicas: 2
    # The end of the URL https://huggingface.co/naver/efficient-splade-VI-BT-large-doc
    modelName: naver/efficient-splade-VI-BT-large-doc
    isSplade: true
  spladeQuery:
    replicas: 2
    # The end of the URL https://huggingface.co/naver/efficient-splade-VI-BT-large-doc
    modelName: naver/efficient-splade-VI-BT-large-doc
    isSplade: true
  bge-reranker:
    replicas: 2
    modelName: BAAI/bge-reranker-large
    isSplade: false
  # ...
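Before running the install, it can help to sanity-check the file for typos. The following is a minimal sketch, not part of Trieve Vector Inference: `check_models` is a hypothetical helper, it takes the dictionary a YAML loader would produce from embedding_models.yaml, and the rules it enforces (a `modelName` of the form `org/model`, a positive integer `replicas`, a boolean `isSplade`) are inferred from the example above rather than from a published schema.

```python
from typing import Any


def check_models(config: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in a parsed embedding_models.yaml."""
    problems: list[str] = []
    models = config.get("models")
    if not isinstance(models, dict) or not models:
        return ["top-level 'models' mapping is missing or empty"]
    for name, spec in models.items():
        if not isinstance(spec, dict):
            problems.append(f"{name}: expected a mapping of settings")
            continue
        # modelName is the end of the Hugging Face URL, e.g. BAAI/bge-m3.
        model_name = spec.get("modelName")
        if not isinstance(model_name, str) or model_name.count("/") != 1:
            problems.append(f"{name}: modelName should look like 'org/model'")
        # Assumption: an omitted replicas value is acceptable, so default to 1 here.
        replicas = spec.get("replicas", 1)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(replicas, bool) or not isinstance(replicas, int) or replicas < 1:
            problems.append(f"{name}: replicas should be a positive integer")
        if not isinstance(spec.get("isSplade", False), bool):
            problems.append(f"{name}: isSplade should be true or false")
    return problems


# Part of the example file, written as the dict a YAML loader would produce.
example = {
    "models": {
        "bgeM3": {"replicas": 2, "revision": "main", "modelName": "BAAI/bge-m3", "hfToken": ""},
        "spladeDoc": {"replicas": 2, "modelName": "naver/efficient-splade-VI-BT-large-doc", "isSplade": True},
        "bge-reranker": {"replicas": 2, "modelName": "BAAI/bge-reranker-large", "isSplade": False},
    }
}

if __name__ == "__main__":
    for problem in check_models(example) or ["no problems found"]:
        print(problem)
```

An empty list means none of these inferred checks failed; it does not guarantee that the model exists on Hugging Face or that the cluster can serve it.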
To ensure everything is working, make a request to the model endpoint provided.
# Replace the endpoint with the one you got from the previous step
export ENDPOINT=k8s-default-vectorin-18b7ade77a-2040086997.us-east-2.elb.amazonaws.com
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"inputs": "test input"}' \
  --url "http://$ENDPOINT/embed" \
  -w '\n\nInference Took %{time_total} seconds!\n'
The output should look something like this:
# The vector
[[ 0.038483415, -0.00076982786, -0.020039458 ... ], [ 0.04496114, -0.039057795, -0.022400795, ... ]]

Inference Took 0.067066 seconds!
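The same request can be made from application code. This is a minimal sketch using only the Python standard library; the function names `build_embed_request` and `embed` are hypothetical, and the response shape (a JSON array of vectors) is assumed from the sample output above.

```python
import json
import urllib.request


def build_embed_request(endpoint: str, text: str) -> urllib.request.Request:
    """Build the same POST /embed request that the curl command sends."""
    body = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{endpoint}/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(endpoint: str, text: str, timeout: float = 30.0) -> list[list[float]]:
    """Send the request and decode the JSON array of vectors (assumed shape)."""
    request = build_embed_request(endpoint, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())
```

With a reachable endpoint, `embed(endpoint, "test input")` would return the decoded vectors. Building the request separately from sending it keeps the URL and payload easy to inspect without network access.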
Each ingress point uses its own Application Load Balancer within AWS. The address provided is the model’s endpoint, which you can call for dense embeddings, sparse embeddings, or reranking, depending on the models you chose.
Check out the guides for more information on configuration.
Using SPLADE Models
How to set up a dedicated instance for the sparse SPLADE embedding model
Using Custom Models
How to use private or gated Hugging Face models, or any other models that you want
OpenAI compatibility
Trieve Vector Inference has OpenAI-compatible routes