## Inspiration

SaaS offerings for text embeddings have two major issues:

  1. They have higher latency due to batch processing.
  2. They have heavy rate limits.

Trieve Vector Inference was created so you can host dedicated embedding servers in your own cloud.
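
As a sketch of what a dedicated deployment looks like in practice, a request goes straight to an instance in your own VPC with no shared rate limits. The endpoint address and request shape below are assumptions (they follow the common text-embeddings-inference-style `/embed` route); check your deployment for the exact API.

```bash
# Hypothetical endpoint: substitute the address of your own TVI deployment.
# The /embed route and "inputs" field follow the text-embeddings-inference
# convention and may differ in your setup.
curl -X POST http://your-tvi-endpoint/embed \
  -H "Content-Type: application/json" \
  -d '{"inputs": "The quick brown fox jumps over the lazy dog."}'
```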

## Performance Difference

Benchmarks were run using wrk2 for 30 seconds with 12 threads and 40 active connections.

The test machine was an m5.large instance in us-west-1.
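
A run like this can be approximated with the wrk2 invocation below. The thread count, connection count, and duration match the setup above; the Lua request body, target URL, and `-R` throughput target are assumptions (wrk2 requires a constant-rate target, and ~11 req/s over 30 s lines up with the 324 requests recorded).

```bash
# Write a wrk2 Lua script that turns each request into a JSON POST.
# The body and route are illustrative; match them to your endpoint's API.
cat > embed.lua <<'EOF'
wrk.method = "POST"
wrk.headers["Content-Type"] = "application/json"
wrk.body = '{"inputs": "The quick brown fox jumps over the lazy dog."}'
EOF

# 12 threads, 40 connections, 30 seconds, as in the benchmark setup.
# -R sets wrk2's constant throughput target (required by wrk2).
wrk -t12 -c40 -d30s -R11 -s embed.lua http://your-tvi-endpoint/embed
```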

|                 | OpenAI Cloud | Jina AI Cloud* | Jina (SageMaker)** | TVI Jina | TVI BGE-M3 | TVI Nomic |
| --------------- | ------------ | -------------- | ------------------ | -------- | ---------- | --------- |
| P50 Latency     | 193.15 ms    | 179.33 ms      | 185.21 ms          | 19.06 ms | 14.69 ms   | 21.36 ms  |
| P90 Latency     | 261.25 ms    | 271.87 ms      | 296.19 ms          | 23.09 ms | 16.90 ms   | 29.81 ms  |
| P99 Latency     | 621.05 ms    | 402.43 ms      | 306.94 ms          | 24.27 ms | 18.80 ms   | 30.29 ms  |
| Requests Made   | 324          | 324            | 324                | 324      | 324        | 324       |
| Requests Failed | 0            | 0              | 3                  | 0        | 0          | 0         |

* Failed requests occurred when rate limiting was hit (the Jina AI rate limit is 60 RPM, or 300 RPM on the premium plan).

** jina-embeddings-v2-base-en deployed on SageMaker with an ml.g4dn.xlarge instance.
