Does Qdrant's vector cosine distance work coherently with scikit-learn's HashingVectorizer? https://stackoverflow.com/questions/25573997/get-similarity-percent-with-sklearn-hashing-vectorizer

Last active 5 months ago

6 replies

7 views

  • RI

    Does Qdrant's vector cosine distance work coherently with scikit-learn's HashingVectorizer? https://stackoverflow.com/questions/25573997/get-similarity-percent-with-sklearn-hashing-vectorizer

  • AN

    Qdrant is expected to work with dense vectors.

  • RI

    After running several experiments, the sparse vectors from HashingVectorizer seemed to work as I required, but they are more accurate for longer, more dissimilar sequences; shorter, more similar texts seemed to converge on an almost identical float value. E.g. "jn", "jssooon" and "jssnnoo" somehow all returned a result around 0.5 (see the first sketch after the thread). Regardless, for my particular collection it's actually fine.

  • CI

    Hey @Ricky Spanish, this sounds interesting. Could you elaborate on your approach? Would be very helpful! Thank you

  • RI

    I actually improved on that since my last post, but it's a pretty normal approach. I use the HashingVectorizer from scikit-learn since it doesn't require fitting, as my vocabulary would otherwise be far too large to hold in memory, and the sensitivity for some of my tasks really isn't an issue. So I just created the HashingVectorizer using the char-level n-gram option: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html After that there are a few approaches I'm still experimenting with, but basically I took my data of some millions of rows, split each field out into a vector, and created the several million unique vectors from that; I'm still experimenting with different collection architectures to see which gives the best search performance. I did nothing at all to Qdrant: the distance metric was cosine and it just worked as expected in my tests (a sketch of this setup follows after the thread).

  • CI

    thank you, I will look into this in more detail 🙂
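
A minimal sketch of the convergence observation in the second reply, purely as an illustration: the thread does not say which vectorizer settings were used, so the char analyzer, the (1, 2) n-gram range and the reduced n_features below are assumptions, and the actual similarity values depend heavily on them.

```python
# Sketch: pairwise cosine similarity of hashed char n-gram vectors for very short strings.
# The vectorizer settings are assumptions; the thread does not state the ones actually used.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = HashingVectorizer(analyzer="char", ngram_range=(1, 2), n_features=2 ** 12)

texts = ["jn", "jssooon", "jssnnoo"]
vectors = vectorizer.transform(texts)  # sparse rows, L2-normalized by default

# Very short strings share only a handful of hashed n-grams, so their pairwise
# scores tend to bunch together rather than spread out.
print(cosine_similarity(vectors))
```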
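A minimal sketch of the setup described in the last long reply, under stated assumptions: the collection name, endpoint, n-gram range, n_features and example rows are hypothetical, since the thread only says char-level n-grams, cosine distance and an otherwise unmodified Qdrant. The HashingVectorizer output is converted to dense arrays before upsert, in line with the earlier note that Qdrant is expected to work with dense vectors.

```python
# Sketch: HashingVectorizer char n-grams -> dense vectors -> Qdrant collection with cosine distance.
# Names, sizes and the example data are assumptions for illustration only.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sklearn.feature_extraction.text import HashingVectorizer

DIM = 2 ** 12  # assumed; kept small enough to store as dense vectors

# No fitting and no in-memory vocabulary; rows are L2-normalized by default (norm="l2"),
# which pairs naturally with cosine distance.
vectorizer = HashingVectorizer(analyzer="char", ngram_range=(2, 4), n_features=DIM)

client = QdrantClient(url="http://localhost:6333")  # hypothetical endpoint
client.create_collection(
    collection_name="fields",  # hypothetical collection name
    vectors_config=VectorParams(size=DIM, distance=Distance.COSINE),
)

rows = ["first example field", "second example field"]  # stand-in for the real data
dense = vectorizer.transform(rows).toarray()  # dense arrays for Qdrant's dense vectors

client.upsert(
    collection_name="fields",
    points=[
        PointStruct(id=i, vector=vec.tolist(), payload={"text": text})
        for i, (vec, text) in enumerate(zip(dense, rows))
    ],
)

# Query text is hashed the same way, then searched with cosine similarity.
query = vectorizer.transform(["first exmaple feild"]).toarray()[0]
hits = client.search(collection_name="fields", query_vector=query.tolist(), limit=3)
for hit in hits:
    print(hit.score, hit.payload["text"])
```

Nothing Qdrant-specific is tuned here; the hashing trick only fixes the dimensionality and removes the need for a fitted vocabulary, and the collection is created with plain cosine distance as in the replies above.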
