- NA
Then,
```python
collection_info = qdrant.get_collection(collection_name=collection_name)
collection_info.dict()
```
returns
```
{'status': ,
 'optimizer_status': ,
 'vectors_count': 0,
 'indexed_vectors_count': 0,
 'points_count': 0,
 'segments_count': 8,
 'config': {...}}
```
- AN
could you please share the script to reproduce the issue?
- NA
Sure, one sec
- NA
```python
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.http import models
from tqdm import tqdm


def batch_generator(items: list, batch_size: int) -> list:
    """Generate batches from a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")

rows_mapped = [...]  # this is a list of dicts with text data internal to the company

# compute embeddings for the rows and map them back onto the dicts
BATCH_SIZE = 1000
for batch in tqdm(batch_generator(rows_mapped, batch_size=BATCH_SIZE),
                  total=int(len(rows_mapped) / BATCH_SIZE)):
    descriptions = [row['description'] for row in batch]
    embeddings = model.encode(descriptions, show_progress_bar=False)
    for i, row in enumerate(batch):
        row['embedding'] = embeddings[i]

qdrant = QdrantClient(host="localhost", port=6333)

qdrant.recreate_collection(
    collection_name="inventory",
    vectors_config=models.VectorParams(
        size=model.get_sentence_embedding_dimension(),
        distance=models.Distance.COSINE
    ),
)

collection_info = qdrant.get_collection(collection_name="inventory")
collection_info.dict()

# create point structs
all_points = [
    models.PointStruct(
        id=row['InventoryId'],
        vector=row['embedding'].tolist(),
        payload={
            ...  # metadata about the row added here
        }
    )
    for row in tqdm(rows_mapped)
]  # this should be ~100,000 for my case

# batch upsert the points
BATCH_SIZE = 1000
for point_batch in tqdm(
    batch_generator(all_points, batch_size=BATCH_SIZE),
    total=int(len(all_points) / BATCH_SIZE)
):
    qdrant.upsert(collection_name="inventory", points=point_batch)

collection_info = qdrant.get_collection(collection_name="inventory")
collection_info.dict()  # displays only ~38,000 points
```
- NA
The data is internal to the company, so I have to omit some of that stuff… it comes from a company database so I can't make anything 100% reproducible
- AN
OK, assuming the InventoryIds are all unique, another thing that might happen is that the points are still in the update queue and not yet applied to storage. Could you please try the same with `wait=True` in the upsert method, or just retrieve the collection info again after a few seconds?
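For example, roughly like this (reusing the `qdrant` client, `all_points`, and `batch_generator` from your script; the only change is passing `wait=True` so each call blocks until the batch is persisted):

```python
# re-run the batched upsert, blocking until each batch is applied to storage
for point_batch in batch_generator(all_points, batch_size=1000):
    qdrant.upsert(
        collection_name="inventory",
        points=point_batch,
        wait=True,  # do not return until the points are persisted
    )

# the count should now reflect everything that was written
print(qdrant.get_collection(collection_name="inventory").points_count)
```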
- NA
Sure, I can try that. Yes, InventoryIds are the primary keys of the inventory table I'm embedding.
- NA
So they should all be unique. I can verify that too, however…
- NA
Maybe my SQL query has an error and is returning duplicate rows 😭 Wouldn't be the first time.
- NA
Well, what do you know… I had duplicate Inventory IDs. Sorry for blaming Qdrant, and thanks for the help!!!
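For anyone who hits the same thing: upserting two points with the same ID silently overwrites the first one, which is why the point count came out lower than the row count. A quick sanity check before upserting, assuming `rows_mapped` from the script above, could look like this:

```python
from collections import Counter

# count how many times each InventoryId appears in the rows pulled from SQL
id_counts = Counter(row['InventoryId'] for row in rows_mapped)
duplicates = {id_: n for id_, n in id_counts.items() if n > 1}

print(f"{len(rows_mapped)} rows, {len(id_counts)} unique InventoryIds")
print(f"{len(duplicates)} ids appear more than once")
```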