

Last active 2 months ago

11 replies

1 views

  • NA

    Then,

    collection_info = qdrant.get_collection(collection_name=collection_name)
    collection_info.dict()
    

    returns

    {'status': ,
     'optimizer_status': ,
     'vectors_count': 0,
     'indexed_vectors_count': 0,
     'points_count': 0,
     'segments_count': 8,
     'config': {...}
    }
    
  • AN

    could you please share the script to reproduce the issue?

  • NA

    Sure, one sec

  • NA
    import math
    from typing import Iterator
    
    from sentence_transformers import SentenceTransformer
    from qdrant_client import QdrantClient
    from qdrant_client.http import models
    from tqdm import tqdm
    
    def batch_generator(items: list, batch_size: int) -> Iterator[list]:
        """
        Yield successive batches of at most batch_size items from a list
        """
        for i in range(0, len(items), batch_size):
            yield items[i:i + batch_size]
    
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
    
    rows_mapped = [...] # this is a list of dicts with text data internal to the company
    
    # compute embeddings for the rows and remap back to the dictionary
    BATCH_SIZE = 1000
    for batch in tqdm(batch_generator(rows_mapped, batch_size=BATCH_SIZE), total=math.ceil(len(rows_mapped) / BATCH_SIZE)):
        descriptions = [row['description'] for row in batch]
        embeddings = model.encode(descriptions, show_progress_bar=False)
        for i, row in enumerate(batch):
            row['embedding'] = embeddings[i]
    
    qdrant = QdrantClient(host="localhost", port=6333)
    qdrant.recreate_collection(
        collection_name="inventory",
        vectors_config=models.VectorParams(
            size=model.get_sentence_embedding_dimension(),
            distance=models.Distance.COSINE
        ),
    )
    
    # sanity check: the freshly created collection should be empty
    collection_info = qdrant.get_collection(collection_name="inventory")
    collection_info.dict()
    
    # create point structs
    all_points = [
        models.PointStruct(
            id=row['InventoryId'],
            vector=row['embedding'].tolist(),
            payload={
                ... # metadata about the row added here
            }
        ) for row in tqdm(rows_mapped)
    ] # this should be ~100,000 for my case
    
    # batch upsert the points
    BATCH_SIZE = 1000
    for point_batch in tqdm(
        batch_generator(all_points, batch_size=BATCH_SIZE),
        total=math.ceil(len(all_points) / BATCH_SIZE)
    ):
        qdrant.upsert(collection_name="inventory", points=point_batch)
    
    # read back the same collection that was just written to
    collection_info = qdrant.get_collection(collection_name="inventory")
    collection_info.dict() # displays only ~38,000 points
    
    
  • NA

    The data is internal to the company, so I have to omit some of that stuff… it comes from a company database so I can't make anything 100% reproducible

  • AN

    OK, assuming the InventoryIds are all unique, another thing that might be happening is that the points are still in the update queue and have not yet been applied to storage. Could you please try the same with 'wait=True' in the upsert method, or just try retrieving the collection info again after a few seconds?

  • NA

    Sure. I can try that. Yes, InventoryIds are the primary keys of the inventory table I'm embedding

  • NA

    So, they should all be unique. I can verify that too, however

  • NA

    Maybe my SQL query has an error and is returning duplicate rows 😭 Wouldn't be the first time

  • NA

    Well, what do you know… I had duplicate Inventory IDs. Sorry for blaming Qdrant, and thanks for the help!!!
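
The resolution above comes down to how upsert treats IDs: a point whose ID already exists in the collection overwrites the earlier point instead of adding a new one, so duplicate InventoryId values lower the final points_count. A minimal sketch of a check to run before uploading, assuming rows shaped like the dicts in the script above. `find_duplicate_ids` and `sample_rows` are hypothetical names used for illustration and are not part of qdrant_client:

```python
from collections import Counter


def find_duplicate_ids(rows, id_key="InventoryId"):
    """Return {id: occurrences} for every ID that appears more than once."""
    counts = Counter(row[id_key] for row in rows)
    return {row_id: n for row_id, n in counts.items() if n > 1}


# Illustrative stand-in data; the real rows would come from the database query.
sample_rows = [
    {"InventoryId": 1, "description": "red widget"},
    {"InventoryId": 2, "description": "blue widget"},
    {"InventoryId": 1, "description": "red widget (duplicate row)"},
]

duplicates = find_duplicate_ids(sample_rows)
unique_id_count = len({row["InventoryId"] for row in sample_rows})

# The number of unique IDs is the most points the collection can end up with.
print(f"{len(sample_rows)} rows, {unique_id_count} unique IDs")
print(f"duplicated IDs: {duplicates}")
```

If the number of unique IDs already matches the points_count reported by the collection, the gap is probably explained by the input data rather than by the upload.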

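
The 'wait=True' suggestion from the thread could look roughly like the sketch below. `upsert_in_batches` and `batched` are hypothetical helpers. The only assumption about the client is that it exposes an `upsert(collection_name=..., points=..., wait=...)` method, which QdrantClient does. `RecordingClient` is a stand-in so the sketch runs without a Qdrant server, and the integers stand in for real `models.PointStruct` objects:

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def upsert_in_batches(client, collection_name, points, batch_size=1000):
    """Upsert points batch by batch, passing wait=True on every call."""
    sent = 0
    for batch in batched(points, batch_size):
        # wait=True asks for the call to return only once the update is applied,
        # so a later get_collection is not racing the update queue.
        client.upsert(collection_name=collection_name, points=batch, wait=True)
        sent += len(batch)
    return sent


class RecordingClient:
    """Hypothetical stand-in for QdrantClient that only records its calls."""

    def __init__(self):
        self.calls = []

    def upsert(self, collection_name, points, wait):
        self.calls.append((collection_name, len(points), wait))


client = RecordingClient()
# Integers stand in for PointStruct objects in this illustration.
total_sent = upsert_in_batches(client, "inventory", list(range(2500)))
print(total_sent, client.calls)
```

With a real client, the same helper would be called as `upsert_in_batches(qdrant, "inventory", all_points)`, using the names from the script in the thread.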