Last active a year ago
12 replies
8 views
- YU
A new article and the source code: Paragraph Aggregation Retrieval Model (PARM).
It is particularly useful for document-to-document retrieval with longer texts such as legal cases, contracts, patents and others that exceed the maximum sequence length of the encoder model.
They build the index at the level of paragraphs, and run separate queries for each paragraph of the query document.
Since each paragraph of the query document gets its own list of retrieved paragraphs, the same document can show up multiple times, so they aggregate the lists into a single ranked list of documents with what they call Vector-based Reciprocal Rank Fusion (VRRF).
Standard reciprocal rank fusion merges multiple ranked lists, typically sourced from different search systems, into a single ranked list, and it relies only on the ranks, not on the scores.
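For reference, here is a minimal sketch of standard reciprocal rank fusion, assuming the usual formulation in which an item's fused score is the sum of 1 / (k + rank) over every list it appears in. The function name and the toy ids are illustrative only, and k = 60 is the constant commonly used in the RRF literature, not a value taken from the PARM code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of ids into one list, best first."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based: the top item of each list contributes 1 / (k + 1).
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id to keep the output stable.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Three toy rankings, e.g. from three different search systems.
fused = reciprocal_rank_fusion([
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_c", "doc_a"],
])
print(fused)  # ['doc_b', 'doc_a', 'doc_c', 'doc_d']
```

Only the position of each item matters here: the similarity scores returned by the underlying systems never enter the formula.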
With VRRF, they combine the dense vectors with the rank information, and they report that it outperforms the other aggregation methods they compare against.
https://github.com/sophiaalthammer/parm
https://arxiv.org/abs/2201.01614
- YU
Do you think this one may be relevant? @andrey.vasnetsov https://github.com/qdrant/qdrant/issues/345
- AN
Maybe https://github.com/qdrant/qdrant/issues/186 is more relevant, because it should still be possible to store each paragraph as a separate record, but sending multiple queries in one request could significantly speed up this use case.
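To make the idea concrete, here is a toy, in-memory sketch of storing each paragraph as its own record and answering several paragraph queries in one call. It is not Qdrant's API: `batch_search`, the record fields, and the brute-force dot-product scoring are all assumptions made for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def batch_search(index, query_vectors, top_k=2):
    """Return one ranked list of (doc_id, paragraph_id, score) per query vector."""
    results = []
    for query in query_vectors:
        scored = [
            (rec["doc_id"], rec["paragraph_id"], dot(query, rec["vector"]))
            for rec in index
        ]
        scored.sort(key=lambda hit: -hit[2])  # highest similarity first
        results.append(scored[:top_k])
    return results


# Toy index: every paragraph is a separate record tagged with its parent document.
index = [
    {"doc_id": "case_1", "paragraph_id": 0, "vector": [1.0, 0.0]},
    {"doc_id": "case_1", "paragraph_id": 1, "vector": [0.0, 1.0]},
    {"doc_id": "case_2", "paragraph_id": 0, "vector": [0.7, 0.7]},
]

# Two paragraphs of one query document, sent together as a single batch.
hits = batch_search(index, [[1.0, 0.1], [0.1, 1.0]], top_k=2)
for paragraph_hits in hits:
    print([(doc_id, paragraph_id) for doc_id, paragraph_id, _ in paragraph_hits])
```

In a real engine, the loop over queries would run on the server, so the client pays for one round trip instead of one per paragraph.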
- YU
Yeah, batch query is really a great idea, but if we validate the findings of this paper, it can also be optimized on the server side. Suppose a document has 10 paragraphs: we need to retrieve 10 paragraphs * 10 top vectors for each * 768 vector dims = 76,800 floats streamed over the network. But we can avoid the latency of getting those vectors to the client side if VRRF is handled on the server side. It could be a separate endpoint like the /recommend one we currently have.
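The back-of-the-envelope estimate above can be written out as follows. The 4 bytes per float is an assumption (float32 encoding), and the result ignores serialization overhead such as JSON formatting, ids, and payloads.

```python
paragraphs_per_document = 10  # query paragraphs, from the example above
top_k = 10                    # vectors returned per paragraph query
vector_dim = 768              # embedding size
bytes_per_float = 4           # assumption: vectors are sent as float32

floats_streamed = paragraphs_per_document * top_k * vector_dim
payload_bytes = floats_streamed * bytes_per_float
payload_kib = payload_bytes / 1024

print(floats_streamed)  # 76800
print(payload_bytes)    # 307200
print(payload_kib)      # 300.0
```

So, under these assumptions, a single query document moves roughly 300 KiB of raw vector data to the client, and that cost is paid per query document.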
- AN
As I understand from this image, they actually retrieve the "document" associated with the paragraph. Is the document also some kind of vector? Also, is VRRF a trainable function?
- YU
No, it's not a trainable function. But since it is part of the evaluation pipeline for the nDCG and recall metrics, it is still relevant to the training phase. During training, they use DPR (Dense Passage Retrieval), which is state-of-the-art and also a method that we included in the awesome-metric-learning repo.
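As a rough illustration of why no training is involved, here is a sketch of one way to combine dense vectors with reciprocal ranks. It loosely follows the description in the paper and is not the authors' implementation: the function name, the choice to sum the query paragraph vectors, the 1 / (k + rank) weighting, and the dot-product scoring are all assumptions, and the exact formulation in the PARM code may differ.

```python
from collections import defaultdict


def vrrf_rank(query_paragraph_vectors, hits_per_paragraph, k=60):
    """Rank documents from paragraph-level hits; returns doc ids, best first.

    hits_per_paragraph[i] is the ranked list of (doc_id, paragraph_vector)
    retrieved for query paragraph i.
    """
    dim = len(query_paragraph_vectors[0])
    # Query document vector: plain sum of its paragraph vectors (assumption).
    query_vec = [sum(vec[d] for vec in query_paragraph_vectors) for d in range(dim)]
    # Candidate document vector: retrieved paragraph vectors weighted by 1 / (k + rank).
    doc_vecs = defaultdict(lambda: [0.0] * dim)
    for hits in hits_per_paragraph:
        for rank, (doc_id, vec) in enumerate(hits, start=1):
            weight = 1.0 / (k + rank)
            acc = doc_vecs[doc_id]
            for d in range(dim):
                acc[d] += weight * vec[d]
    # Score each document by the dot product with the query document vector.
    scores = {
        doc_id: sum(q * c for q, c in zip(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    }
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


ranking = vrrf_rank(
    query_paragraph_vectors=[[1.0, 0.0], [0.0, 1.0]],
    hits_per_paragraph=[
        [("case_1", [1.0, 0.0]), ("case_2", [0.7, 0.7])],
        [("case_1", [0.0, 1.0]), ("case_3", [0.2, 0.9])],
    ],
)
print(ranking)  # ['case_1', 'case_2', 'case_3']
```

In this sketch, each document does end up with a vector of its own, but it is built on the fly from the retrieved paragraph vectors. The only free parameter is the constant k, and nothing is learned.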
- AN
It is actually an interesting topic: having a special retrieval method for large documents. But the thing is, I don't see an industry standard right now, and even if PARM is SOTA, it is still possible that it will be replaced by some other method soon. So if we are going to implement such a retrieval method, it should either be extendable or implement some very fundamental approach.
- YU
Yeah, you're right, it needs validating in other domains at least. Long document retrieval is interesting, but does it offer an improvement in other domains, such as multiple images of the same object? Then we can be more confident that this approach should be part of the engine.
- AN
Multiple images are a bit different. Usually an image contains information about the whole object from a different angle, while an abstract is only a part of the whole document.
- YU
Yeah, but we don't deal with abstracts here. Some paragraphs of a document may be relevant to some paragraphs of another, and we try to retrieve those documents even if they are relevant only in certain paragraphs. I believe we have a similar case with images: suppose we have images of a building from the inside and the outside. In this case, interior and exterior images may be relevant to completely different images, depending on which part of the building they capture.
- AN
In this case I agree, it is similar. But I was thinking about pictures of the whole object from different sides, like in a landmark identification task.
- YU
Yeah landmark identification is really a big challenge that would probably require a more sophisticated pipeline.