I have a query that compares two collections and finds the documents that are "missing" from one side. Both collections (existing and temp) contain about 250K documents.
```
FOR existing IN ExistingCollection
  LET matches = (
    FOR temp IN TempCollection
      FILTER temp._key == existing._key
      RETURN true
  )
  FILTER LENGTH(matches) == 0
  RETURN existing
```
In a single-server environment (DB and Foxx on the same server/container), this runs like lightning: under 0.5 seconds.
However, when I run this in a cluster (single DB server, single Coordinator), even with the DB server and Coordinator on the same physical host (in separate containers), I have to add a `LIMIT 1000` after the initial `FOR existing ...` to keep it from timing out. Even that limited result takes almost 7 seconds!
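That is, something like this (with the `LIMIT` placed directly after the outer `FOR`):

```
FOR existing IN ExistingCollection
  LIMIT 1000
  LET matches = (
    FOR temp IN TempCollection
      FILTER temp._key == existing._key
      RETURN true
  )
  FILTER LENGTH(matches) == 0
  RETURN existing
```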
Looking at the execution plan, I see several `GATHER` nodes after the `LET matches ...` SubqueryNode. From what I can gather, the problem is that the correlated subquery runs once per outer document, and in the cluster the query executes on the Coordinator while the data lives on the DB server, so each of the ~250K subquery executions pays a network round trip.
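If that's the case, a set-based rewrite that computes the key difference in one pass, instead of running one subquery per document, might avoid those round trips. A minimal sketch using AQL's `MINUS()` function (untested on my cluster; same collection names as above):

```
LET obsoleteKeys = MINUS(
  (FOR e IN ExistingCollection RETURN e._key),
  (FOR t IN TempCollection RETURN t._key)
)
FOR key IN obsoleteKeys
  RETURN DOCUMENT(ExistingCollection, key)
```

I don't know whether `MINUS()` would materialize both key arrays on the Coordinator, though, which could be its own problem at 250K keys apiece.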
My question: can this type of operation be done efficiently on a cluster?
I need to detect obsolete (to-be-deleted) documents, but at almost 7 seconds per 1,000 documents this is obviously not a feasible solution.