Context :
-
Arango 3.12.1
-
inside a docker container
-
large collections (over 2m documents)
-
need to use analyzers
-
nested collections enterprise <– legal_unit
-
use of edges
In my case I would need to use views with text_fr analyzers (go through accents, lower case, uppercase…) and also filter using edges.
-
for instance I have this query:
-
I am trying to filter the collection enterprise based on the linked legal_units with admin_status == 1
-
my collection legal_unit has an index on admin_status ( tho the estimate selectivity is around 0 which is pretty strange because out of 2m documents 1m “1” and the second million has “3”
-
the link between legal_unit and enterprise is 1:1 tho i created a edge for simplicity :
unite_legale_de_l_entreprise
-
here in this case I don’t have SEARCH statements ( I use a query builder, for fields that are enums I use filter , for the other fileds SEARCH ANALYZERS STARTS_WITH)
-
THE QUERY :
LET FILTER_UNITE_LEGALE = (
FOR u IN vw_legal_unit
SEARCH 1==1
FILTER (u.admin_status == "1")
RETURN u._id
)
FOR e0 IN unite_legale_de_l_entreprise
FILTER e0._from IN FILTER_UNITE_LEGALE
FOR e IN vw_enterprise
SEARCH 1==1
FILTER e._id == e0._to AND 1==1
COLLECT WITH COUNT INTO length
RETURN length
here is the explain of the query :
Query String (425 chars, cacheable: true):
LET FILTER_UNITE_LEGALE = (
FOR u IN vw_legal_unit
SEARCH 1==1
FILTER (u.admin_status == "1")
RETURN u._id
)
FOR e0 IN unite_legale_de_l_entreprise
FILTER e0._from IN FILTER_UNITE_LEGALE
FOR e IN vw_enterprise
SEARCH 1==1
FILTER e._id == e0._to AND 1==1
COLLECT WITH COUNT INTO length
RETURN length
Execution plan:
Id NodeType Par Est. Comment
1 SingletonNode 1 * ROOT
18 SubqueryStartNode 1 - LET FILTER_UNITE_LEGALE = ( /* subquery begin */
3 EnumerateViewNode 7695748 - FOR u IN vw_legal_unit SEARCH true /* view query */
4 CalculationNode 7695748 - LET #7 = (u.`admin_status` == "1") /* simple expression */
5 FilterNode 7695748 - FILTER #7
6 CalculationNode 7695748 - LET #8 = u.`_id` /* attribute expression */
19 SubqueryEndNode 1 - RETURN #8 ) /* subquery end */
17 IndexNode ✓ 10 - FOR e0 IN unite_legale_de_l_entreprise /* edge index scan, index only (projections: `_to`) */ LET #11 = e0.`_to`
12 EnumerateViewNode ✓ 21405490 - FOR e IN vw_enterprise SEARCH true /* view query */
13 CalculationNode ✓ 21405490 - LET #10 = ((e.`_id` == #11) && true) /* simple expression */
14 FilterNode ✓ 21405490 - FILTER #10
15 CollectNode ✓ 1 - COLLECT AGGREGATE length = LENGTH() /* count */
16 ReturnNode 1 - RETURN length
Indexes used:
By Name Type Collection Unique Sparse Cache Selectivity Fields Stored values Ranges
17 edge edge unite_legale_de_l_entreprise false false false 99.98 % [ `_from` ] [ ] (e0.`_from` IN FILTER_UNITE_LEGALE)
Optimization rules applied:
Id Rule Name Id Rule Name Id Rule Name
1 move-calculations-up 5 use-indexes 9 reduce-extraction-to-projection
2 move-filters-up 6 remove-filter-covered-by-index 10 optimize-projections
3 move-calculations-up-2 7 handle-arangosearch-views 11 async-prefetch
4 move-filters-up-2 8 remove-unnecessary-calculations-2 12 splice-subqueries
54 rule(s) executed, 1 plan(s) created, peak mem [b]: 0, exec time [s]: 0.00099
I guess since the selectivity estimate of admin_status is around 0 the optimizer decided not to use it … tho, i dont understand why the selectivity would be around 0 for this index :
FOR doc IN unite_legale
COLLECT status = doc.admin_status WITH COUNT INTO count
RETURN {
"admin_status": status,
"count": count
}
[
{
"admin_status": "1",
"count": 1198532
},
{
"admin_status": "3",
"count": 942016
}
]
{
"cacheEnabled": false,
"deduplicate": false,
"estimates": true,
"fields": [
"admin_status"
],
"figures": {
"memory": 31846005,
"cacheInUse": false,
"cacheSize": 0,
"cacheUsage": 0
},
"id": "unite_legale/583153",
"name": "admin_status_legal_idx",
"selectivityEstimate": 9.343401783094796e-7,
"sparse": false,
"type": "persistent",
"unique": false
}
So here are my questions:
– Is my synthax for my use case the right way to do it ?
– is my index on admin_status broken ?
– how could I improve performances ?
I tried several indexing logic,
using the collection over the view
I’d like to use the vertex logic with views but the synthax logic still is a bit odd for me