I would like your help understanding where I am going wrong in building a working example in which I populate a repository with binary data, index it, and run a contains query.
I have set logging to TRACE and can see the indexing working; however, when I execute the query I always get 0 results.
The repository is backed by a MemoryNodeStore, and I create it like this:
LuceneIndexProvider provider = new LuceneIndexProvider();
Oak oak = new Oak(ns) // ns is a NodeStore MemoryNodeStore
        .with((QueryIndexProvider) provider)
        .with((Observer) provider)
        .with(new LuceneIndexEditorProvider());
repository = new Jcr(oak).createRepository();
Then I populate it in this way:
Node node = rootNode.addNode("node" + i, "nt:unstructured");
byte[] data = ("testo" + i).getBytes();
ByteArrayInputStream bais = new ByteArrayInputStream(data);
Binary binary = session.getValueFactory().createBinary(bais);
try {
    node.setProperty("binaryData", binary);
} finally {
    binary.dispose();
}
node.setProperty("jcr:mimeType", "text/plain");
Then I define the index like this:
Node root = session.getRootNode();
Node oakIndex = root.getNode("oak:index");
Node index = oakIndex.addNode("contentTextIndex", "oak:QueryIndexDefinition");
index.setProperty("type", "lucene");
index.setProperty("async", (String[]) null);
Node indexRules = index.addNode("indexRules", "nt:unstructured");
Node ntBase = indexRules.addNode("nt:base", "nt:unstructured");
Node properties = ntBase.addNode("properties", "nt:unstructured");
Node binaryDataProperty = properties.addNode("binaryData", "nt:unstructured");
binaryDataProperty.setProperty("name", "binaryData");
binaryDataProperty.setProperty("propertyIndex", true);
binaryDataProperty.setProperty("analyzed", true);
Node jcrMimeTypeProperty = properties.addNode("jcr:mimeType");
jcrMimeTypeProperty.setProperty("name", "jcr:mimeType");
jcrMimeTypeProperty.setProperty("propertyIndex", true);
jcrMimeTypeProperty.setProperty("analyzed", true);
Then I search in this way:
String sql2QueryString = "SELECT * FROM [nt:base] WHERE CONTAINS([binaryData], 'testo')";
Query sql2Query = queryManager.createQuery(sql2QueryString, Query.JCR_SQL2);
QueryResult result = sql2Query.execute();
and I read the results in this way:
NodeIterator nodes = result.getNodes();
while (nodes.hasNext()) {
    Node node = nodes.nextNode();
    log.info("Path: " + node.getPath());
    counter++;
}
log.info("Found {} results", counter);
I’m using Oak 1.68.0 with tika-core and tika-parsers-standard-package 2.9.2.
In the logs I can see that the indexing and the text extraction work correctly. If you want, I can attach a full log.
Thank you very much for your help. Best regards
I found that the documentation was misleading. I tried to fix it.
- If aggregation rules are set, you do not need to include binary data separately. When aggregation is used, binary properties are automatically added to the fulltext index (but only there), provided the mime type is set and the node is part of the index.
- If aggregation rules are not used, you need to set nodeScopeIndex to true for the property (see example).
- You need to query the fulltext index, e.g.
  SELECT * FROM [nt:base] WHERE CONTAINS(*, 'testo1')
- You have added "testo" + i, so you also need to query for that, e.g. testo1 (see above), not testo.
- Setting analyzed and propertyIndex for binary properties has no effect. They will not be added to a property index or fulltext index for this property; they will only be added to the fulltext index for this node (document).
- You do not need to index the jcr:mimeType property if you do not have queries for it. A query on this property would have a condition with jcr:mimeType.
Example index definition:
"/oak:index/contentTextIndex": {
  "compatVersion": 2,
  "type": "lucene",
  "tags": ["text"],
  "async": ["async"],
  "queryPaths": ["/tmp"],
  "includedPaths": ["/tmp"],
  "jcr:primaryType": "oak:QueryIndexDefinition",
  "indexRules": {
    "jcr:primaryType": "nam:nt:unstructured",
    "nt:base": {
      "jcr:primaryType": "nt:unstructured",
      "properties": {
        "jcr:primaryType": "nt:unstructured",
        "binaryData": {
          "nodeScopeIndex": true,
          "propertyType": "Binary",
          "name": "jcr:data",
          "jcr:primaryType": "nt:unstructured"
        }
      }
    }
  }
}
Example query:
select * from [nt:base] where
  isdescendantnode('/tmp')
  and contains(*, 'properties')
  option(index tag [text])
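Applied to the setup in the question, the points above might translate into an index definition along these lines. This is only a sketch, not a tested configuration. It assumes the property is still named binaryData, that jcr:mimeType is set on the same node as in the question, and that the index stays synchronous (no async property). The compatVersion value is simply copied from the example above, and the analyzed and propertyIndex flags as well as the jcr:mimeType rule are left out because they are not needed here.

```json
"/oak:index/contentTextIndex": {
  "jcr:primaryType": "oak:QueryIndexDefinition",
  "compatVersion": 2,
  "type": "lucene",
  "indexRules": {
    "jcr:primaryType": "nt:unstructured",
    "nt:base": {
      "jcr:primaryType": "nt:unstructured",
      "properties": {
        "jcr:primaryType": "nt:unstructured",
        "binaryData": {
          "jcr:primaryType": "nt:unstructured",
          "name": "binaryData",
          "nodeScopeIndex": true
        }
      }
    }
  }
}
```

With a definition of this shape, the query to try would be the node-scoped one from the list above, SELECT * FROM [nt:base] WHERE CONTAINS(*, 'testo1'), rather than CONTAINS([binaryData], 'testo').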