I want to get constituency trees for French documents. I've tried to install several tools, but all of those I found are quite old and I couldn't get them to work.
- Benepar: it looks very interesting but doesn't seem compatible with Python > 3.9, and it requires an old torch version (see https://github.com/grimavatar/benepar/blob/master/setup.py). Otherwise I'd like to test the CTL library, which seems nice (https://stanfordnlp.github.io/CoreNLP/parser-standalone.html). I also tried SuPar, which could be an alternative, but it's also old and I couldn't get it working (https://github.com/yzhangcs/parser).
- Stanford CoreNLP/Stanza: the most recent version of the Stanza French model doesn't implement constituency parsing (https://stanfordnlp.github.io/stanza/constituency.html); I didn't find another model on HF. So I'm now trying to use the CoreNLP standalone parser, which has a French model (https://stanfordnlp.github.io/CoreNLP/parser-standalone.html) available as a .jar file: stanford-corenlp-4.2.1-models-french.jar.
- If there is no French model that provides constituency parsing, is it possible to use Stanza with the .jar model I found?
- Could someone provide a command to use with the .jar model? In the docs there is this example, which requires a .gz model (probably installed with CoreNLP?):
java -Xmx2g -cp "*" edu.stanford.nlp.parser.nndep.DependencyParser -model edu/stanford/nlp/models/parser/nndep/UD_French.gz -tagger.model edu/stanford/nlp/models/pos-tagger/french-ud.tagger -tokenized -textFile example.txt -outFile example.txt.out
EDIT: the above command works (I only had to place the .jar file in the working directory), but it does not produce a constituency tree. This is explained in the README of the parser:
The only provided French constituency parser is a shift-reduce parser. At this
time running the shift-reduce parser on French text requires running a pipeline
with the full Stanford CoreNLP package.
I managed to obtain a tree by using the whole CoreNLP package and this command (with the .jar file in the directory):
java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-french.properties -annotators tokenize,ssplit,pos,parse -file example.txt -outputFormat text
(doc: https://stanfordnlp.github.io/CoreNLP/parse.html)
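To work with the resulting tree in Python afterwards, something like the following might do. It is only a minimal sketch: it assumes the tree has been copied out of the output as one Penn-style bracketed string, the names parse_bracketed and leaves are my own, and the sentence and labels in the usage line are made up for illustration (not actual CoreNLP output).

```python
import re


def parse_bracketed(text):
    """Parse a Penn-style bracketed tree into nested (label, children) tuples.

    Words (leaves) are kept as plain strings.
    """
    # Split into "(", ")" and runs of characters that are neither
    # whitespace nor parentheses (labels and words).
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1  # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    try:
        tree = read_node()
    except IndexError:
        raise ValueError("unbalanced brackets") from None
    if pos != len(tokens):
        raise ValueError("trailing text after the tree")
    return tree


def leaves(tree):
    """Return the words of a parsed tree, left to right."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words
```

Usage would be along the lines of leaves(parse_bracketed("(ROOT (SENT (NP (DET Le) (NOUN chat)) (VN (VERB dort))))")), which gives back the words in order.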
Now it would be great to be able to integrate the model with Python Stanza…
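One route I am considering is the CoreNLP client that ships with Stanza (stanza.server.CoreNLPClient), which starts the Java server and returns the annotations to Python. The sketch below is untested on my side and rests on several assumptions: that the CORENLP_HOME environment variable points to the CoreNLP directory containing the French models .jar, that properties="french" can be combined with the same annotators as in my command above, and that the memory and timeout values are adequate. The helper names to_bracketed and parse_french are my own.

```python
def to_bracketed(node):
    """Render a parse-tree node as a bracketed string.

    Works on any object exposing .value (label or word) and .child
    (list of sub-nodes), which is the shape of the ParseTree message
    returned by the client.
    """
    children = list(node.child)
    if not children:
        return node.value  # a word
    inner = " ".join(to_bracketed(c) for c in children)
    return "(" + node.value + " " + inner + ")"


def parse_french(text):
    """Return one bracketed constituency tree per sentence of `text`.

    Assumption: CORENLP_HOME is set and the French models .jar sits in
    that directory. Combining properties="french" with an explicit
    annotator list is also an assumption on my part.
    """
    # Imported lazily so to_bracketed can be tried without Stanza installed.
    from stanza.server import CoreNLPClient

    with CoreNLPClient(
        properties="french",
        annotators=["tokenize", "ssplit", "pos", "parse"],
        timeout=30000,
        memory="4G",
        be_quiet=True,
    ) as client:
        ann = client.annotate(text)
        return [to_bracketed(s.parseTree) for s in ann.sentence]
```

Calling parse_french("Le chat dort.") would then be the Python equivalent of the command line above, if the assumptions hold.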
(N.B.: I also found these questions very useful: Benepar for syntactic segmentation; Stanford NLP: Constituency parser in French)
Many thanks in advance!