I don't want to hand-write extraction rules for relation triples, as we do with spaCy in the example below (there are many of them, and I don't have the expertise to write them all):
# (...)
# Extract relations based on nouns and prepositions
if token.dep_ == "prep":
    subject = [w for w in token.head.lefts if w.dep_ == "nsubj"]
    obj = [w for w in token.rights if w.dep_ == "pobj"]
    if subject and obj:
        relations.append((subject[0].text, token.text, obj[0].text))
# Extract relations based on nouns and their predicate attributes
if token.dep_ == "attr":
    subject = [w for w in token.head.lefts if w.dep_ == "nsubj"]
    if subject:
        relations.append((subject[0].text, token.head.lemma_, token.text))
# (...)
Because of this, I want to use CoreNLP + Universal Dependencies to extract the relations. I'm using pt_bosque_models. Some details below:
- UD model link: http://nlp.stanford.edu/software/stanfordnlp_models/0.2.0/pt_bosque_models.zip
- UD version: 2.14
- CoreNLP version: 4.5.7
To start the server, I'm using this command:
java -cp "stanford-corenlp-4.5.7.jar" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-portuguese.properties -port 9000 -timeout 15000
The content of my StanfordCoreNLP-portuguese.properties file is:
annotators = tokenize,ssplit,pos,lemma,depparse
#tokenize.language = pt
ssplit.eolonly = true
# Dependency model
depparse.model = edu/stanford/nlp/models/parser/nndep/UD_Portuguese-Bosque.gz
The following files are in UD_Portuguese-Bosque.gz:
LICENSE.txt
pt_bosque-ud-dev.conllu
pt_bosque-ud-dev.txt
pt_bosque-ud-test.conllu
pt_bosque-ud-test.txt
pt_bosque-ud-train.conllu
pt_bosque-ud-train.txt
README.md
stats.xml
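The listing above shows treebank data (.conllu and .txt files) rather than a single trained model file, so it may be worth checking what kind of container the file really is before pointing depparse.model at it. This is only a diagnostic sketch: the function name describe_model_file is hypothetical, and the path in the usage comment is an assumption about where the file is stored locally.

```python
import gzip
import zipfile
from pathlib import Path


def describe_model_file(path):
    """Report the container type of a file and, for zip archives, its members."""
    path = Path(path)
    with path.open("rb") as fh:
        magic = fh.read(4)

    # gzip streams start with the two bytes 0x1f 0x8b
    if magic[:2] == b"\x1f\x8b":
        with gzip.open(path, "rb") as gz:
            head = gz.read(80)  # first decompressed bytes, for inspection
        return {"kind": "gzip", "head": head}

    # zip archives are detected by the standard library helper
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            return {"kind": "zip", "members": zf.namelist()}

    return {"kind": "unknown", "head": magic}


# Usage (path is an assumption):
# print(describe_model_file("UD_Portuguese-Bosque.gz"))
```

If the result is a zip archive whose members are the .conllu files listed above, the file is the treebank itself and not something the parser can load as a model.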
This is my Python request script:
import requests
# URL of the Stanford CoreNLP server
url = 'http://[::1]:9000'
# Example sentence
sentence = "Qual é a opinião de Carl Sagan sobre a possibilidade de formas de vida baseadas em elementos diferentes do carbono e água?"
# Request parameters
params = {
    'annotators': 'depparse,ner',
}
#tokenize,ssplit,pos,lemma,depparse,
# Request payload
data = {
    'data': sentence
}
# Send the request to the CoreNLP server
response = requests.post(url, params=params, data=data)
# Check whether the request succeeded
print(response)
if response.status_code == 200:
    result = response.json()
    for sentence in result['sentences']:
        for triple in sentence['openie']:
            print("Extracted relation:", triple['subject'], triple['relation'], triple['object'])
else:
    print("Error while sending the request to the Stanford CoreNLP server.")
The Problem:
If "depparse" is present in params, I get this error:
java.lang.NumberFormatException: For input string: "MDA4MTs2NmE2ZTExZDtDaHJvbWU7"
java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
java.base/java.lang.Integer.parseInt(Integer.java:668)
java.base/java.lang.Integer.parseInt(Integer.java:786)
edu.stanford.nlp.parser.nndep.DependencyParser.loadModelFile(DependencyParser.java:539)
edu.stanford.nlp.parser.nndep.DependencyParserCache$DependencyParserSpecification.loadModelFile(DependencyParserCache.java:53)
edu.stanford.nlp.parser.nndep.DependencyParserCache.loadFromModelFile(DependencyParserCache.java:76)
edu.stanford.nlp.parser.nndep.DependencyParser.loadFromModelFile(DependencyParser.java:498)
If only "ner" is present in params, the request returns without errors but also without any relation information. I get only entities and tokens, and the result has no openie key, which raises an error on the line "for triple in sentence['openie']:".
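The missing-key error can be avoided by reading the response defensively, and once a dependency parser loads, its edges should be available in the same response. The sketch below assumes the server's JSON output stores edges under basicDependencies with governorGloss, dep, and dependentGloss fields, and that the openie list appears only when the openie annotator was requested. Both helper names are hypothetical, and the sample dictionary is hand-written for illustration, not real parser output.

```python
def dependency_triples(result):
    """Yield (governor, relation, dependent) word triples from a CoreNLP JSON result."""
    for sent in result.get("sentences", []):
        # .get() returns an empty list when no dependency parse is present
        for edge in sent.get("basicDependencies", []):
            yield (edge["governorGloss"], edge["dep"], edge["dependentGloss"])


def openie_triples(result):
    """Yield (subject, relation, object) triples, or nothing if 'openie' is absent."""
    for sent in result.get("sentences", []):
        for triple in sent.get("openie", []):
            yield (triple["subject"], triple["relation"], triple["object"])


# Hand-written sample mimicking the assumed response layout (not real output).
sample = {
    "sentences": [
        {
            "basicDependencies": [
                {"dep": "ROOT", "governor": 0, "governorGloss": "ROOT",
                 "dependent": 1, "dependentGloss": "Qual"},
                {"dep": "nsubj", "governor": 1, "governorGloss": "Qual",
                 "dependent": 4, "dependentGloss": "opinião"},
            ],
            "tokens": [],
        }
    ]
}

print(list(dependency_triples(sample)))
print(list(openie_triples(sample)))  # empty list: the sample has no 'openie' key
```

With this shape, a response that lacks openie yields no triples instead of raising KeyError, and the dependency edges can be inspected directly.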
Any suggestions?