I’m working with Spark NLP and PySpark, and my objective is to add another layer to BertSentenceEmbeddings. My input to BertSentenceEmbeddings is a paragraph with multiple sentences, which results in an array of final embeddings, one per sentence.
My goal is to add a final pooling layer to BertSentenceEmbeddings so that I get one embedding for the entire paragraph instead of an array of per-sentence embeddings. Is there a Pythonic way to access the last layer of the model via PySpark (or Python) so that I can add a final layer?
Today I pool the final embeddings obtained from the Spark pipeline through DataFrame manipulation; I’d like to make the pooling part of the model output instead, to streamline the process.
import sparknlp
from sparknlp.base import DocumentAssembler, EmbeddingsFinisher
from sparknlp.annotator import SentenceDetectorDLModel, BertSentenceEmbeddings

spark = sparknlp.start()

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document") \
    .setCleanupMode("shrink")

sentenceDetector = SentenceDetectorDLModel \
    .pretrained("sentence_detector_dl", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentences") \
    .setExplodeSentences(False)

sent_embeddings = BertSentenceEmbeddings \
    .pretrained("sent_bert_wiki_books", "en") \
    .setInputCols("sentences") \
    .setOutputCol("sentence_embeddings")

embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCols("finished_embeddings") \
    .setOutputAsVector(True)
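For context, the DataFrame-side pooling I do today amounts to mean pooling over the per-sentence vectors that come out of the EmbeddingsFinisher. A minimal sketch in plain Python (no Spark; the helper name mean_pool is my own, not a Spark NLP API):

```python
def mean_pool(vectors):
    """Average a list of equal-length sentence vectors into one paragraph vector."""
    if not vectors:
        raise ValueError("need at least one sentence embedding")
    n = len(vectors)
    # zip(*vectors) iterates component-wise across the sentence vectors
    return [sum(component) / n for component in zip(*vectors)]

# toy example: three 4-dimensional "sentence" embeddings
sentence_vecs = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]
paragraph_vec = mean_pool(sentence_vecs)  # -> [2.0, 2.0, 2.0, 2.0]
```

In the pipeline itself this could be wrapped in a pyspark.sql.functions.udf applied to the finished_embeddings column, but that still happens in the DataFrame plane, not inside the model, which is what I’m trying to avoid.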
Note: I’m using PySpark 3.3.1.