It's my first time using Spark and I am running into problems when I try to initialize the DocumentAssembler. I am using Anaconda with Jupyter. I would really appreciate any help I can get! 🙂
I ran the following code:
import sparknlp
from pyspark.sql import SparkSession
from sparknlp.base import DocumentAssembler

spark = sparknlp.start()

spark = SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3") \
    .getOrCreate()

source_text_assembler = DocumentAssembler() \
    .setInputCol('source_text') \
    .setOutputCol('source_text_document') \
    .setIdCol('aid') \
    .setCleanupMode('inplace_full')

source_test_docs = source_text_assembler.transform(spark_df)
source_test_docs.limit(5).toPandas()
And this error was raised:
TypeError                                 Traceback (most recent call last)
Cell In[26], line 3
      1 from sparknlp.base import DocumentAssembler
----> 3 source_text_assembler = DocumentAssembler() \
      4     .setInputCol('source_text') \
      5     .setOutputCol('source_text_document') \
      6     .setIdCol('aid') \
      7     .setCleanupMode('inplace_full')
      9 source_test_docs = source_text_assembler.transform(spark_df)
     10 source_test_docs.limit(5).toPandas()

File ~\spark\spark-3.5.1-bin-hadoop3\python\pyspark\__init__.py:139, in keyword_only.<locals>.wrapper(self, *args, **kwargs)
    137     raise TypeError("Method %s forces keyword arguments." % func.__name__)
    138 self._input_kwargs = kwargs
--> 139 return func(self, **kwargs)

File ~\.conda\Lib\site-packages\sparknlp\base\document_assembler.py:96, in DocumentAssembler.__init__(self)
     94 @keyword_only
     95 def __init__(self):
---> 96     super(DocumentAssembler, self).__init__(classname="com.johnsnowlabs.nlp.DocumentAssembler")
     97     self._setDefault(outputCol="document", cleanupMode='disabled')

File ~\spark\spark-3.5.1-bin-hadoop3\python\pyspark\__init__.py:139, in keyword_only.<locals>.wrapper(self, *args, **kwargs)
    137     raise TypeError("Method %s forces keyword arguments." % func.__name__)
    138 self._input_kwargs = kwargs
--> 139 return func(self, **kwargs)

File ~\.conda\Lib\site-packages\sparknlp\internal\annotator_transformer.py:36, in AnnotatorTransformer.__init__(self, classname)
     34 self.setParams(**kwargs)
     35 self.__class__._java_class_name = classname
---> 36 self._java_obj = self._new_java_obj(classname, self.uid)

File ~\spark\spark-3.5.1-bin-hadoop3\python\pyspark\ml\wrapper.py:86, in JavaWrapper._new_java_obj(java_class, *args)
     84     java_obj = getattr(java_obj, name)
     85 java_args = [_py2java(sc, arg) for arg in args]
---> 86 return java_obj(*java_args)

TypeError: 'JavaPackage' object is not callable
In an attempt to troubleshoot based on what I read online, I printed the following but was still unsure where the issue lies:
print(sys.path)
['C:\\Users\\USER', 'C:\\Users\\USER\\AppData\\Local\\Temp\\spark-263962a5-0df8-4d47-9471-1d46e5918635\\userFiles-c3a490d4-e01c-4b35-a6ce-bf6508b3153f', 'C:\\Users\\USER\\spark\\spark-3.5.1-bin-hadoop3\\python\\lib\\py4j-0.10.9.7-src.zip', 'C:\\Users\\USER\\spark\\spark-3.5.1-bin-hadoop3\\python', 'C:\\Users\\USER', 'C:\\Users\\USER\\.conda\\python311.zip', 'C:\\Users\\USER\\.conda\\DLLs', 'C:\\Users\\USER\\.conda\\Lib', 'C:\\Users\\USER\\.conda', '', 'C:\\Users\\USER\\.conda\\Lib\\site-packages', 'C:\\Users\\USER\\.conda\\Lib\\site-packages\\win32', 'C:\\Users\\USER\\.conda\\Lib\\site-packages\\win32\\lib', 'C:\\Users\\USER\\.conda\\Lib\\site-packages\\Pythonwin']
pip show sparknlp
Name: sparknlp
Version: 1.0.0
Summary: John Snow Labs Spark NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines, that scale easily in a distributed environment.
Home-page: http://nlp.johnsnowlabs.com
Author: John Snow Labs
Author-email:
License: UNKNOWN
Location: C:\Users\USER\.conda\Lib\site-packages
Requires: numpy, spark-nlp
Required-by:
Note: you may need to restart the kernel to use updated packages.
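(Side note: the `sparknlp` package pip reports above is the old 1.0.0 wrapper, which just depends on `spark-nlp`. If it helps, this is the kind of extra check I could run to confirm which copy is actually being imported and which versions of both distributions are installed; it is pure Python introspection, nothing Spark-specific, so treat it as a sketch rather than something from the Spark NLP docs:)

import sparknlp
import importlib.metadata as md

print(sparknlp.__file__)        # which installed copy of sparknlp gets imported
print(md.version("spark-nlp"))  # version of the actual Spark NLP distribution
print(md.version("sparknlp"))   # version of the thin wrapper shown by pip above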
print("Spark NLP version")
print(sparknlp.version())
print("Apache Spark version")
print(spark.version)
Spark NLP version
5.3.3
Apache Spark version
3.5.1
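The only other check I can think of is whether the Spark NLP jar actually makes it onto the JVM classpath of the session I end up using. I am not sure this is the right way to inspect it (it goes through the private `_jsc` attribute), but roughly:

# Best-guess inspection of the running session; `spark` is the session created above.
print(spark.conf.get("spark.jars.packages", "not set"))  # was the packages config applied to this session?
print(spark.sparkContext._jsc.sc().listJars())           # jars the JVM actually loaded (private API)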