First time posting here, so apologies for any unorthodox formatting. I’m working on a project that trains ensembles of combined models, but I’m running into issues handling some of the data formats. I was able to download and use the Gemma and Llama language models from Kaggle without trouble, but I’m struggling to download the BERT models (which I want for preprocessing) and convert them into something usable. The files are in the .pb SavedModel format. So far, I have imported the model data, built the encoder, and have a model loaded from the downloaded files (or at least I think so). This is what I have so far:
import tensorflow as tf
from transformers import BertTokenizer
import kagglehub
import keras
# Download model (assuming api key and access is set)
path = kagglehub.model_download("tensorflow/bert/tensorFlow2/en-wwm-uncased-l-24-h-1024-a-16")
print("Path to model files:", path)
model_path=path
#Build model and encoder with keras and bert tokenizer
model = keras.layers.TFSMLayer(model_path, call_endpoint='serving_default')
encoder = BertTokenizer.from_pretrained(model_path + r'\assets\vocab.txt')
# Proof of concept
print("User:")
input_text = tf.keras.layers.Input(shape=(), dtype=tf.string)
# Tokenize inputs (error here)
tokenize=[encoder(segment) for segment in input_text]
seq_length = 128
bert_pack_inputs = keras.Layer(
    encoder,
    arguments=dict(seq_length=seq_length))  # Optional argument.
The main issue I’m running into is with the tokenizer. When I tokenize my text, it throws the error:
Traceback (most recent call last):
  File "C:\Users\cwaid\example4.py", line 20, in <module>
    tokenize=[encoder(segment) for segment in input_text]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\cwaid\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\keras\src\backend\common\keras_tensor.py", line 120, in __iter__
    raise NotImplementedError(
NotImplementedError: Iterating over a symbolic KerasTensor is not supported.
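From what I can tell from the traceback, keras.layers.Input produces a symbolic placeholder that carries no actual data, while the Hugging Face BertTokenizer is plain eager Python, so iterating over the symbolic tensor fails before any tokenization happens. A minimal, self-contained sketch of what I mean ('bert-base-uncased' is just a stand-in vocabulary here, not the Kaggle model):

import tensorflow as tf
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Calling the tokenizer on a concrete Python string works eagerly:
print(tokenizer("hello world"))
# {'input_ids': [...], 'token_type_ids': [...], 'attention_mask': [...]}

# A Keras Input is symbolic (a placeholder with no data yet), so
# iterating over it raises the same NotImplementedError as above:
symbolic = tf.keras.layers.Input(shape=(), dtype=tf.string)
tokens = [tokenizer(segment) for segment in symbolic]  # raises NotImplementedError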
Given this, I’m not entirely sure my model and encoder are initialized correctly, but I’m stuck on how to fix it, since this is what the Kaggle documentation demonstrates:
https://www.kaggle.com/models/tensorflow/bert/tensorFlow2/en-wwm-uncased-l-24-h-1024-a-16
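As a sanity check on the model half (independent of the tokenizer), I believe the serving signature of the downloaded SavedModel can be inspected directly; a minimal sketch, reusing model_path from above:

import tensorflow as tf

# Load the SavedModel directly and print its serving signature:
loaded = tf.saved_model.load(model_path)
serving_fn = loaded.signatures["serving_default"]
print(serving_fn.structured_input_signature)  # expected input names/shapes/dtypes
print(serving_fn.structured_outputs)          # output names/shapes/dtypes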
For the purposes of my project, I’m not using a Kaggle or Jupyter notebook, because I’m trying to build an isolated system of pretrained models for ensemble learning in a lower-level language; the Python version is just a proof of concept.
I have tried converting the .pb files to a pure Keras model plus meta files, but that isn’t a well-documented path, so I’m trying to avoid it so as not to complicate what I’m doing (although if it’s more suitable, I’m open to it). I also tried converting everything to a PyTorch-based setup, but it seems the data isn’t compatible with a direct translation unless my model and encoder are correct, and again, I don’t know whether that is the case.
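Based on my reading of the error, the direction I’m currently leaning toward is tokenizing eagerly first and only handing concrete tensors to the TFSMLayer, instead of putting the Hugging Face tokenizer inside the Keras graph. This is an untested sketch; the input key names ('input_word_ids', 'input_mask', 'input_type_ids') are my guess from the TF Hub BERT docs, not something I’ve verified against this checkpoint:

import os
import tensorflow as tf
import keras
from transformers import BertTokenizer

seq_length = 128

# Build the tokenizer straight from the vocab file shipped with the SavedModel:
vocab_file = os.path.join(model_path, "assets", "vocab.txt")
tokenizer = BertTokenizer(vocab_file)

# Tokenize eagerly, outside any Keras graph, so there is no symbolic tensor:
enc = tokenizer(
    ["this is a test sentence"],
    padding="max_length",
    truncation=True,
    max_length=seq_length,
    return_tensors="tf",
)

model = keras.layers.TFSMLayer(model_path, call_endpoint="serving_default")

# Feed concrete tensors to the endpoint; the key names here are guesses:
outputs = model({
    "input_word_ids": enc["input_ids"],
    "input_mask": enc["attention_mask"],
    "input_type_ids": enc["token_type_ids"],
})
print(outputs)

If those key names are wrong, the signature printout from the earlier snippet should show the real ones. Is this a reasonable way to use these .pb BERT models outside a notebook, or is there a more direct route?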