I know I can use MLflow to track LLM models and evaluate their performance.
For OpenAI models, I can use the following code snippet to log the model and evaluate it:
import mlflow
import openai

with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences"
    basic_qa_model = mlflow.openai.log_model(
        model="gpt-4o-mini",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        targets="ground_truth",  # specify which column corresponds to the expected output
        model_type="question-answering",  # model type indicates which metrics are relevant for this task
        evaluators="default",
    )

results.metrics
However, I now want to track a model such as Claude 3.5 Sonnet running on Amazon Bedrock with MLflow, but I couldn't find a direct integration or documentation for this scenario.
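The closest I have gotten is wrapping the Bedrock call in a custom mlflow.pyfunc.PythonModel and evaluating that, roughly as sketched below. Note that the Bedrock model ID, region, and request-body format are my assumptions based on the Anthropic Messages API on Bedrock, and eval_df is the same evaluation DataFrame as in the OpenAI example:

import json

import boto3
import mlflow
import pandas as pd


class BedrockQAModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # Bedrock runtime client; assumes AWS credentials and region are already configured
        self.client = boto3.client("bedrock-runtime", region_name="us-east-1")
        self.model_id = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # assumed model ID

    def predict(self, context, model_input: pd.DataFrame):
        answers = []
        for question in model_input["question"]:
            # Anthropic Messages request body as Bedrock expects it (my assumption)
            body = json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 256,
                "system": "Answer the following question in two sentences",
                "messages": [{"role": "user", "content": question}],
            })
            response = self.client.invoke_model(modelId=self.model_id, body=body)
            payload = json.loads(response["body"].read())
            answers.append(payload["content"][0]["text"])
        return answers


with mlflow.start_run():
    model_info = mlflow.pyfunc.log_model(
        artifact_path="bedrock_model",
        python_model=BedrockQAModel(),
    )
    results = mlflow.evaluate(
        model_info.model_uri,
        eval_df,
        targets="ground_truth",
        model_type="question-answering",
        evaluators="default",
    )

I'm not sure whether a custom pyfunc wrapper like this is the intended pattern for Bedrock models or whether MLflow offers a more direct route.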
Could someone provide guidance or a code example on how to log and evaluate LLM models from Bedrock with MLflow, similar to how it’s done with OpenAI models?
Any help or pointers would be appreciated!