I am trying to obtain a standardised output using few-shot prompt templates in LangChain.
Suppose I have the following code:
Using the mtcars data:
import statsmodels.api as sm
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats import anova
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = pd.DataFrame(mtcars)
model = smf.ols(formula='np.log(mpg) ~ np.log(wt)', data=mtcars).fit()
print(anova.anova_lm(model))
print(anova.anova_lm(model).F["np.log(wt)"])
X=mtcars.iloc[:,2:]
Y=mtcars.mpg
X2 = sm.add_constant(X)
est = sm.OLS(Y, X2).fit()
print(est.summary2())
from langchain_openai import ChatOpenAI
from langchain.chains.llm import LLMChain
from langchain_core.prompts import PromptTemplate
from dotenv import load_dotenv
_ = load_dotenv()
api_key = "sk-proj-OPENAI API KEY"
llm = ChatOpenAI(temperature=0, api_key=api_key)
prompt_template = PromptTemplate.from_template(
template="Interpret the coefficients from the following regression model {OLS_model}"
)
prompt = prompt_template.format(
OLS_model=model.summary()
)
print(prompt)
# llm.predict is deprecated; invoke returns a message whose .content is the text
response = llm.invoke(prompt).content
print(response)
This gives me the following output:
In this regression model, the coefficient for the intercept is 3.9018. This means that when the independent variable (np.log(wt)) is zero, the expected value of the dependent variable (np.log(mpg)) is 3.9018.
The coefficient for np.log(wt) is -0.8418. This indicates that for every one unit increase in np.log(wt), the expected value of np.log(mpg) decreases by 0.8418 units.
Both coefficients are statistically significant with p-values less than 0.05, indicating that they have a significant impact on the dependent variable. The R-squared value of 0.806 suggests that the model explains 80.6% of the variance in the dependent variable.
I want to standardise the output: whichever regression model I pass in, the response should follow the same structure.
Using the penguins data:
import pandas as pd
import seaborn as sns
from palmerpenguins import load_penguins
import statsmodels.api as sm
import matplotlib.pyplot as plt
penguins = load_penguins()
penguins = penguins.dropna()
X = penguins[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']]
y = penguins['body_mass_g']
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
print(model.summary())
prompt_template = PromptTemplate.from_template(
template="Interpret the coefficients from the following regression model {OLS_model}"
)
prompt = prompt_template.format(
OLS_model=model.summary()
)
print(prompt)
# llm.predict is deprecated; invoke returns a message whose .content is the text
response = llm.invoke(prompt).content
print(response)
Which gives the following output:
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.44e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
- The intercept coefficient (const) is -6445.4760, indicating the estimated body mass when all other independent variables are zero.
- The coefficient for bill_length_mm is 3.2929, suggesting that for every unit increase in bill length, the body mass is expected to increase by 3.2929 units.
- The coefficient for bill_depth_mm is 17.8364, indicating that for every unit increase in bill depth, the body mass is expected to increase by 17.8364 units.
- The coefficient for flipper_length_mm is 50.7621, suggesting that for every unit increase in flipper length, the body mass is expected to increase by 50.7621 units.
Overall, the model has an R-squared value of 0.764, indicating that approximately 76.4% of the variance in body mass can be explained by the independent variables included in the model. The F-statistic is significant at a very low p-value, suggesting that the overall model is statistically significant.
However, the two outputs differ in structure and focus. I want to use few-shot prompting to give the model a very specific structure for the analysis so that the outputs are consistent,
i.e. provide it with a few examples of how to analyse a regression model based on the p-values, the signs of the coefficients, and R-squared.
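To make the target structure concrete, here is a minimal plain-Python sketch of the kind of fixed layout I would like every answer to follow. The helper names (`describe_term`, `describe_model`) and the 0.05 threshold are my own illustration, not part of LangChain or statsmodels. The numbers are copied from the summary table in my example further down. With a fitted statsmodels result, the inputs would presumably come from `model.params`, `model.pvalues` and `model.rsquared`.

```python
def describe_term(name, coef, p_value, alpha=0.05):
    """Return one fixed-format line for a single regression term."""
    direction = "positive" if coef > 0 else "negative" if coef < 0 else "zero"
    verdict = "significant" if p_value < alpha else "not significant"
    return (f"{name}: coefficient {coef:.4f} ({direction}), "
            f"p-value {p_value:.3f} ({verdict} at alpha={alpha})")


def describe_model(coefs, p_values, r_squared, alpha=0.05):
    """Build the full summary: one line per term, then the R-squared line."""
    lines = [describe_term(n, coefs[n], p_values[n], alpha) for n in coefs]
    lines.append(f"R-squared: {r_squared:.3f} "
                 f"({r_squared:.1%} of the variance explained)")
    return "\n".join(lines)


# Values copied from the example summary table (const, X1, X2, X3)
coefs = {"const": 0.2395, "X1": 1.1338, "X2": -0.0567, "X3": -0.4178}
p_values = {"const": 0.040, "X1": 0.000, "X2": 0.248, "X3": 0.000}
report = describe_model(coefs, p_values, r_squared=0.906)
print(report)
```

Text in this shape could serve as the "answer" field of each few-shot example, so that every example demonstrates the same layout.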
Example I was working on: Suppose we are trying to predict / explain company sales.
from langchain_core.prompts.few_shot import FewShotPromptTemplate
from langchain_core.prompts.prompt import PromptTemplate
examples = [
{
"instruction": "Interpret the following Fama and French model:",
"statistical_model": """
==============================================================================
Dep. Variable: Excess_Return R-squared: 0.906
Model: OLS Adj. R-squared: 0.905
Method: Least Squares F-statistic: 680.9
Date: Fri, 12 Jul 2024 Prob (F-statistic): 3.36e-108
Time: 19:54:44 Log-Likelihood: -413.31
No. Observations: 215 AIC: 834.6
Df Residuals: 211 BIC: 848.1
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.2395 0.116 2.066 0.040 0.011 0.468
X1 1.1338 0.026 42.792 0.000 1.082 1.186
X2 -0.0567 0.049 -1.157 0.248 -0.153 0.040
X3 -0.4178 0.035 -11.951 0.000 -0.487 -0.349
==============================================================================
Omnibus: 0.660 Durbin-Watson: 2.269
Prob(Omnibus): 0.719 Jarque-Bera (JB): 0.762
Skew: -0.123 Prob(JB): 0.683
Kurtosis: 2.843 Cond. No. 4.93
==============================================================================
""",
"answer": """
Factor Interpretation
Constant (const)
Coefficient: 0.2395
This represents the average monthly SALES of Company X. It is statistically significant (p-value: 0.040), suggesting positive sales unrelated to the other variables.
X1
Coefficient: 1.1338
This indicates that Company X sales have a strong positive relationship with the number of hires. A 1 percent increase in NEW HIRES is associated with a 1.1338 percent increase in COMPANY X SALES, on average. The coefficient is highly significant (p-value: 0.000), highlighting the strong dependency of SALES on new hires.
X2
Coefficient: -0.0567
This suggests a slight negative relationship between Company X sales and the education of the managers, but the relationship is not statistically significant (p-value: 0.248). This means that Company X monthly sales are not significantly affected by the education of its managers.
X3
Coefficient: -0.4178
Company X shows a significant negative relationship with this variable. A 1 percent increase in tech spending is associated with a 0.4178 percent decrease in Company X sales, on average. The coefficient is highly significant (p-value: 0.000), indicating that Company X tends to perform inversely to investments in tech.
"""
},
{
"instruction": "Interpret the following regression model:",
"statistical_model": "No model available",
"answer": """
Uses Some other dataset/model
"""
},
]
example_prompt = PromptTemplate(
input_variables=["instruction", "statistical_model", "answer"],
template="Question: {instruction}\n{statistical_model}\n{answer}"
)
print(example_prompt.format(**examples[1]))
# FewShotPromptTemplate takes the few-shot examples and the formatter for those examples.
prompt = FewShotPromptTemplate(
# List of few-shot examples: one dict per example, keyed by the example_prompt input variables.
examples=examples,
# PromptTemplate that formats each few-shot example into a string.
example_prompt=example_prompt,
suffix="Question: {input}",
input_variables=["input"],
)
llm = ChatOpenAI(temperature=0, api_key=api_key)
#chain = LLMChain(llm=llm, prompt=prompt)
chain = prompt | llm
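# Use an f-string so that model.summary() is actually interpolated into the question
question = f"Interpret the SMB coefficient in the following Fama and French model:\n{model.summary()}"
chain_response = chain.invoke({"input": question})
print(chain_response.content)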
Expected output: a chain to which I can pass statistical output from Python and which returns an answer in the same structure every time.
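As a rough illustration of the prompt text I am aiming for, here is a plain-Python stand-in that assembles examples and a final question into one string. This is only a sketch under my own assumptions: `EXAMPLE_LAYOUT` and `build_few_shot_prompt` are hypothetical names, the explicit "Answer:" label is something I added, and the placeholder strings stand in for real summary tables. It does not claim to reproduce how `FewShotPromptTemplate` works internally.

```python
# Every example and the final question share one layout, so the model
# sees the same structure repeatedly and is asked to continue it.
EXAMPLE_LAYOUT = "Question: {instruction}\n{statistical_model}\nAnswer:\n{answer}"


def build_few_shot_prompt(examples, instruction, statistical_model, separator="\n\n"):
    """Join the formatted examples with a final question whose answer is left open."""
    shots = [EXAMPLE_LAYOUT.format(**example) for example in examples]
    # The final block repeats the layout with an empty answer slot
    query = EXAMPLE_LAYOUT.format(
        instruction=instruction,
        statistical_model=statistical_model,
        answer="",
    ).rstrip()
    return separator.join(shots + [query])


# Placeholder content; real use would pass str(model.summary()) and a written answer
examples = [
    {
        "instruction": "Interpret the following regression model:",
        "statistical_model": "<summary table of a fitted model>",
        "answer": "<interpretation in the fixed structure>",
    }
]
prompt_text = build_few_shot_prompt(
    examples,
    "Interpret the following regression model:",
    "<summary table of a new model>",
)
print(prompt_text)
```

The point of the sketch is that the new model's summary lands in the same slot as the example summaries, with the answer left for the LLM to complete.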