I’m trying to write Python code that gives me the p-value for the occurrence of specific GO terms among the significantly changed proteins from a proteomic analysis that was done for us and delivered as an Excel table. I also want every bar that belongs to a significantly common GO term to be colored orange.
The Excel file I based it on has the columns protein ID / biological process / cellular component / molecular function, and each cell contains GO terms separated by ‘;’. I chose to focus on proteins from one specific cellular component, so the cellular component (CC) GO term column was irrelevant.
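To make the table layout concrete, here is a minimal sketch of how such ‘;’-separated cells can be split into one row per protein/term pair. The two-row table, the protein IDs, the GO term strings and the helper name `explode_terms` are made up for illustration; only the column names follow the ones used in my code.

```python
import pandas as pd

# Hypothetical toy table; the real data comes from the Excel file.
toy = pd.DataFrame({
    "Protein ID": ["P1", "P2"],
    "GO biological process": ["transport; signaling", "transport"],
})

def explode_terms(df, column, id_col="Protein ID"):
    """Return one row per (protein, GO term) pair, stripped and de-duplicated."""
    long = df[[id_col, column]].dropna().copy()
    long[column] = long[column].str.split(";")   # each cell becomes a list of terms
    long = long.explode(column)                   # one row per list element
    long[column] = long[column].str.strip()       # remove spaces around each term
    long = long[long[column] != ""]               # drop empty terms from trailing ';'
    return long.drop_duplicates().reset_index(drop=True)

pairs = explode_terms(toy, "GO biological process")
# number of proteins annotated with each term
term_counts = pairs["GO biological process"].value_counts()
```

Counting on the de-duplicated pairs means each protein contributes at most once per term, which is probably what a per-term count should mean here.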
Bottom line:
I’m not sure whether the bar figure I created properly represents the GO terms that appear significantly more often within the group of significantly changed proteins, or whether the computation itself says anything meaningful as a GO term analysis figure for my thesis.
Is basing the analysis on this table, the way I’ve chosen, an acceptable approach that gives meaningful conclusions for this kind of data?
And is this code too complicated, or can it be reduced somehow?
Here’s the current messy code. It produces a good-looking graph, but I don’t know whether it is statistically sound for showing which GO terms are significant in the list of significant proteins:
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
import numpy as np
from scipy.stats import norm
# reading the data from Excel file
file_path = '123.xlsx'
data = pd.read_excel(file_path)
# columns names
biological_process_col = 'GO biological process'
molecular_function_col = 'GO molecular function'
protein_names_col = 'Protein ID'
p_value_col = "Student's T-test p-value KO-vs-WT_"
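# checking that the expected columns exist in the DataFrame
expected_columns = [biological_process_col, molecular_function_col, protein_names_col, p_value_col]
missing_columns = [col for col in expected_columns if col not in data.columns]
if missing_columns:
    raise KeyError(f"Missing columns in {file_path}: {missing_columns}")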
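# extracting GO terms from a column; each cell holds terms separated by ';'
def extract_terms(column):
    terms = []
    for entry in data[column].dropna():
        # strip whitespace so ' term' and 'term' are counted as the same term
        terms.extend(term.strip() for term in entry.split(';') if term.strip())
    return terms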
biological_process_terms = extract_terms(biological_process_col)
molecular_function_terms = extract_terms(molecular_function_col)
# counting the occurrences of each term
bp_counts = Counter(biological_process_terms)
mf_counts = Counter(molecular_function_terms)
# extracting significant proteins
significant_proteins = data[data[p_value_col] < 0.05][[protein_names_col, p_value_col]]
# normalizing p-values to [0, 1] range for coloring
min_p_value = significant_proteins[p_value_col].min()
max_p_value = significant_proteins[p_value_col].max()
normalized_p_values = (significant_proteins[p_value_col] - min_p_value) / (max_p_value - min_p_value)
# creating text for significant proteins, one "ID: p-value" entry per line
significant_text = "\n".join([f"{row[protein_names_col]}: {row[p_value_col]:.4f}" for _, row in significant_proteins.iterrows()])
# function to create and save plots
def plot_go_terms(term_counts, category, color, significant_proteins):
    plt.figure(figsize=(100, 100))
    bars = plt.bar(term_counts.keys(), term_counts.values(), color=color)
    plt.xlabel('GO Terms', fontsize=30)
    plt.ylabel('Count in significantly changed X-related proteins', fontsize=150)
    plt.title(f'{category} GO Term Occurrences in (brain part) of (some genotype)', fontsize=30)
    plt.xticks(rotation=90, fontsize=30)
    # calculate mean, std, and threshold (mean + 1.645 * std of the term counts)
    counts = list(term_counts.values())
    mean_count = np.mean(counts)
    std_count = np.std(counts)
    threshold = mean_count + norm.ppf(0.95) * std_count
    # add a horizontal line at the threshold
    plt.axhline(y=threshold, color='r', linestyle='--', linewidth=2)
    plt.text(len(term_counts) - 1, threshold, f'Threshold: {threshold:.2f}', color='r', fontsize=30, verticalalignment='bottom')
    # highlight bars that exceed the threshold
    for bar in bars:
        if bar.get_height() > threshold:
            bar.set_color('orange')
    # adjust layout and save the figure
    plt.tight_layout(rect=[0, 0, 0.9, 1])
    plt.savefig(f'{category.lower().replace(" ", "_")}_go_terms_analysis_figure.eps')
    plt.savefig(f'{category.lower().replace(" ", "_")}_go_terms_analysis_figure.png')
    plt.show()
# plotting Biological Process GO terms
plot_go_terms(bp_counts, 'Biological Process', 'blue', significant_proteins)
# plotting Molecular Function GO terms
plot_go_terms(mf_counts, 'Molecular Function', 'red', significant_proteins)
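For comparison with the mean-plus-standard-deviation threshold used above, a per-term p-value is commonly obtained from an over-representation test such as the hypergeometric (one-sided Fisher) test. It compares how often a term occurs among the significant proteins with how often it occurs among all measured proteins. The sketch below is not part of the code above: the function name `term_enrichment_pvalue` and the toy numbers are assumptions for illustration, and it presumes a background list of all measured proteins is available.

```python
from scipy.stats import hypergeom

def term_enrichment_pvalue(n_background, n_background_with_term,
                           n_significant, n_significant_with_term):
    """One-sided hypergeometric p-value for over-representation of one GO term.

    Returns P(X >= n_significant_with_term) when n_significant proteins are
    drawn without replacement from n_background proteins, of which
    n_background_with_term carry the term.
    """
    # sf(k - 1) is P(X > k - 1), i.e. P(X >= k)
    return hypergeom.sf(n_significant_with_term - 1,
                        n_background,
                        n_background_with_term,
                        n_significant)

# Toy numbers (made up): 1000 measured proteins, 50 annotated with the term,
# 100 significant proteins, 15 of which carry the term.
p = term_enrichment_pvalue(1000, 50, 100, 15)
```

When many GO terms are tested this way, the resulting p-values would normally be adjusted for multiple testing (for example with Benjamini-Hochberg) before deciding which bars to color.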