I’m trying to create a Pareto chart to obtain information about software repositories. Specifically, I want to verify whether the distribution of commits across developers follows the Pareto principle. I wrote the following code in R:
library(dplyr)
library(ggplot2)

# Function to create Pareto chart and save as PDF
create_pareto_chart <- function(input_file, output_dir) {
  # Read the CSV file
  data <- read.csv(input_file)
  # Extract project name from file path
  project_name <- basename(input_file)
  project_name <- sub("\\.csv$", "", project_name)
  # Calculate commit counts per author
  commit_counts <- data %>%
    count(author) %>%
    arrange(desc(n))
  # Ensure there are no missing or NA authors
  commit_counts <- commit_counts %>% filter(!is.na(author) & author != "")
  # Calculate cumulative percentage of commits
  total_commits <- sum(commit_counts$n)
  commit_counts <- commit_counts %>%
    mutate(cumulative_sum = cumsum(n),
           cumulative_percentage = 100 * cumulative_sum / total_commits)
  # Print the cumulative percentage values
  print(commit_counts)
  # Create Pareto chart
  p <- ggplot(commit_counts, aes(x = reorder(author, -n), y = n)) +
    geom_bar(stat = "identity", fill = "orange") +
    geom_line(aes(y = cumulative_percentage * max(n) / 100, group = 1), color = "red") +
    geom_point(aes(y = cumulative_percentage * max(n) / 100), color = "red") +
    geom_text(aes(y = cumulative_percentage * max(n) / 100, label = round(cumulative_percentage, 1)),
              vjust = -0.5, size = 3, color = "blue") + # Add labels to the cumulative line
    geom_hline(yintercept = max(commit_counts$n) * 0.8, color = "red", linetype = "dashed") +
    scale_y_continuous(
      sec.axis = sec_axis(~ . * 100 / max(commit_counts$n), name = "Cumulative Percentage of Commits")
    ) +
    labs(title = paste("Pareto Chart of Commits per Author (", project_name, ")", sep = ""),
         x = "Authors (anonymized)",
         y = "Number of Commits") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
  # Save the plot as PDF
  pdf(file = file.path(output_dir, paste(project_name, ".pdf", sep = "")), width = 14, height = 7)
  print(p)
  dev.off()
}
However, I noticed that the cumulative percentage line in the resulting chart shows unexpected fluctuations, even though a cumulative percentage should never decrease.
I printed the same values in the console, but I couldn’t identify any strange fluctuations.
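One check I can think of is to compare the row order used by `cumsum()` with the x-axis order that `reorder()` produces. The following is a minimal sketch on made-up data, not my real repository data: `toy`, its author names, and its counts are all hypothetical, and I have not confirmed that this is what happens in my case.

```r
# Hypothetical toy data, already sorted by descending commit count,
# with three authors tied at 5 commits.
toy <- data.frame(
  author = c("c", "d", "b", "a"),
  n = c(10, 5, 5, 5),
  stringsAsFactors = FALSE
)

# Cumulative percentage in row order (the order cumsum() sees).
cum_pct <- 100 * cumsum(toy$n) / sum(toy$n)

# Order of the rows versus order of the x-axis levels built by reorder().
row_order <- toy$author
axis_order <- levels(reorder(toy$author, -toy$n))

# Cumulative values as they would appear from left to right on the x axis.
plotted <- cum_pct[match(axis_order, row_order)]

print(data.frame(row_order, axis_order, plotted))
print(identical(row_order, axis_order))  # do the two orders agree?
print(all(diff(plotted) >= 0))           # is the plotted line non-decreasing?
```

If the two orders differ, the cumulative values are drawn at x positions in a different order from the one in which they were accumulated. That would show up as fluctuations in the line even though the printed table looks monotonic.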