I have a PySpark dataframe that I converted to pandas so I could plot it with matplotlib. The bar plot shows all bars at equal height even though the y values are different.
Since the data is small, I also put the x and y values in lists and plotted those with matplotlib, which gave the expected result. I am not sure whether the error is due to the datatype in the pandas dataframe (float64).
pandas_df.dtypes
display(pandas_df )
customer_id=pandas_df['customer_id']
price=pandas_df['sum(price)']
plt.bar(customer_id,sum(price))
plt.show()
y=[4260.0,4440.0,2400.0,1200.0,2040.0]
x=['A','B','C','D','E']
print(type(y[0]))
plt.bar(x,y)
plt.show()
The problem I can see is that you are passing sum(price) to plt.bar(). Python's built-in sum() adds up all the values in price and returns a single number, so every customer_id gets a bar of that same height. The float64 dtype is not the cause.
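To see why the heights collapse, here is a minimal sketch using the list values from the question in place of the pandas column (the name total is only for illustration):

```python
# Same numbers as the y list in the question.
y = [4260.0, 4440.0, 2400.0, 1200.0, 2040.0]

# The built-in sum() reduces the whole sequence to one number.
total = sum(y)

print(total)        # 14340.0
print(type(total))  # a single float, not a list of five heights
```

A pandas Series behaves the same way here: sum(price) iterates over the values and returns one total.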
Here is the corrected code:
pandas_df.dtypes
display(pandas_df )
customer_id=pandas_df['customer_id']
price=pandas_df['sum(price)']
plt.bar(customer_id, price)  # pass the price column itself; sum(price) would collapse it into one total
plt.show()
y=[4260.0,4440.0,2400.0,1200.0,2040.0]
x=['A','B','C','D','E']
print(type(y[0]))
plt.bar(x,y)
plt.show()
Hope it solves your problem.
The problem lies in the line plt.bar(customer_id, sum(price)).
Let’s explain it using the list data:
y=[4260.0,4440.0,2400.0,1200.0,2040.0]
x=['A','B','C','D','E']
Here x stands for the customer_id series you got with
customer_id=pandas_df['customer_id']
and y stands for the price series you got with
price=pandas_df['sum(price)']
With customer_id and price, what you actually called is equivalent to plt.bar(x, sum(y)). sum(y) is a single number (the sum of the whole list), and matplotlib uses that one value as the height of every bar, hence the equal heights.
Try the following to plot the price against customer_id:
plt.bar(customer_id, price)
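If you want to check the difference without looking at the plot, the sketch below draws both versions side by side and reads the bar heights back from matplotlib. It uses the list data from the question rather than pandas_df, and the Agg backend and the variable names (ax_wrong, ax_right, and so on) are assumptions made only so the example runs on its own:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

x = ['A', 'B', 'C', 'D', 'E']
y = [4260.0, 4440.0, 2400.0, 1200.0, 2040.0]

fig, (ax_wrong, ax_right) = plt.subplots(1, 2)

# Scalar height: matplotlib applies the same value to every bar.
wrong_bars = ax_wrong.bar(x, sum(y))
wrong_heights = [float(bar.get_height()) for bar in wrong_bars]

# One height per category: each bar gets its own value.
right_bars = ax_right.bar(x, y)
right_heights = [float(bar.get_height()) for bar in right_bars]

print(wrong_heights)  # five identical heights
print(right_heights)  # the original y values
```

In an interactive session you would drop the matplotlib.use("Agg") line and call plt.show() to see the two charts.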