Imagine having this dataset:
Country Energy_Source Twh Tot
Italy Biofuel 24.5 100
Italy Nuclear 15.4 100
Italy Gas 40.1 100
Italy Hydro 20.0 100
France Biofuel 20.0 120
France Nuclear 75.0 120
France Gas 10.0 120
France Hydro 4.3 120
France Wind 10.7 120
Note: Tot is the sum of Twh by Country.
dataset1 <- data.frame(
"Country" = c(rep(x = "Italy", times = 4), rep(x = "France", times = 5)),
"Energy_Source" = c("Biofuel", "Nuclear", "Gas", "Hydro", "Biofuel", "Nuclear", "Gas", "Hydro", "Wind"),
"Twh" = c(25, 15, 40, 20, 20, 75, 10, 5, 10),
"Tot" = c(rep(x = 100, times = 4), rep(x = 120, times = 5))
)
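As a quick sanity check on the note above, this base R sketch confirms that Tot matches the per-country sum of Twh in dataset1. The name country_sum is introduced here only for illustration.

```r
dataset1 <- data.frame(
  Country = c(rep("Italy", 4), rep("France", 5)),
  Energy_Source = c("Biofuel", "Nuclear", "Gas", "Hydro",
                    "Biofuel", "Nuclear", "Gas", "Hydro", "Wind"),
  Twh = c(25, 15, 40, 20, 20, 75, 10, 5, 10),
  Tot = c(rep(100, 4), rep(120, 5))
)

# ave() returns, for every row, the sum of Twh within that row's Country.
# country_sum is a hypothetical helper name, not part of the original post.
country_sum <- ave(dataset1$Twh, dataset1$Country, FUN = sum)

# TRUE when the stored Tot column agrees with the recomputed sums.
all(country_sum == dataset1$Tot)
```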
Now, we want ggplot2 to interpret dataset1 as if it were the following dataset2, without performing a pivot_longer on dataset1.
Here is the new dataset2, which represents exactly the same information as dataset1 but with duplicated rows, so that ggplot2 interprets the occurrences of each element as a proportion:
Country Energy_Source Twh Tot
Italy Biofuel 25 100
Italy Biofuel 25 100
Italy Biofuel 25 100
.
.
. (22 more rows)
Italy Nuclear 15 100
. (14 more rows)
Italy Gas 40 100
. (etcetera)
dataset2 <- data.frame(
"Country" = c(rep(x = "Italy", times = 100), rep(x = "France", times = 120)),
"Energy_Source" = c(rep(x = "Biofuel", times = 25), rep(x = "Nuclear", times = 15),
rep(x = "Gas", times = 40), rep(x = "Hydro", times = 20), rep(x = "Biofuel", times = 20),
rep(x = "Nuclear", times = 75), rep(x = "Gas", times = 10), rep(x = "Hydro", times = 5),
rep(x = "Wind", times = 10)),
"Tot" = c(rep(x = 100, times = 100), rep(x = 120, times = 120))
)
Now, normally we would use the following code to draw the bar plot:
library(ggplot2)

ggplot(data = dataset2, mapping = aes(
  x = Tot,
  y = reorder(Country, Tot),
  fill = Energy_Source
)) +
  geom_col()
See here: [image: the resulting stacked bar chart]
But is it possible to use dataset1, and not dataset2, to create the same graph with ggplot2?
In other terms:
How to fill a barplot, not according to the occurrences of an element in a dataset, but according to its value in another variable?
Thanks!
I tried performing a pivot_longer from the tidyr package, but it was too costly for my Shiny app.
Here are two ways to recreate your plot using dataset1.
- Scale in proportion to Twh. This seems simplest and most efficient, provided you don’t need the visible bars to be composed of many stacked smaller bars.
ggplot(dataset1, aes(
x = Tot*Twh,
y = reorder(Country, Tot),
fill = Energy_Source
)) +
geom_col()
- tidyr::uncount is what you want if you want to make copies of each observation. This replicates your dataset2 approach. I have added borders to show how this makes many small bars that are stacked together. This approach is fine here, but I’ve had issues where it plots very slowly (e.g. with >100k observations), messily (e.g. if the borders overwhelm the areas or create moiré effects), or inefficiently (e.g. a vector format like PDF saves a separate object for each bar plotted, even if it is <1 pixel).
ggplot(dataset1 |> tidyr::uncount(Twh), aes(
x = Tot,
y = reorder(Country, Tot),
fill = Energy_Source
)) +
geom_col(color = "gray50")
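To check that the two approaches above draw segments of the same length, here is a minimal base R sketch. It emulates the row duplication with plain row indexing rather than calling tidyr, on the assumption that this matches what uncount does for whole-number weights. The names direct, expanded, stacked and both are illustrative only.

```r
dataset1 <- data.frame(
  Country = c(rep("Italy", 4), rep("France", 5)),
  Energy_Source = c("Biofuel", "Nuclear", "Gas", "Hydro",
                    "Biofuel", "Nuclear", "Gas", "Hydro", "Wind"),
  Twh = c(25, 15, 40, 20, 20, 75, 10, 5, 10),
  Tot = c(rep(100, 4), rep(120, 5))
)

# Approach 1: one row per segment, with the length computed directly.
direct <- transform(dataset1, segment = Tot * Twh)

# Approach 2, emulated: repeat each row Twh times, then sum Tot within each
# Country/Energy_Source pair, which is what stacking the small bars amounts to.
expanded <- dataset1[rep(seq_len(nrow(dataset1)), times = dataset1$Twh), ]
stacked <- aggregate(Tot ~ Country + Energy_Source, data = expanded, FUN = sum)
names(stacked)[names(stacked) == "Tot"] <- "segment_stacked"

# Line the two results up by their keys and compare the segment lengths.
both <- merge(direct, stacked, by = c("Country", "Energy_Source"))
all.equal(both$segment, both$segment_stacked)

# The duplicated version needs many more rows for the same picture.
c(direct = nrow(direct), expanded = nrow(expanded))
```

The comparison shows the trade-off in numbers: the picture is the same, but the duplicated data has sum(Twh) rows instead of one row per segment.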
Edit: For the OP’s description of “sequential” data, I think it’s more efficient computationally and cleaner to plot if you calculate the aggregations (here, the average usage across years of energy source per country) with a dplyr step.
Compare a version of what’s in the OP’s suggested answer:
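library(dplyr)
library(tidyr)

rbind(dataset1, dataset1b) |>
  mutate(across(Twh, ~ .x * 10)) |>  # scale the decimals to whole numbers
  uncount(Twh) |>                    # one row per scaled unit of Twh
  ggplot(aes(
    x = Tot / (10 * 2),              # undo the x10 scaling and average over 2 years
    y = reorder(Country, Tot),
    fill = Energy_Source
  )) +
  geom_col()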
…to a version using dplyr. The two look superficially the same, except that the uncount version plots 4,400 observations, whereas the dplyr version plots just the 9 contiguous bars we can see.
rbind(dataset1, dataset1b) |>
summarize(Total = mean(Tot * Twh), .by = c(Country, Energy_Source)) |>
ggplot(aes(
x = Total,
y = reorder(Country, Total),
fill = Energy_Source
)) +
geom_col()
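If loading dplyr is itself a concern (for example in a Shiny app), the same aggregation can be sketched in base R. This is only an illustration, not part of the original answer. The data values are copied from the OP's two-year example, and the names y2012, y2013, both_years and averaged are made up for this sketch.

```r
countries <- c(rep("Italy", 4), rep("France", 5))
sources <- c("Biofuel", "Nuclear", "Gas", "Hydro",
             "Biofuel", "Nuclear", "Gas", "Hydro", "Wind")
tot <- c(rep(100, 4), rep(120, 5))

# Two yearly snapshots with decimal Twh values.
y2012 <- data.frame(Country = countries, Energy_Source = sources,
                    Twh = c(24.3, 15.7, 40.0, 20.0, 19.1, 75.9, 10.0, 5.0, 10.0),
                    Tot = tot, Year = 2012)
y2013 <- data.frame(Country = countries, Energy_Source = sources,
                    Twh = c(25.0, 15.0, 40.0, 20.8, 19.2, 75.0, 10.0, 5.0, 10.0),
                    Tot = tot, Year = 2013)
both_years <- rbind(y2012, y2013)

# Same quantity as summarize(Total = mean(Tot * Twh), .by = ...):
# compute the per-row product, then average it within each group.
both_years$Total <- both_years$Tot * both_years$Twh
averaged <- aggregate(Total ~ Country + Energy_Source,
                      data = both_years, FUN = mean)

averaged  # one row per Country/Energy_Source pair, ready for geom_col()
```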
For reference, your example plot with the same dimensions: [image]
Jon Spring's answer is great.
Now, to go further: if you have a sequential dataset (here, several years) whose variables hold decimal values that you do not want to round, here is how to deal with it:
Example dataset:
dataset1 <- data.frame(
"Country" = c(rep(x = "Italy", times = 4), rep(x = "France", times = 5)),
"Energy_Source" = c("Biofuel", "Nuclear", "Gas", "Hydro", "Biofuel", "Nuclear", "Gas", "Hydro", "Wind"),
"Twh" = c(24.3, 15.7, 40.0, 20.0, 19.1, 75.9, 10.0, 5.0, 10.0),
"Tot" = c(rep(x = 100, times = 4), rep(x = 120, times = 5)),
"Year" = c(rep(x = 2012, times = 9))
)
dataset1b <- data.frame(
"Country" = c(rep(x = "Italy", times = 4), rep(x = "France", times = 5)),
"Energy_Source" = c("Biofuel", "Nuclear", "Gas", "Hydro", "Biofuel", "Nuclear", "Gas", "Hydro", "Wind"),
"Twh" = c(25.0, 15.0, 40.0, 20.8, 19.2, 75.0, 10.0, 5.0, 10.0),
"Tot" = c(rep(x = 100, times = 4), rep(x = 120, times = 5)),
"Year" = c(rep(x = 2013, times = 9))
)
dataset1 <- rbind(dataset1, dataset1b)
head(dataset1, n = 10)
## Country Energy_Source Twh Tot Year
## 1 Italy Biofuel 24.3 100 2012
## 2 Italy Nuclear 15.7 100 2012
## 3 Italy Gas 40.0 100 2012
## 4 Italy Hydro 20.0 100 2012
## 5 France Biofuel 19.1 120 2012
## 6 France Nuclear 75.9 120 2012
## 7 France Gas 10.0 120 2012
## 8 France Hydro 5.0 120 2012
## 9 France Wind 10.0 120 2012
## 10 Italy Biofuel 25.0 100 2013
Note: Tot represents the sum(Twh) by Country.
First, multiply the Twh column by a power of ten that turns all of the decimals into whole numbers. In the example, none of the values has more than one digit after the decimal point, so we multiply Twh by 10. Let's call it multiplicator.
multiplicator <- 10
# round() removes any floating-point residue so that every weight is a whole number
dataset1$Twh <- round(dataset1$Twh * multiplicator)
head(dataset1)
## Country Energy_Source Twh Tot Year
## 1 Italy Biofuel 243 100 2012
## 2 Italy Nuclear 157 100 2012
## 3 Italy Gas 400 100 2012
## 4 Italy Hydro 200 100 2012
## 5 France Biofuel 191 120 2012
## 6 France Nuclear 759 120 2012
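Rather than reading the number of decimals off the data by eye, the multiplier could be searched for. The helper below is a hypothetical sketch (find_multiplier, max_digits and tol are names invented here, not from any package): it tries increasing powers of ten until every value is a whole number within a small tolerance.

```r
# Hypothetical helper: smallest power of ten that makes all of x whole numbers.
find_multiplier <- function(x, max_digits = 6, tol = 1e-8) {
  for (d in 0:max_digits) {
    scaled <- x * 10^d
    # Compare each value with its nearest integer; the tolerance absorbs
    # floating-point noise introduced by the multiplication.
    if (all(abs(scaled - round(scaled)) < tol)) {
      return(10^d)
    }
  }
  stop("no multiplier found within max_digits decimal places")
}

twh <- c(24.3, 15.7, 40.0, 20.0, 19.1, 75.9, 10.0, 5.0, 10.0)
multiplicator <- find_multiplier(twh)
multiplicator
```

Keep in mind that each extra decimal place multiplies the number of duplicated rows by ten, so this approach gets expensive quickly.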
Second, we apply tidyr::uncount() to dataset1 with Twh as the weights argument: each row is duplicated as many times as its Twh value, and the Twh column itself is dropped.
dataset1 <- dataset1 |> tidyr::uncount(Twh)
head(dataset1)
## Country Energy_Source Tot Year
## 1 Italy Biofuel 100 2012
## 2 Italy Biofuel 100 2012
## 3 Italy Biofuel 100 2012
## 4 Italy Biofuel 100 2012
## 5 Italy Biofuel 100 2012
## 6 Italy Biofuel 100 2012
Finally, we plot the data using ggplot2 as follows:
ggplot(dataset1, aes(
x = Tot / (multiplicator * length(unique(dataset1$Year))),
y = reorder(Country, Tot),
fill = Energy_Source
)) +
geom_col()
See: [image: the resulting plot]
ggplot2 derives the size of each segment from the number of rows for each element. Both multiplicator and the non-grouping column Year inflate that row count, so we divide Tot by (multiplicator * length(unique(dataset1$Year))).
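To see why that divisor works, here is the arithmetic for a single segment (Italy, Biofuel), using the values from the example dataset. The variable names n_years, n_rows and segment are illustrative only.

```r
multiplicator <- 10
n_years <- 2              # 2012 and 2013
tot <- 100                # Tot for Italy
twh <- c(24.3, 25.0)      # Italy, Biofuel in 2012 and 2013

# After scaling and duplicating, this segment is made of this many rows.
n_rows <- sum(round(twh * multiplicator))

# Each duplicated row contributes Tot / (multiplicator * n_years) to the bar.
segment <- n_rows * tot / (multiplicator * n_years)

# The stacked segment equals the average of Tot * Twh across the years.
all.equal(segment, mean(tot * twh))
```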