I am new to clustering. I want to cluster series that are well correlated with one another. To start with an example:
set.seed(0)
r1 <- rnorm(1000)
r2 <- rnorm(1000)
e <- rnorm(1000)
d <- data.frame(r1=r1, r2=r2, e=e, r4=rnorm(1000),
                r.is.r1=r1, r.is.almost.r1=r1+rnorm(1000)/100,
                r12=r1*0.75+r2*0.25,
                r21=r1*0.25+r2*0.75,
                r21e=r1*0.25+r2*0.75+e/10,
                r21ee=r1*0.25+r2*0.75+e/2)
print(round(cor(d),2))
plot( hclust( dist( t(d) ), method="centroid" ) )
which has the following correlations:
                  r1    r2     e    r4 r.is.r1 r.is.almost.r1   r12   r21  r21e r21ee
r1              1.00 -0.01  0.01 -0.05    1.00           1.00  0.94  0.30  0.29  0.26
r2             -0.01  1.00  0.02 -0.02   -0.01          -0.01  0.32  0.95  0.94  0.82
e               0.01  0.02  1.00 -0.01    0.01           0.01  0.02  0.02  0.14  0.52
r4             -0.05 -0.02 -0.01  1.00   -0.05          -0.05 -0.05 -0.04 -0.04 -0.03
r.is.r1         1.00 -0.01  0.01 -0.05    1.00           1.00  0.94  0.30  0.29  0.26
r.is.almost.r1  1.00 -0.01  0.01 -0.05    1.00           1.00  0.94  0.30  0.29  0.26
r12             0.94  0.32  0.02 -0.05    0.94           0.94  1.00  0.59  0.59  0.51
r21             0.30  0.95  0.02 -0.04    0.30           0.30  0.59  1.00  0.99  0.87
r21e            0.29  0.94  0.14 -0.04    0.29           0.29  0.59  0.99  1.00  0.92
r21ee           0.26  0.82  0.52 -0.03    0.26           0.26  0.51  0.87  0.92  1.00
and a dendrogram, which intuitively seems good, except that negatively correlated series (such as r4 and e) shouldn’t be connected. I can live with that, but would love to get rid of it. I don’t really understand what the height is, other than that more correlated series join at lower heights and a perfect correlation sits at zero.
My first wish for the plot is to put a measure of the correlation in small lettering at the branch points of the tree, for example 100% between r1 and r.is.r1. Is this possible?
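To make this first wish concrete, here is a minimal sketch of one way it might be done. It makes assumptions that differ from the code above: it clusters on 1 - correlation with average linkage rather than Euclidean distance with centroid linkage, so a merge at height h corresponds to an average correlation of 1 - h, and negatively correlated series sit above height 1. The names `hc`, `leaf_x`, `node_x`, and `labs` are illustrative only, and the label positions assume that `plot.hclust` draws each branch point midway between its two children.

```r
set.seed(0)
r1 <- rnorm(1000)
r2 <- rnorm(1000)
e  <- rnorm(1000)
d <- data.frame(r1 = r1, r2 = r2, e = e, r4 = rnorm(1000),
                r.is.r1 = r1, r.is.almost.r1 = r1 + rnorm(1000) / 100,
                r12   = r1 * 0.75 + r2 * 0.25,
                r21   = r1 * 0.25 + r2 * 0.75,
                r21e  = r1 * 0.25 + r2 * 0.75 + e / 10,
                r21ee = r1 * 0.25 + r2 * 0.75 + e / 2)

# Assumption: dissimilarity = 1 - correlation (0 = perfectly correlated,
# 1 = uncorrelated, 2 = perfectly negatively correlated).
hc <- hclust(as.dist(1 - cor(d)), method = "average")
plot(hc, ylab = "1 - correlation")

# x position of every leaf, then of every merge (midpoint of its children).
n <- length(hc$order)
leaf_x <- match(seq_len(n), hc$order)
node_x <- numeric(n - 1)
for (i in seq_len(n - 1)) {
  # In hc$merge, negative entries are leaves, positive entries earlier merges.
  xs <- sapply(hc$merge[i, ], function(j) if (j < 0) leaf_x[-j] else node_x[j])
  node_x[i] <- mean(xs)
}

# Write the implied correlation, as a percentage, just above each merge.
labs <- sprintf("%.0f%%", 100 * (1 - hc$height))
text(node_x, hc$height, labels = labs, cex = 0.6, pos = 3)
```

With single linkage the label would instead be the highest pairwise correlation between the two merged groups, so the choice of linkage changes what the percentage means.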
My second wish is to cut the tree (e.g., leaving only r4, e, r12, r21ee, r2 plus the two clusters, A being r.is.almost.r1, r1, r.is.r1 and B being r21, r21e) and name the end branches. (My real application has hundreds of series, so I will need to cut it.)
Somehow the plot needs to decide which of the three series in A, or the two series in B, to put at the branch end. One option would be to paste all three into one string, which works for small trees. Another option would be to figure out which series in the cluster seems most “central” and just name that one. A final option would be for me to designate some series as better names than others (e.g., a preference for naming r1 over r.is.r1 and over r.is.almost.r1).
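As a rough illustration of the second wish, the sketch below cuts a tree with `cutree` and computes the three kinds of names described above. It rests on several assumptions that are not part of the code above: a 1 - correlation dissimilarity with average linkage, an arbitrary cut at a correlation of 0.98, "central" read as the highest mean correlation with the other members of the cluster, and a hypothetical preference vector `prefer`.

```r
set.seed(0)
r1 <- rnorm(1000)
r2 <- rnorm(1000)
e  <- rnorm(1000)
d <- data.frame(r1 = r1, r2 = r2, e = e, r4 = rnorm(1000),
                r.is.r1 = r1, r.is.almost.r1 = r1 + rnorm(1000) / 100,
                r12   = r1 * 0.75 + r2 * 0.25,
                r21   = r1 * 0.25 + r2 * 0.75,
                r21e  = r1 * 0.25 + r2 * 0.75 + e / 10,
                r21ee = r1 * 0.25 + r2 * 0.75 + e / 2)

cm <- cor(d)
hc <- hclust(as.dist(1 - cm), method = "average")

# Cut where the average correlation between merged groups falls below 0.98
# (an arbitrary threshold chosen for this example).
grp <- cutree(hc, h = 1 - 0.98)
members <- split(names(grp), grp)

# Option 1: paste all member names into one string.
pasted <- sapply(members, paste, collapse = "+")

# Option 2: the most "central" member, taken here as the one with the
# highest mean correlation within its own cluster (ties go to the first).
central <- sapply(members, function(m)
  m[which.max(rowMeans(cm[m, m, drop = FALSE]))])

# Option 3: a preference order; the first preferred name present wins,
# otherwise the first member is used. `prefer` is hypothetical.
prefer <- c("r1", "r21")
preferred <- sapply(members, function(m) c(intersect(prefer, m), m)[1])

print(data.frame(pasted, central, preferred))
```

On this example the cut leaves seven groups: five single series plus clusters A and B. For hundreds of series the threshold would need tuning, or `cutree(hc, k = ...)` could be used to ask for a fixed number of groups instead.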
I hope these are two easy questions…