Question 1
What’s the computational complexity of the Groovy unique()
method?
Question 2
How could I have figured it out by myself? The unique()
method is defined in the class DefaultGroovyMethods
. The source code can be found here: org.codehaus.groovy.runtime.DefaultGroovyMethods. Can you point me to the piece of code which illustrates the answer to question 1?
7
As Lernkurve pointed out in the comments, the implementation given in Groovy’s source code appears to be O(n^2).
public static <T> Collection<T> unique(Collection<T> self, boolean mutate) {
List<T> answer = new ArrayList<T>();
for (T t : self) { // << Outer loop over each item
boolean duplicated = false;
for (T t2 : answer) { // << Inner loop over each found item
if (coercedEquals(t, t2)) {
duplicated = true;
break;
}
}
if (!duplicated)
answer.add(t);
}
if (mutate) {
self.clear();
self.addAll(answer);
}
return mutate ? self : answer ;
}
The outer loop ensures that the time complexity is at least O(n), because it will grow linearly as the input list grows. Likewise, the inner loop taken by itself is O(n). For each item in the input list, it then loops over each item that it’s registered into the “answer” list to see if that item is already contained. Put them together and you’ve got an O(n^2) function.
As I pointed out in the comments, there are a few ways you might get better performance out of this, for large sets. If you’re willing to spend a bit more memory, you can use a Hash set of some kind to register whether or not an element has already been included. These generally have an O(1) lookup time to check if the current item is unique to the list.
Alternatively, sorting the input list (if it is sortable) with a good sort function (most of those included in a language’s standard library will be O(n log n)) can help, assuming you don’t care about the order of the “unique” output list. This way you can be sure that duplicate elements will always appear in sequence with each other, and it becomes trivial to reject them from the output.
Indeed complexity is O(n^2)
. For deeper explanation of how we arrived at O(n^2)
:
Pretend input is [1 2 3 4 5 6]
Outer loop iteration 1: answer == [ ]
, t == 1
Outer loop iteration 2: answer == [1]
, t == 2
Outer loop iteration 3: answer == [1, 2]
, t == 3
Outer loop iteration 4: answer == [1, 2, 3]
, t == 4
Outer loop iteration 5: answer == [1, 2, 3, 4]
, t == 5
Outer loop iteration 6: answer == [1, 2, 3, 4, 5]
, t == 6
Now sum the number of each inner loop iterations for each of the 6 outer loop iterations:
0 + 1 + 2 + 3 + 4 + 5 = 15
(5 + 0) + (4 + 1) + (2 + 3) = 15
(n/2) * (n - 1) = (n^2 - n)/2 -> O(n^2) (this is the tightest bound)
List<T> answer = new ArrayList<T>();
for (T t : self) {
boolean duplicated = false;
for (T t2 : answer) {
if (comparator.compare(t, t2) == 0) {
duplicated = true;
break;
}
}
if (!duplicated)
answer.add(t);
}
if (mutate) {
self.clear();
self.addAll(answer);
}
return mutate ? self : answer;
I argue that one can write her own algorithm for accomplishing the same thing but in O(n)
time, specifically using a Map
. First throw all the items into the Map
using the specified key, then iterate through all values in the Map
. Throwing items into Map
is O(n)
, iterating is also O(n)
, which equates to O(n)
(not O(2n)
since the growth is still linear). Not sure why Groovy developers didn’t optimize, unless I’m missing something.