I have created the following function that can generate a list of 0s and 1s (basically a bitstring) using randomization.
```python
import numpy as np

def generate_genome(length: int, max_individuals: int = 0) -> Genome:
    bits = None
    if max_individuals > 0:
        num_individuals = np.random.randint(1, max_individuals + 1)
        print(f'Will have maximum of {num_individuals} individuals')
        print('Controlled randomization')
        bits = np.full(length, 0)
        bits_flipped_ids = np.random.choice(range(0, length), size=num_individuals, replace=False)
        print(f'Following indices will be flipped: {sorted(bits_flipped_ids)}')
        np.put(a=bits, ind=bits_flipped_ids, v=1)
    else:
        print('Standard randomization')
        bits = np.random.choice([0, 1], size=length)
    genome = Genome(bits=bits.tolist())
    print(f'Genome bits: {genome.bits}')
    return genome
```
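The function returns a `Genome` object whose definition is not shown here (in a later snippet `generate_genome()` is also called as a method of that class). A minimal, hypothetical container that is compatible with the snippets in this question, including the set-based de-duplication further down, could look like this:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Genome:
    bits: tuple = ()

    def __post_init__(self):
        # Accept any sequence (e.g. a plain list from bits.tolist()) and store it
        # as a tuple so that Genome instances are hashable and can live in a set
        object.__setattr__(self, 'bits', tuple(self.bits))
```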
It supports two modes:
- without `max_individuals` – creates a bit-list of a specific length
- with `max_individuals` – creates a bit-list of a specific length, but also ensures that the number of `1`s does not exceed `max_individuals`. The placement of the `1`s is at random indices (restricted by the allowed length of the list, of course)
This is part of a genetic algorithm I am working on. This would produce samples similar to the ones below:
- without `max_individuals`, e.g. `generate_genome(10, 0)`:

  ```
  [0 1 1 0 0 1 1 1 1 0]
  [1 0 1 1 1 1 0 1 1 1]
  [0 1 0 1 0 1 1 0 0 0]
  [1 1 0 0 0 1 1 0 1 1]
  [0 1 0 0 1 0 0 1 0 0]
  ```

- with `max_individuals`, e.g. `generate_genome(10, 3)`:

  ```
  [0 0 0 0 0 0 0 0 0 1]  with bits flipped at [9]
  [0 0 0 1 0 1 0 1 0 0]  with bits flipped at [3 5 7]
  [0 1 0 0 0 1 0 1 0 0]  with bits flipped at [1 5 7]
  [0 1 0 1 0 0 0 0 0 0]  with bits flipped at [1 3]
  ```
My problem is that there is the possibility of generating the same bit-list more than once. This possibility increases the smaller `length` and `max_individuals` are. I would like to have control over that to see how it affects my algorithm, so I am looking for an efficient way to create a set of unique bit-lists using my function or, even better, a set of bit-lists where the number of unique lists can be controlled by a parameter.
I managed to simplify the `generate_genome()` function based on a suggestion:
```python
def generate_genome(length: int, max_individuals: int = 0):
    # For the given length (number of bits) the maximum (where all are 1s)
    # is (2^length - 1). The numpy.arange() stops at (stop - 1)
    bits_all_possible = np.arange(2**length)
    # Shuffle to introduce randomness
    np.random.shuffle(bits_all_possible)
    if max_individuals > 0:
        bits_all_possible = np.array([b for b in bits_all_possible if np.bitwise_count(b) <= max_individuals])
    # Pick a random index between 0 and the length of all possible bit-fields
    bits_selected = np.random.randint(0, len(bits_all_possible))
    # Use the index to select the number
    bits_number = bits_all_possible[bits_selected]
    # Convert the number to a bit-field, zero-padded to the requested length
    bits = [int(b) for b in bin(bits_number)[2:].zfill(length)]
    genome = Genome(bits=bits)  # bits is already a plain list here
    print(f'Genome bits: {genome.bits}')
    return genome
```
However, this is not a feasible solution in my case because we are talking about an encoding and not plain old binary numbers. What this means is that I can have e.g. 3121200 individuals. Accordingly, a genome can take any of pow(2, 3121200) possible values – a number far too large to enumerate.
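Just to put the scale into perspective (an illustration, not part of the original code):

```python
length = 64  # still tiny compared to length = 3121200
print(2 ** length)  # 18446744073709551616 possible genomes already
# Enumerating them all (e.g. with numpy.arange) would need on the order of 10**20 bytes
```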
Even if it worked, this simplification does not fix the issue of creating a unique list of bit-fields with `generate_genome()`. The straightforward (though not necessarily efficient) solution would be to create a list and iteratively call `generate_genome()`: every time a new genome is created, check whether it is already present in the list; if so, discard it, otherwise add it.
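A rough sketch of that list-based approach (reusing the `size`, `genome_length` and `max_individuals` names from the snippet below, and assuming `Genome` instances compare equal when their bits are equal) could look like this:

```python
genomes = []
while len(genomes) < size:
    genome = Genome.generate_genome(genome_length, max_individuals)
    # Linear scan over the list; the new genome is discarded if it already exists
    if genome not in genomes:
        genomes.append(genome)
```

The membership test on a list is linear in the number of genomes collected so far, which is one reason to prefer the `set`-based variant shown next.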
Currently I am testing Python's `set` data structure:
```python
if unique_genomes:
    # Using a set allows ignoring newly inserted elements if these are already present
    genomes = set()
    # Note: a `for genome_counter in range(size)` loop with `genome_counter -= 1`
    # does not retry in Python (the loop variable is reset on every iteration),
    # so a while loop is used to keep going until `size` unique genomes exist
    while len(genomes) < size:
        genome = Genome.generate_genome(genome_length, max_individuals)
        # If the genome is already in the set, the number of elements in the set
        # will not change and the while condition simply triggers another attempt
        genomes.add(genome)
    genomes = list(genomes)
```
The problem with any solution that revolves around repeatedly trying to insert a new genome until a certain total number of genomes is reached is that it may turn into an endless struggle to find the next genome that fits into the already existing gene pool: every time `generate_genome()` is called, there is an unknown probability that the resulting genome already exists, so in the worst case this becomes an endless loop. In the past I have added a termination criterion, namely how many attempts there should be before breaking. Here, this is not an option.
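For reference, a capped-attempts variant (with a hypothetical `max_attempts` parameter) would look roughly like this:

```python
genomes = set()
attempts = 0
max_attempts = 1000  # hypothetical cap on consecutive failed attempts
while len(genomes) < size and attempts < max_attempts:
    genome = Genome.generate_genome(genome_length, max_individuals)
    genomes_len_old = len(genomes)
    genomes.add(genome)
    if len(genomes) == genomes_len_old:
        # Duplicate genome: count the failed attempt and try again
        attempts += 1
    else:
        # A new unique genome was found: reset the failure counter
        attempts = 0
genomes = list(genomes)
```

The cap avoids an endless loop, but it may stop before `size` unique genomes have been found, which is exactly why it is not an option here.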
Answer (score 10):
Let `m` be the number of bits, `k` be the number of bitstrings you want to generate, and `r` be the maximum number of 1 bits in each bitstring (your `max_individuals` parameter).
Some cases to consider:
- No maximum limit on the number of 1 bits. In this case, the problem boils down to choosing `k` unique bitstrings of length `m`. You can do this by sampling `k` random integers from 0 up to `2^m - 1`. Here's the algorithm to do that very efficiently:

  ```python
  import random
  import numpy as np

  def int2bs(x, m):
      """ Convert an integer to a bitstring of length m """
      # round m up to a multiple of 8
      mbytes = (m + 7) & ~7
      xbytes = x.to_bytes(mbytes >> 3, "big")
      res = np.unpackbits(np.frombuffer(xbytes, np.uint8))
      return res[mbytes - m:]

  def rand_bitstrings(m, k):
      n = 2 ** m
      if k > n:
          raise ValueError(f"Can't choose more than {n} bitstrings")
      if k >= n // 3 or n <= 32:
          # Selecting a large fraction of the possible bitstrings:
          # faster to shuffle the whole list and take a subset
          all_bs = list(range(n))
          random.shuffle(all_bs)
          res = all_bs[:k]
      else:
          # Rejection sampling with a guaranteed success rate of >= 2/3 per trial
          seen = set()
          res = []
          while len(res) < k:
              bs = random.getrandbits(m)
              while bs in seen:
                  bs = random.getrandbits(m)
              seen.add(bs)
              res.append(bs)
      # Convert integers to bitstrings
      return [int2bs(x, m) for x in res]
  ```
  This is quite fast even for very large values of `m`; for `m = 3121200, k = 3`, this takes 19 milliseconds on my machine.
- Generating an `m`-length bitstring with exactly `r` 1 bits. In this case, we can do the following:

  ```python
  return tuple(random.sample([0, 1], m, counts=[m - r, r]))
  ```

  Then, we can put the returned `tuple` in a set to check for duplicates.
- Generating a bitstring with at most `r` 1-bits. I don't know of a good way to do this efficiently for general `r` and a uniform distribution amongst all bitstrings. However, here's an approach that will efficiently generate random bitstrings of length `m` with at most `r` 1-bits, albeit with a bias towards bitstrings with around `r/2` 1-bits:

  ```python
  return tuple(random.sample([0, 1], m, counts=[m, r]))
  ```

  As with #2, you can put the returned tuples in a set to check for duplicates.
For #2 and #3, to avoid the possibility of a long rejection-sampling delay, you can check if the total number of possible outputs is less than `3*k`; if it is, just shuffle all the possibilities and output the first `k` (as is done in #1). A sketch of this combined approach is shown below.
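As a minimal sketch of how #2 and this fallback could be combined (the function and parameter names here are illustrative, not from the answer above):

```python
import math
import random
from itertools import combinations

def unique_bitstrings_exact(m, k, r):
    """Return k unique length-m bitstrings (as tuples), each with exactly r ones."""
    total = math.comb(m, r)  # number of distinct bitstrings with exactly r ones
    if k > total:
        raise ValueError(f"Only {total} bitstrings with exactly {r} ones exist")
    if total < 3 * k:
        # Small output space: enumerate every possibility, shuffle, take the first k
        all_bs = []
        for ones in combinations(range(m), r):
            bs = [0] * m
            for i in ones:
                bs[i] = 1
            all_bs.append(tuple(bs))
        random.shuffle(all_bs)
        return all_bs[:k]
    # Large output space: rejection sampling, with a set filtering out duplicates
    seen = set()
    while len(seen) < k:
        seen.add(tuple(random.sample([0, 1], m, counts=[m - r, r])))
    return list(seen)
```

For #3 ("at most `r` ones") the same structure applies, with `counts=[m, r]` in the sampling call and `sum(math.comb(m, i) for i in range(r + 1))` as the size of the output space, keeping in mind the bias mentioned above.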