I am usually a Python / database programmer, and I am considering using C for a problem.
I have a set of sequences, each 8 characters long with 4 possible characters. My problem involves combining sets of these sequences and filtering which sets match a criterion. The combinations of 5 run into billions of rows and take around an hour to generate.
So I can represent each sequence as 2 bytes (2 bits per character × 8 characters).
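For illustration, the packing might look like this in C; the ACGT alphabet and its 2-bit mapping are just example choices for the 4 possible characters:

```c
#include <stdint.h>

/* Pack an 8-character sequence into 16 bits, 2 bits per character.
   The mapping A=0, C=1, G=2, T=3 is arbitrary; any fixed mapping works. */
uint16_t pack(const char *seq)
{
    uint16_t code = 0;
    for (int i = 0; i < 8; i++) {
        uint16_t bits;
        switch (seq[i]) {
        case 'A': bits = 0; break;
        case 'C': bits = 1; break;
        case 'G': bits = 2; break;
        default:  bits = 3; break;  /* 'T' */
        }
        code = (uint16_t)((code << 2) | bits);
    }
    return code;
}
```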
If I am working on a 64 bit architecture, will I gain any advantage by keeping these data structures as 2 bytes when I generate the combinations, or would I be as well storing them as 8 bytes / a double? (64 bits = 8 × 8)
If I am on a 64 bit architecture, all registers will be 64 bit, so operations on them shouldn't be any faster (please correct me if I am wrong).
Will I gain anything from the smaller storage requirements – can I fit more combinations in memory, or will they all take up 64 bits anyway?
And finally, am I likely to gain anything by coding this in C? I have a first version which stores each sequence as a smallint in a MySQL database and then self-joins the table to itself a number of times to generate all the possible combinations. The performance is acceptable, depending on how many combinations are generated. I assume the database must involve some overhead.
You are likely to optimize for the wrong case. Remember the two golden rules of optimization:
- Premature optimization is the root of all evil.
- No optimizing without profiling.
As 4⁸ = 65536 distinct sequences is significantly less than your "billions of combinations", I doubt you are going to run into performance issues regardless of the programming language you are using. Have you prototyped this in Python? Is it not fast enough? If you do want to optimize, try to find an algorithm that doesn't even create unwanted combinations, for example by pruning as sketched below.
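A minimal sketch of that idea (the partial_ok predicate is hypothetical, and combinations of 3 rather than 5 are shown for brevity): nested loops with increasing start indices generate each unordered combination exactly once, and a failed partial check skips the whole branch.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical predicate: can a partial combination of `len` sequences
   still satisfy the criterion? Replace with the real filter; this
   placeholder accepts everything. */
static int partial_ok(const uint16_t *combo, size_t len)
{
    (void)combo;
    (void)len;
    return 1;
}

/* Generate combinations of 3 (extend to 5 the same way) from n distinct
   sequence codes, pruning whole branches as soon as a partial
   combination already fails the criterion. */
static void combos3(const uint16_t *codes, size_t n)
{
    uint16_t combo[3];
    for (size_t a = 0; a < n; a++) {
        combo[0] = codes[a];
        if (!partial_ok(combo, 1)) continue;
        for (size_t b = a + 1; b < n; b++) {
            combo[1] = codes[b];
            if (!partial_ok(combo, 2)) continue;
            for (size_t c = b + 1; c < n; c++) {
                combo[2] = codes[c];
                if (partial_ok(combo, 3)) {
                    /* keep or output the accepted combination */
                }
            }
        }
    }
}
```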
The advantage of representing a single sequence as an integer (for now, its size is irrelevant) is that you can iterate through all 65536 sequences by simply incrementing that integer. This is going to be fast regardless of language. Note that floating-point numbers are not useful here! In C, you could use a uint16_t or larger in this scenario. You need a separate termination condition regardless of the type you are using.
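A minimal sketch of that loop: with a uint16_t counter, a condition like code <= 65535 is always true and a plain for loop would never terminate, so the wrap-around back to 0 can serve as the terminator instead.

```c
#include <stdint.h>

int main(void)
{
    uint16_t code = 0;
    do {
        /* filter/process the sequence encoded in `code` here */
    } while (++code != 0);  /* stops when the counter wraps back to 0 */
    return 0;
}
```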
How many bits any type actually takes up in memory is subject to alignment. You should only worry about this after re-reading the two optimization rules above, because your compiler knows more about alignment than you do.
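For illustration: a bare array of uint16_t stays tightly packed at 2 bytes per element, but padding appears as soon as you mix types in a struct (a sketch; the value 16 assumes a common 64-bit ABI):

```c
#include <stdint.h>
#include <stdio.h>

struct rec {
    uint16_t seq;    /* 2 bytes of payload...                */
    uint64_t score;  /* ...but its 8-byte alignment forces
                        6 bytes of padding before it         */
};

int main(void)
{
    printf("%zu\n", sizeof(struct rec));  /* typically 16, not 10 */
    return 0;
}
```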
If I am working on a 64 bit architecture, will I gain any advantage by keeping these data structures as 2 bytes when I generate the combinations, or would I be as well storing them as 8 bytes / a double? (64 bits = 8 × 8)
Squeezing your data into 2 bytes instead of 8 will obviously cut the memory needed for large arrays of these sequences to about a quarter. This fact is largely independent of your processor architecture, whether it is 64, 32, 16 or just 8 bit.
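To put numbers on it: a billion combinations of five sequences take 5 × 2 bytes × 10⁹ ≈ 10 GB packed, versus roughly 40 GB at 8 bytes per sequence.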
Whether this results in any measurable performance improvement depends on whether the packed representation of the data you process as a whole fits completely into a faster tier of memory (for example, main memory) while the unpacked representation does not (for example, forcing you to spill onto external hard disk space).
Honestly, since you did not provide any details about what your “filtering” looks like, it is unclear whether you need to unpack the data before you can process it further, which may use up additional computing time. Or the difference in your case may turn out to be so small that the packing is entirely unnecessary. So what you have to do here is: try it and measure, measure, measure!
On a typical 64 bit processor, 64 bit operations are a tiny, tiny bit faster than smaller operations. However, if you have the same number of items taking four or eight times more storage, you will pay a much bigger penalty once your data no longer fits into your cache. So 50,000 8 bit items will be processed much faster than 50,000 64 bit items. On the other hand, ten 8 bit items will be slightly slower than ten 64 bit items.
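For scale, assuming typical cache sizes: 50,000 8 bit items are about 49 KB, which fits within a common 256 KB L2 cache (and nearly within a 32 KB L1 data cache), while 50,000 64 bit items are about 390 KB and may already spill out of L2.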