I have a text file of ~600 CIDR-notation IP blocks which, when expanded, amount to ~17.5M IP addresses. I need to attempt a socket connection to each one. If it connects, I add the address to a “live” list; if it returns an error or is refused, I add it to a “dead” list. Then the socket is closed. I don’t need to read from it or write to it. Obviously, this is a problem of scale: even if we generously assume that each connection takes only one second to return success or failure, doing them one at a time would take months (about 200 days), and more likely several years. I need to get it down to <24 hours.
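As a rough check on the scale, here is a minimal sketch of the arithmetic. The function names and the timeout values are illustrative assumptions, not measurements; the only input taken from the problem is the ~17.5M address count.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def sequential_days(addresses, seconds_per_attempt):
    """Days needed if attempts run strictly one after another."""
    return addresses * seconds_per_attempt / SECONDS_PER_DAY


def required_rate(addresses, deadline_seconds=SECONDS_PER_DAY):
    """Completed attempts per second needed to meet the deadline."""
    return addresses / deadline_seconds


def sockets_in_flight(addresses, seconds_per_attempt,
                      deadline_seconds=SECONDS_PER_DAY):
    """Concurrent attempts needed: rate multiplied by time per attempt."""
    rate = required_rate(addresses, deadline_seconds)
    return math.ceil(rate * seconds_per_attempt)
```

With 17,500,000 addresses, `sequential_days(17_500_000, 1.0)` is about 202.5 days, and finishing in 24 hours needs about 203 completed attempts per second. If a dead host only fails after a 2-second timeout (an assumed figure), that means roughly 406 attempts in flight at once in the worst case.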
Right now I’m using Python to expand/count each of the IP addresses, because it is trivial to do so. I am writing a simple multi-threaded C program to address the above problem. There are a few ways I have thought of to tackle this:
- Purely using C: I have not found a way to expand a CIDR block in C (handling strings in general is a pain). I could probably cook something up, but if something already exists I’d love to hear about it.
  Will I be able to spawn enough threads? Even if I spawn one thread per block, that’s 600 threads! I feel like I might need to shrink the stack space allotted to each thread to do this. Even so, I need to be able to handle a large number of strings because the blocks need to be expanded. Regardless, I have looked at the list by hand, and one of the blocks is a /10, which amounts to >4M IPs by itself. This would still take far too long.
- Spawning C processes from Python: This would trivialize the string problem, and each individual IP could be sent to an instance of a C function called from Python, which would then end. The question I have is: when Python calls an external C function, does it continue running in parallel with the C code, or does it wait for the C function to complete? I know Python does not really allow multi-threading (or rather, it does, but it’s somewhat of a joke since only one thread runs Python code at a time), so is this the correct way to “export” multi-threading?
- Vice versa: As above, but with C calling Python code. Is this “more” correct? Which is to say, can C start multiple Python processes and continue to do its own thing?
- Something completely different.
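On the CIDR expansion point: for IPv4 it comes down to integer arithmetic on a 32-bit value, so no large set of strings has to be kept around. The sketch below is written in Python for brevity, but each step maps onto a shift or mask in C. The names `cidr_range` and `to_dotted` are made up for illustration, and `100.64.0.0/10` is only an example block, not one taken from the list.

```python
def cidr_range(block):
    """Return (first_address, address_count) for an IPv4 CIDR block.

    Addresses are plain 32-bit integers, so a block can be walked with a
    counter instead of being expanded into strings up front.
    """
    base_text, prefix_text = block.split("/")
    prefix = int(prefix_text)
    a, b, c, d = (int(part) for part in base_text.split("."))
    base = (a << 24) | (b << 16) | (c << 8) | d
    # Network mask: `prefix` one-bits followed by zero-bits.
    # Note: in C, shifting a 32-bit value by 32 is undefined, so a /0
    # prefix would need a special case there.
    mask = (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF
    first = base & mask
    count = 1 << (32 - prefix)
    return first, count


def to_dotted(value):
    """Format a 32-bit integer as a dotted-quad string."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))
```

For the example block, `cidr_range("100.64.0.0/10")` gives a count of 4,194,304 addresses, and `to_dotted(first + count - 1)` is `100.127.255.255`.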
Any questions, suggestions, or concerns are welcome. Please point out anything I might be missing or incorrect assumptions I have made.
Thank you for your help.
You’re going to struggle to make this work as well as you’re hoping. The precise figures vary by operating system, but if you try to keep more than a few hundred sockets open at a time on an ongoing basis, you’re going to start running out of system resources pretty quickly. On Windows desktop machines the limit is lower still (desktop editions of Windows restrict this kind of activity as part of an intentional plan to reduce the effectiveness of DDoS attacks and worms).
I would suggest:
- Use a single-threaded process and non-blocking I/O (e.g. `select` in C; I don’t know if Python supports this).
- Distribute your task over a small cluster so that you only need 100 or so sockets on each machine. A cloud service (e.g. Amazon EC2) may be your best option.
Also see https://stackoverflow.com/a/3923785/441899, which has hints on tuning a Linux system to increase the number of parallel connection attempts you can make.
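For illustration, Python's standard library does expose this through the `select` and `selectors` modules. What follows is a minimal single-threaded sketch, not a tuned scanner: the name `check_hosts`, the cap of 100 sockets in flight, and the 2-second timeout are all assumptions chosen for the example.

```python
import errno
import selectors
import socket
import time


def check_hosts(targets, max_in_flight=100, timeout=2.0):
    """Classify (host, port) pairs as live or dead from a single thread."""
    sel = selectors.DefaultSelector()
    pending = iter(targets)
    in_flight = {}  # socket -> (target, deadline)
    live, dead = [], []
    exhausted = False

    def finish(sock, target, ok):
        sel.unregister(sock)
        sock.close()
        del in_flight[sock]
        (live if ok else dead).append(target)

    while not exhausted or in_flight:
        # Top up to the cap so the number of open sockets stays bounded.
        while not exhausted and len(in_flight) < max_in_flight:
            target = next(pending, None)
            if target is None:
                exhausted = True
                break
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.setblocking(False)
            err = sock.connect_ex(target)
            if err not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
                sock.close()
                dead.append(target)  # failed immediately
                continue
            sel.register(sock, selectors.EVENT_WRITE, target)
            in_flight[sock] = (target, time.monotonic() + timeout)
        if not in_flight:
            continue
        # A socket becomes writable once its connect has finished;
        # SO_ERROR then says whether it succeeded or failed.
        for key, _ in sel.select(timeout=0.1):
            err = key.fileobj.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
            finish(key.fileobj, key.data, err == 0)
        # Anything still pending past its deadline counts as dead.
        now = time.monotonic()
        for sock, (target, deadline) in list(in_flight.items()):
            if now >= deadline:
                finish(sock, target, False)
    sel.close()
    return live, dead
```

A call such as `check_hosts([("192.0.2.1", 80), ("192.0.2.2", 80)])` returns two lists of targets. The port is an assumption here, since the question does not say which port is being probed.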
You need to break this down into two steps:
First, use Python to parse the text file and generate a list of IP addresses in a form that is easy to consume in C.
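One way this first step might look, as a sketch: write each address as a 4-byte integer in network byte order, so the C side can read fixed-size records without any string parsing. The function name `expand_to_binary`, the one-block-per-line input, and the `#` comment convention are assumptions for the example.

```python
import ipaddress
import struct


def expand_to_binary(cidr_lines, out):
    """Write every address in every IPv4 block to `out` as 4 bytes.

    `cidr_lines` is any iterable of text lines, one CIDR block per line.
    `out` is a binary file-like object. Returns the number of addresses.
    """
    count = 0
    for line in cidr_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and (assumed) comment lines
        # strict=False tolerates blocks written with host bits set.
        network = ipaddress.ip_network(line, strict=False)
        for address in network:
            out.write(struct.pack("!I", int(address)))
            count += 1
    return count
```

At 4 bytes per address, 17.5M addresses come to about 70 MB, which a C program can read in fixed-size chunks or map into memory.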
Second, let’s look at the exact problem.
You want to “connect” but you are not going to read or write. I am not sure what the purpose of this is. Couldn’t you use ping to accomplish the same thing?
If you still want to open a socket, then you need to implement the three-way TCP handshake (SYN, SYN-ACK, ACK) yourself in a single thread. You will be dealing directly with the underlying IP layer, essentially simulating what TCP does for you.
If you remember that each “connection” is really just a pair of (address, port) combinations, and that you have 64k ports at your disposal, then your speed is limited only by the latency of the handshake.
(This is starting to sound like a good homework question…)
If you can fire out the SYN packets at a rate of several hundred a second, and each transaction has a round-trip latency of 200 ms, you can work out how long it will take you to work through your list of millions of addresses.
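A minimal sketch of that calculation, assuming the sends are pipelined so the round trip is paid only once at the end rather than per address. The rates used are illustrative, not measured.

```python
def hours_to_scan(addresses, syn_per_second, round_trip_seconds=0.2):
    """Hours to work through the list when SYNs are sent back to back.

    Replies overlap with later sends, so total time is the send time
    plus a single trailing round trip.
    """
    total_seconds = addresses / syn_per_second + round_trip_seconds
    return total_seconds / 3600
```

For 17,500,000 addresses this gives about 24.3 hours at 200 SYNs per second, 16.2 hours at 300, and 9.7 hours at 500.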
Here are some useful references.
You want to learn to use raw sockets. You will be implementing the TCP handshake yourself.
http://www.tenouk.com/Module43a.html
https://stackoverflow.com/questions/110341/tcp-handshake-with-sock-raw-socket
http://gonullyourself.org/library/A%20brief%20programming%20tutorial%20in%20C%20for%20raw%20sockets.txt
Good luck
Focus on the problem, not the solution. Taking this problem back to abstract terms, you’ve got a scenario that screams out for multiple processes communicating via queues. One process (the “input reader”) would loop, reading items from your input list (CIDR blocks) and appending them to a to-be-enumerated queue. A second set of processes (the “enumerators”) would loop, grabbing the topmost to-be-enumerated item, expanding it, and appending the results (individual IP addresses), one at a time, to a to-be-checked queue. A third set of processes (the “checkers”) would loop, grabbing the topmost to-be-checked item, performing the check, and appending the result to a to-be-reported queue. The last process (the “reporter”) would loop, grabbing the topmost item from the to-be-reported queue and writing it to the final results.
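The shape of that pipeline can be sketched as follows. To keep the example short and self-contained it uses threads and `queue.Queue` as stand-ins for separate processes, and a single enumerator rather than a set of them. The `check` callable is a hypothetical placeholder for the real connection test, and `run_pipeline` and `STOP` are names invented for the example.

```python
import ipaddress
import queue
import threading

STOP = object()  # sentinel marking the end of a queue


def run_pipeline(cidr_blocks, check, n_checkers=4):
    """Run reader -> enumerator -> checkers -> reporter over queues."""
    to_enumerate = queue.Queue()
    # Bounded, so expansion cannot run far ahead of the checkers.
    to_check = queue.Queue(maxsize=10_000)
    to_report = queue.Queue()
    results = {}

    def reader():
        for block in cidr_blocks:
            to_enumerate.put(block)
        to_enumerate.put(STOP)

    def enumerator():
        while (block := to_enumerate.get()) is not STOP:
            for address in ipaddress.ip_network(block, strict=False):
                to_check.put(str(address))
        for _ in range(n_checkers):
            to_check.put(STOP)  # one sentinel per checker

    def checker():
        while (address := to_check.get()) is not STOP:
            to_report.put((address, check(address)))
        to_report.put(STOP)

    def reporter():
        remaining = n_checkers
        while remaining:
            item = to_report.get()
            if item is STOP:
                remaining -= 1
            else:
                address, is_live = item
                results[address] = is_live

    workers = [reader, enumerator, reporter] + [checker] * n_checkers
    threads = [threading.Thread(target=worker) for worker in workers]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Passing a fake `check`, such as `lambda address: address.endswith("1")`, exercises the plumbing without touching the network. Swapping in real processes would mean replacing the queues with an inter-process equivalent, which this sketch does not attempt.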