Can Python be used efficiently in the big data field? To be precise, I am building a web app that analyses really big data in the medical health care field, consisting of medical histories and a lot of personal information. I need some advice on how to handle very big data in Python efficiently and with high performance. Also, are there some open source packages available in Python that offer high performance and efficiency for big data handling?
About users and data:
Each user has about 3 GB of data. Users are grouped based on their family and friend circles, and the data is then analysed to predict important information and correlations. Currently I am talking about 10,000 users, and the number of users will be increasing rapidly.
That is a very vague question; there is no canonical definition of what constitutes big data. From a development point of view, the only thing that truly changes how you need to handle data is having so much of it that you can't fit it all in memory at once.
How much of a problem that is depends greatly on what you need to do with the data. For most jobs you can use a single-pass scheme: load a block of data, do whatever needs to be done with it, unload it, and go on to the next.
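A minimal sketch of that single-pass idea, assuming (purely as an example) that a user's measurements sit in a CSV with hypothetical measurement_type and value columns; pandas' chunksize option reads the file one block at a time, so only running aggregates stay in memory:

```python
import pandas as pd

# Hypothetical single-pass scheme: stream a large CSV in fixed-size blocks and
# keep only running aggregates, never the whole file, in memory.
totals = {}
for chunk in pd.read_csv("user_measurements.csv", chunksize=100_000):
    # process this block, then let it go before loading the next one
    for kind, value in chunk.groupby("measurement_type")["value"].sum().items():
        totals[kind] = totals.get(kind, 0.0) + value

print(totals)
```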
Sometimes the problem can be solved with an organization pass: first go through the data, grouping it into chunks that need to be handled together, then go through each chunk.
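One way to sketch such an organization pass (the file name and the user_id grouping key are assumptions) is to shard records into a small number of bucket files so that everything that belongs together lands in the same bucket, and then process each bucket on its own:

```python
import csv

# Hypothetical organization pass: hash a grouping key into a handful of bucket
# files so each bucket can later be handled independently and fits in memory.
NUM_BUCKETS = 16
buckets = [open(f"bucket_{i}.csv", "w", newline="") for i in range(NUM_BUCKETS)]
writers = [csv.writer(f) for f in buckets]

with open("events.csv", newline="") as src:
    reader = csv.reader(src)
    header = next(reader)
    key_index = header.index("user_id")
    for w in writers:
        w.writerow(header)
    for row in reader:
        writers[hash(row[key_index]) % NUM_BUCKETS].writerow(row)

for f in buckets:
    f.close()

# Second pass (not shown): load and process each bucket_<i>.csv separately.
```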
If that strategy doesn't fit your task, you can still get a long way with OS-handled disk swapping: handle the data in blocks as far as possible, and if you need a little arbitrary access here and there, it will still work.
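If the data can be dumped to a flat binary file, numpy's memmap gives you roughly that behaviour: the OS pages blocks in and out on demand, sequential block processing stays cheap, and the occasional random read still works. A small sketch, assuming a hypothetical measurements.dat file of raw float32 values:

```python
import numpy as np

# Memory-map a large binary file; nothing is loaded until it is touched.
data = np.memmap("measurements.dat", dtype=np.float32, mode="r")

# Sequential, block-wise processing stays fast...
block = 1_000_000
block_means = [data[i:i + block].mean() for i in range(0, len(data), block)]

# ...and a little arbitrary access here and there still works.
spot_check = data[123_456] if len(data) > 123_456 else None
```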
And of course an always excellent strategy when dealing with a lot of data is to dwarf it with hardware. You can get 64 GB of memory in 16 GB sticks for about $500; if you are working with that much data, it is an easily justified investment. Some good SSDs are a no-brainer.
Specific case:
A big part of this job will definitely be reducing those 3 GB of data per person. Figuring out what can be thrown away is often a bit of an art in its own right, but given the volume I must presume that you have a fair amount of bulk measurements. In general, you should first find patterns and aggregations within those data, and then use those results for comparing persons to one another. The majority of your raw data is noise, repetition or irrelevant detail; you have to cut that away.
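As an illustration of what that reduction might look like, here is a hedged sketch that collapses one person's bulk measurements into a small set of summary statistics (the file layout and the measurement_type/value columns are assumptions, not anything from your data):

```python
import pandas as pd

def reduce_person(path):
    """Turn one person's raw measurement file into a small feature vector."""
    raw = pd.read_csv(path)  # assumed columns: measurement_type, value
    summary = raw.groupby("measurement_type")["value"].agg(["mean", "std", "min", "max"])
    # One flat Series of (measurement_type, statistic) features per person,
    # which is what gets compared across persons later on.
    return summary.stack()
```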
This reduction process is well suited to a cluster, since you can simply give each process its own pile of persons.
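A minimal sketch of that, using the standard multiprocessing module on a single machine (the file names and the reduction itself are hypothetical; on a real cluster each node would get its own slice of the file list):

```python
from multiprocessing import Pool
import pandas as pd

def reduce_person(path):
    # Hypothetical per-person reduction; it only touches this one person's
    # file, so workers never need to share any data.
    raw = pd.read_csv(path)
    return raw.groupby("measurement_type")["value"].mean()

person_files = [f"person_{i}.csv" for i in range(10_000)]  # hypothetical paths

if __name__ == "__main__":
    with Pool(processes=8) as pool:
        reduced = pool.map(reduce_person, person_files)
```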
The processing thereafter is a bit trickier; what is optimal depends on a lot of factors, and you will probably have to do some trial and error. If you can make it fit the job, try to load selected pieces of data from all persons onto the same computer and compare those, and do the same with other pieces of data on other computers. Use those results as new data sets, and so on.
It depends on what you want from your handling of big data. This concept is relatively vague. For example, if you’re talking about MapReduce jobs across disparate data sources, then you may be interested in using Hadoop Streaming with the Dumbo library. If you’re talking about statistical analysis, then NumPy and SciPy (as mentioned by Akira71) are interesting, as well as pandas (a data analysis toolkit). If you want graphing, look into matplotlib.
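To make the Hadoop Streaming option concrete: a streaming job is just a pair of scripts that read lines on stdin and emit tab-separated key/value pairs on stdout, and Dumbo wraps that same idea in a more Pythonic API. A hypothetical mapper/reducer pair counting records per user (assuming the user id is the first CSV field) might look like this:

```python
# mapper.py -- emit one (user_id, 1) pair per input record
import sys

for line in sys.stdin:
    user_id = line.split(",")[0]
    print(f"{user_id}\t1")
```

```python
# reducer.py -- sum the counts per user; Hadoop delivers the input sorted by key
import sys

current, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = key, 0
    count += int(value)
if current is not None:
    print(f"{current}\t{count}")
```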
However, if you’re talking about the storage and querying of big data, Python is not your best bet. You will want something like the Hadoop ecosystem to make this perform well, perhaps with layers on top for querying and building intermediate data sets. One project that really interests me is Spark; you may want to look at it as well. Unfortunately, this type of application framework does not play to Python’s strengths.
Python is used extensively in the big data field. There are a couple of packages that tend to get used quite a bit, and they are probably the main reason Python has made such deep inroads into big data:
- NumPy – the fundamental package for scientific computing with Python
- SciPy – a package for mathematics, science, and engineering
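As a tiny, purely illustrative example of the kind of analysis they make easy (the numbers below are synthetic, not from any real data set):

```python
import numpy as np
from scipy import stats

# Generate two synthetic measurement series and test how correlated they are.
rng = np.random.default_rng(0)
blood_pressure = rng.normal(120, 15, size=1_000)
heart_rate = 0.3 * blood_pressure + rng.normal(70, 10, size=1_000)

r, p_value = stats.pearsonr(blood_pressure, heart_rate)
print(f"correlation={r:.2f}, p-value={p_value:.3g}")
```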
Both are open source, and together with Python's popularity and ease of learning they have pretty much catapulted its use in academia. This in turn has caused it to be used more and more outside academia and in larger companies, as students moving into work roles bring these packages with them.
These are very good packages, and I have dabbled with them in a few projects. However, I have not used Python enough in big data projects to answer your ancillary question on how to handle big data efficiently with Python.