I want to port a C program to Python 3. The C code writes its output to a ".gz" file using zlib.h and gzFile.
I know that several Python 3 tools, such as zlib, gzip, and pandas.DataFrame.to_pickle(compression="gzip"), can write gzip-compressed files. However, the compression ratio is very different! How can I reproduce in Python 3 exactly what zlib.h does in C?
I’ve tried everything I know how to do.
First, the example C code:
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>
#include <math.h>
#include <time.h>

int main()
{
    /* Prototypes from the 64-bit Mersenne Twister (mt19937-64.c) */
    void init_genrand64(unsigned long long);
    double genrand64_real2(void);

    init_genrand64((unsigned long long)time(NULL));
    double r = 0;
    gzFile fp = gzopen("test_in_C.gz", "wb");
    for (int i = 0; i < 293; ++i)
    {
        for (int j = 0; j < 10000; ++j)
        {
            r = genrand64_real2();
            if (j < 9999)
            {
                /* 8 decimal places, comma separated */
                gzprintf(fp, "%.8lf,", r);
            }
            else
            {
                /* last value on the line ends with a newline */
                gzprintf(fp, "%.8lf\n", r);
            }
        }
    }
    gzclose(fp);
}
Next, all of the Python 3 code:
import pandas as pd
import numpy as np

df = pd.DataFrame(columns = range(293), data = np.random.rand(10000,293))

# CSV, plain text
df.to_csv("test_dat.csv", index = False, header = False)

# pickle, binary
df.to_pickle("test_dat.pk")

# pickle, compressed binary
df.to_pickle("test_dat.pkgz", compression = "gzip")

# npy, binary
np.save("test_dat.npy", df.values)

import gzip

# npy, compressed binary
with gzip.open("test_dat.npygz", "wb") as f:
    np.save(f, df.values)

# CSV, compressed (values written with full str() precision)
with gzip.open("test_dat.gz", "wb") as f:
    for i in range(len(df)):
        for j in range(len(df.columns)):
            if j < len(df.columns) - 1:
                f.write((str(df.iloc[i,j]) + ",").encode())
            else:
                f.write((str(df.iloc[i,j]) + "\n").encode())

# raw array bytes, zlib-compressed
import zlib

dat = zlib.compress(df.values)
with open("test_dat.zl", "wb") as f:
    f.write(dat)
The resulting file sizes:
test_in_C.gz 13MB
test_dat.csv 54MB (56460866B)
test_dat.pk 22MB (23440571B)
test_dat.npy 22MB (23440128B)
test_dat.pkgz 21MB (22104880B)
test_dat.npygz 21MB (22104445B)
test_dat.gz 24MB (25663307B)
test_dat.zl 21MB (22104671B)
In this experiment, the .npygz file has the best compression ratio of the Python outputs, but it is still about 1.6 times the size of test_in_C.gz, the file written by the C program (21MB versus 13MB). The C file and the Python files hold the same amount of data (10000 × 293 values).
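One thing the numbers above suggest is that the files do not hold the same bytes per value. The C format "%.8lf," is 10 characters plus a separator, so 11 bytes per value, while test_dat.csv works out to 56460866 B / 2,930,000 values, or roughly 19.3 bytes per value. The following is a minimal sketch, not a benchmark, that compresses the same random numbers in three encodings to see how much the encoding alone changes the result. The sample size of 20000 and the zlib level 6 are assumptions chosen for illustration.

```python
import zlib

import numpy as np

rng = np.random.default_rng(0)
values = rng.random(20000)  # small sample; the real data has 10000 * 293 values

# The same numbers in three encodings
text_8dp = ",".join("%.8f" % v for v in values).encode()       # like gzprintf "%.8lf"
text_full = ",".join(str(float(v)) for v in values).encode()   # like str(df.iloc[i, j])
raw_bytes = values.tobytes()                                   # like the npy / pickle payload

for name, payload in [("8 decimals", text_8dp),
                      ("full repr", text_full),
                      ("raw float64", raw_bytes)]:
    packed = zlib.compress(payload, 6)
    print(f"{name:12s} {len(payload) / len(values):6.2f} B/value before, "
          f"{len(packed) / len(values):6.2f} B/value after compression")
```

If the 8-decimal text comes out clearly smallest, the gap is mostly about how many digits are stored, not about zlib itself.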
How can I close this gap? This matters a great deal to me, because I will be handling very large data sets (about 800MB per file × 1,000 files produced by the C program). If the Python files are that much larger, my PC and my Python code may not be able to cope.
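For a like-for-like comparison, the Python side would have to write the same text that the C program writes: values with 8 decimal places, comma separated, one row per line. Below is a minimal sketch of such a writer. The function name write_like_c and the output file name are hypothetical, random.Random stands in for genrand64_real2(), and compresslevel=6 is chosen on the assumption that gzopen(..., "wb") uses zlib's default level.

```python
import gzip
import random


def write_like_c(path, rows=293, cols=10000, seed=None):
    """Write rows x cols uniform values as gzip-compressed CSV with 8 decimals."""
    rng = random.Random(seed)  # stand-in for genrand64_real2(); an assumption
    # "wt" opens the gzip stream in text mode; newline="\n" disables translation
    with gzip.open(path, "wt", compresslevel=6, newline="\n") as fp:
        for _ in range(rows):
            # "%.8f" in Python formats like the C "%.8lf" conversion
            fp.write(",".join("%.8f" % rng.random() for _ in range(cols)))
            fp.write("\n")
```

Calling write_like_c("test_like_c.gz") would produce a file that can be compared directly with test_in_C.gz. For an existing DataFrame, np.savetxt(f, df.values, fmt="%.8f", delimiter=",") on a file opened with gzip.open(..., "wt") should give the same layout.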
Thank you for your help.