I have a TSV file that contains bunch of lines prefixed with #
. While there are several columns in the file, I want only first and third columns – and those columns are duplicated, so I want to reduce it as much as I can.
The function in Python that does this:
def read_faulty_tsv(path):
# 1.74 s ± 18.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
with open(path) as _f:
i = 0
header = None
for line in _f.readlines():
if line.startswith("#"):
i += 1
else:
break
data = pd.read_csv(path, sep="t", skiprows=i, header=None)
data = data.drop_duplicates([0,2])
return data
As noted with the timeit
magic, the execution time is 1.74s which is great!
A similar function in Rust is slower. I’ve tried two options – one which just native methods, and another using csv crate (csv="1.1"
in .toml). The first one takes 5.260 seconds and the second one takes 7.729 seconds which is drastically slower!
Am I doing something very wrong to have Python out-perform Rust?
The functions are below:
fn read_fault_tsv(file_path:&PathBuf)->Result<HashMap<u32,u32>, Box<dyn Error>>{
let mut final_labels: HashMap<u32, u32> = HashMap::new();
let file = File::open(file_path)?;
let reader = BufReader::new(file);
// Create a vector to store non-comment lines
let mut lines = Vec::new();
let start_time = Instant::now();
// Read lines and skip comment lines
for line in reader.lines() {
let line = line?;
if !line.starts_with('#') {
lines.push(line);
}
}
let joined_lines = lines.join("n");
// Create a CSV reader with tab delimiter from the collected lines
let mut rdr = ReaderBuilder::new()
.delimiter(b't')
.from_reader(joined_lines.as_bytes());
for result in rdr.records(){
match result {
Ok(record)=>{
let first_c = record[0].parse::<u32>().unwrap();
if !final_labels.contains_key(&first_c){
let third_c = record[0].parse::<u32>().unwrap();
final_labels.insert(first_c,third_c);
}
}
Err(e) => {
eprintln!("Error -- {}",e)
}
}
}
let elapsed_time = start_time.elapsed();
println!("Elapsed time: {}.{:03} seconds", elapsed_time.as_secs(), elapsed_time.subsec_millis());
Ok(final_labels)
}
fn read_fault_tsv2(file_path:&PathBuf)->Result<HashMap<u32,u32>, Box<dyn Error>>{
let mut final_labels: HashMap<u32, u32> = HashMap::new();
let file = File::open(file_path)?;
let reader = BufReader::new(file);
let lines = reader.lines()
.filter_map(Result::ok)
.filter(|line|!line.starts_with('#'));
let start_time = Instant::now();
for line in lines{
let columns: Vec<&str> = line.split('t').collect();
if columns.len() > 2 {
if let Ok(key) = columns[0].parse::<u32>() {
if !final_labels.contains_key(&key) {
if let Ok(value) = columns[2].parse::<u32>() {
final_labels.insert(key, value);
}
}
}
}
};
let elapsed_time = start_time.elapsed();
println!("Elapsed time: {}.{:03} seconds", elapsed_time.as_secs(), elapsed_time.subsec_millis());
Ok(final_labels)
}