I’m trying to read all the genotypes in a VCF file using rust_htslib, convert them into a different coding, and store them in a Vector of Vectors, along with some metadata in a HashMap. I’ve got it working with the following code, but it’s exceedingly slow (10x slower than working with the VCF as a text file).
Is anyone familiar with rust_htslib and suggest a faster way to do this? I assume if I could pull all the genotypes for a record out at once and convert it to a Vector or similar, it would be more performant.
Am I missing something here? Surely there is a faster way to read the genotypes.
let mut bcf = Reader::from_path(vcf).expect("Error opening file.");
let sample_count = usize::try_from(bcf.header().sample_count()).unwrap();
let mut anml_lookup: HashMap<i32, usize> = HashMap::with_capacity(sample_count);
let mut genotypes: Vec<Vec<i8>> = vec![Vec::with_capacity(MAX_MARKERS); sample_count];
let mut inform: HashMap<usize, i32> = HashMap::with_capacity(sample_count);
for record in bcf.records().map(|r| r.expect("No record")) {
let b = rust_htslib::bcf::record::Buffer::new();
let gts = record.genotypes_shared_buffer(b).expect("Can't get GTs");
for sample_index in 0..sample_count {
let gt: (i8, i32) = conv(>s.get(sample_index).to_string());
*inform.entry(sample_index).or_insert(0) += gt.1;
genotypes[sample_index].push(gt.0);
}
}
AEON is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
Worked out how to solve it, though it’s not documented so leaving a note here for anyone else who comes across this.
It turns out you can access an array of the genotype values via the format option. This can be done by providing the GT string tag as an array of bytes. Once you’ve got the genotypes you can store them as a vector and workwith them. This was about 3-4s faster than the original approach.
for record in bcf.records().map(|r| r.expect("No record")) {
let gts: Vec<(i8, i32)> = record
.format(b"GT")
.integer()
.expect("GTs error")
.to_vec()
.iter()
.map(|a| gtconv(a))
.collect();
for (mygt, sample_index) in gts.iter().zip(0..sample_count) {
inform[sample_index] += mygt.1;
genotypes[sample_index].push(mygt.0);
}
}
AEON is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.