I have over 1 billion rows of numbers, where each row contains 10 pairs.
I have another set of pairs, call them “edges”.
I need to loop through the edges. For each edge, I need to check whether it appears as a pair in any of the rows, and if it does, update that row by replacing the matching pair with some placeholder symbols.
Are there quick ways to do this? I am OK with paying for software if the most efficient option costs money, but it would need to be programmable enough for me to write code that updates the rows as described above.
Approximately how long would you say this process will take? What factors will it depend on?
Thank you for any help!
I have tried plain Python with the standard library as follows:
for edge in edges:
    pair = (min(edge), max(edge))
    str_pair = f'({str(pair[0])}, {str(pair[1])})'
    print(str_pair)
    with open(buffer_path, 'r+', buffering=1024*1024) as file:
        while True:
            line_start = file.tell()
            line = file.readline()
            if not line:
                break
            else:
                if str_pair in set(line.split(".")):
                    new_line = line.replace(str_pair, len(str_pair) * '_')
                    file.seek(line_start)
                    file.write(new_line)
                else:
                    file.seek(line_start)
                    file.write(line)
But this is far too slow.
I think you can try some of Python's libraries dedicated to scientific computing or data science, such as numpy or pandas, and use HDF5 files to store/retrieve your data with h5py.
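For the storage side, here is a minimal sketch of what that could look like with h5py; the file name, dataset name, the shape (num_rows, 10, 2) and the int64 dtype are assumptions for illustration, since your exact on-disk format isn't specified:

import h5py
import numpy as np

# Hypothetical layout: one chunked dataset of shape (num_rows, 10, 2),
# i.e. 10 pairs of integers per row. Sizes here are toy values.
num_rows = 1_000_000
rows = np.random.randint(0, 10_000, size=(num_rows, 10, 2), dtype=np.int64)

with h5py.File("rows.h5", "w") as f:
    f.create_dataset("rows", data=rows, dtype=np.int64,
                     chunks=(100_000, 10, 2))   # chunked so slabs can be read/written efficiently

# Later, slabs come back as NumPy arrays with no text parsing:
with h5py.File("rows.h5", "r") as f:
    first_slab = f["rows"][:100_000]            # shape (100000, 10, 2)

Storing the rows as fixed-size integers instead of text also avoids the string formatting and parsing that dominates the pure-Python version.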
For example, your first lines could be replaced by:
import numpy as np

edges = np.asarray(edges)           # edges as a NumPy array of shape (n, 2)
pair_min = edges.min(axis=1)        # smaller element of each edge
pair_max = edges.max(axis=1)        # larger element of each edge
pairs = np.stack([pair_min, pair_max], axis=-1)   # sorted pairs, shape (n, 2)
You can use np.isin to check whether the elements of one array are present in another.
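As a minimal sketch of that idea (assuming the rows for one chunk are loaded as an integer array of shape (num_rows, 10, 2), and that the values are small enough for each sorted pair to be packed into a single integer key; the base value and the -1 sentinel below are arbitrary choices for illustration):

import numpy as np

def encode_pairs(pairs, base=1_000_000):
    # Pack each sorted (lo, hi) pair into one integer so np.isin can
    # compare whole pairs at once. `base` must exceed the largest value
    # appearing in the data (an assumption of this sketch).
    return pairs.min(axis=-1) * base + pairs.max(axis=-1)

rows = np.random.randint(0, 1000, size=(5, 10, 2))   # toy stand-in for a chunk of rows
edges = np.random.randint(0, 1000, size=(20, 2))     # toy stand-in for the edge list

row_keys = encode_pairs(rows)     # shape (5, 10), one key per pair in each row
edge_keys = encode_pairs(edges)   # shape (20,), one key per edge

# Boolean mask: True wherever a row's pair matches any edge.
matches = np.isin(row_keys, edge_keys)

# "Replace the pair with some symbols": with numeric storage, a sentinel
# value such as -1 plays the role of the underscores.
rows[matches] = -1

This checks every pair in the chunk against every edge in one call, instead of rescanning the whole file once per edge.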
As a rule of thumb, try to avoid for loops and string conversion (and intermediary I/O operations if not needed), and replace them with vectorized operations.
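Putting the two sketches together (same assumptions as above: a "rows" dataset of shape (num_rows, 10, 2) in an HDF5 file, pairs packed into integer keys, and -1 as the replacement symbol), the whole job becomes a single pass over the file in large slabs, with no per-line string handling:

import h5py
import numpy as np

CHUNK = 1_000_000   # rows per slab; tune to the available RAM

def encode_pairs(pairs, base=1_000_000):
    # Same packing as in the previous sketch: one integer key per sorted pair.
    return pairs.min(axis=-1) * base + pairs.max(axis=-1)

def mark_edges(path, edges, base=1_000_000):
    edge_keys = encode_pairs(np.asarray(edges), base)
    with h5py.File(path, "r+") as f:
        dset = f["rows"]                            # assumed dataset name from the sketch above
        for start in range(0, dset.shape[0], CHUNK):
            slab = dset[start:start + CHUNK]        # read one slab as a NumPy array
            matches = np.isin(encode_pairs(slab, base), edge_keys)
            if matches.any():
                slab[matches] = -1                  # replace matched pairs with the sentinel
                dset[start:start + CHUNK] = slab    # write the slab back in place

The runtime is then dominated by one sequential read and (at most) one write of the dataset, rather than by one full scan of a text file per edge.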
Otherwise I’m afraid that you’ll have to switch to more performant languages such as C/C++, etc.
I benchmarked using Go (golang):
package main

import (
	"fmt"
	"strings"
	"sync"
	"time"
)

const numWorkers = 12 // Number of goroutines (workers) to run

func main() {
	// Define the number of records (2 billion)
	// numRecords := 2_000_000_000
	numRecords := 100_000_000

	// Start benchmark
	simulateAndBenchmark(numRecords)
}

// Worker function to process a batch of records
func worker(id int, jobs <-chan int, wg *sync.WaitGroup, targetPair, replacement string) {
	defer wg.Done()
	for job := range jobs {
		// Simulate a record with the target pair
		record := fmt.Sprintf("Record %d: %s some other data", job, targetPair)

		// Check if the record contains the target pair
		if strings.Contains(record, targetPair) {
			// Replace the target pair with underscores
			record = strings.ReplaceAll(record, targetPair, replacement)
		}

		// For demonstration purposes, we won't print the record, but you could process it here
		// fmt.Printf("Worker %d processed record %d\n", id, job)
	}
}

func simulateAndBenchmark(numRecords int) {
	// Start timer
	start := time.Now()

	// Target string to replace
	targetPair := "(1, 2)"
	replacement := strings.Repeat("_", len(targetPair))

	// Channel to distribute jobs
	jobs := make(chan int, numWorkers)

	// WaitGroup to synchronize all workers
	var wg sync.WaitGroup

	// Start the worker pool
	for i := 1; i <= numWorkers; i++ {
		wg.Add(1)
		go worker(i, jobs, &wg, targetPair, replacement)
	}

	// Send jobs (records) to workers
	for i := 0; i < numRecords; i++ {
		jobs <- i
	}

	// Close the jobs channel to signal no more work
	close(jobs)

	// Wait for all workers to finish
	wg.Wait()

	// Record the time taken
	elapsed := time.Since(start)
	fmt.Printf("Processed %d records in %s using %d workers\n", numRecords, elapsed, numWorkers)
}
On average the process takes about 34 minutes on a machine with 16 GB of 3200 MT/s RAM and the following CPU:
11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40 GHz (base speed 2.42 GHz), 1 socket, 4 cores, 8 logical processors, virtualisation enabled, L1 cache 320 KB, L2 cache 5.0 MB, L3 cache 8.0 MB.
The results for different worker counts and record counts are as follows:
Processed 1000000 records in 724.045ms using 8 workers
Processed 1000000 records in 446.1339ms using 12 workers
Processed 10000000 records in 4.6425266s using 12 workers
Processed 100000000 records in 52.8540714s using 12 workers
Processed 2000000000 records in 32m42.0265975s using 8 workers
But it would likely be even faster in C++, since it offers more direct memory access.
I would also suggest using a better GPU and more memory to make things even faster; the tests above were run on a modest laptop with a basic GPU.