I am trying to read a file with several threads in C, but since I split it into chunks by byte count, some chunks may start or end in the middle of a line. I am trying to adjust the chunk boundaries when that happens.
I don’t know if this is a good approach; as I understand it, reading a file with multiple threads may only be beneficial in the SSD era. I also don’t know whether reading a file from several threads at once may cause some kind of memory error. Maybe I should only multithread the processing of each line and other tasks?
So, apart from that, my approach is to shift the start and the end of each chunk forward until they reach the end of a line. This way, if a chunk falls entirely inside a line (for example, when the number of threads is greater than the number of lines), the start and end will match and I will not start that thread. I don’t know what may still be missing. Most of the time it reads just fine, but sometimes a line is not read, a line is read twice by the same thread, or a small portion of a line is read by another thread when the line itself was read just before.
long chunk_size = file_size / num_threads;
pthread_t threads[num_threads];
ThreadData thread_data[num_threads];
long last_end = 0;
for (uint32_t i = 0; i < num_threads; ++i)
{
    thread_data[i].stats = stats;
    thread_data[i].thread_tweets = NULL;
    thread_data[i].failed = 0;
    thread_data[i].file = file;
    thread_data[i].start = i * chunk_size;
    thread_data[i].end = (i == num_threads - 1) ? file_size : (i + 1) * chunk_size;
    if (i > 0)
    {
        if (thread_data[i].end < thread_data[i - 1].start)
        {
            thread_data[i].failed = 1;
            continue;
        }
    }
    int ch;
    // Adjust start position to the beginning of the next line
    if (!is_start_at_line_boundary(file, thread_data[i].start))
    {
        fseek(file, thread_data[i].start, SEEK_SET);
        while ((ch = fgetc(file)) != '\n' && ch != EOF);
        thread_data[i].start = ftell(file);
    }
    // Adjust end position to the end of the line
    fseek(file, thread_data[i].end, SEEK_SET);
    while ((ch = fgetc(file)) != '\n' && ch != EOF);
    thread_data[i].end = ftell(file);
    if (ch != '\n' && ch != EOF)
    {
        thread_data[i].end++;
    }
    // If they coincide, the chunk was inside a line and the thread shouldn't run
    if (thread_data[i].end == thread_data[i].start)
    {
        thread_data[i].failed = 1;
        continue;
    }
    if (i > 0)
    {
        thread_data[i].start = last_end;
    }
    if (pthread_create(&threads[i], NULL, read_file_chunk, &thread_data[i]))
    {
        fprintf(stderr, "Error creating thread\n");
        exit(EXIT_FAILURE);
    }
    last_end = thread_data[i].end;
}
int is_start_at_line_boundary(FILE *file, long start)
{
    if (start == 0)
    {
        return 1; // Start of the file
    }
    fseek(file, start - 1, SEEK_SET);
    if (fgetc(file) == '\n')
    {
        return 1; // Start is at the beginning of a line
    }
    return 0;
}
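For comparison, here is a minimal sketch of the same boundary adjustment done in a single pass before any thread is started, with each worker then reading its range through its own handle. A `FILE` has a single file position, so a seek on a handle that a running thread is also reading from moves the position for both. Whether that explains the skipped or repeated lines here is an assumption, not something the code above confirms. The names `compute_line_chunks` and `count_lines_in_range` are hypothetical, and the sketch assumes the file is opened in binary mode (`"rb"`) so that `ftell` values are plain byte offsets.

```c
#include <assert.h>
#include <stdio.h>

/* Fill bounds[0..n] so that chunk i is the byte range [bounds[i], bounds[i+1]).
 * Every boundary is a line start (or end of file); an empty chunk has
 * bounds[i] == bounds[i+1]. Returns 0 on success, -1 on an I/O error. */
int compute_line_chunks(FILE *f, long file_size, int n, long *bounds)
{
    assert(n > 0);
    long chunk = file_size / n;
    bounds[0] = 0;
    for (int i = 1; i < n; ++i) {
        long target = (long)i * chunk;
        if (target <= bounds[i - 1]) {
            /* An earlier boundary already moved past this point: empty chunk. */
            bounds[i] = bounds[i - 1];
            continue;
        }
        /* Start one byte early: if that byte is '\n', target is already
         * a line start and the loop stops right there. */
        if (fseek(f, target - 1, SEEK_SET) != 0)
            return -1;
        int ch;
        while ((ch = fgetc(f)) != '\n' && ch != EOF)
            ;
        bounds[i] = ftell(f);
        if (bounds[i] < 0)
            return -1;
    }
    bounds[n] = file_size;
    return 0;
}

/* Count the lines in [start, end) through a private handle, so no file
 * position is shared with any other thread. Returns -1 on an I/O error. */
long count_lines_in_range(const char *path, long start, long end)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    if (fseek(f, start, SEEK_SET) != 0) {
        fclose(f);
        return -1;
    }
    long lines = 0, pos = start;
    int ch, last = '\n';
    while (pos < end && (ch = fgetc(f)) != EOF) {
        ++pos;
        if (ch == '\n')
            ++lines;
        last = ch;
    }
    if (last != '\n')
        ++lines; /* final line without a trailing newline */
    fclose(f);
    return lines;
}
```

With this layout, the main thread would call `compute_line_chunks` once, skip every chunk whose two bounds are equal, and only then call `pthread_create` for the rest, passing each thread its own `[start, end)` pair. Because consecutive chunks share a boundary value, each line belongs to exactly one range.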