I am parsing a large text file line by line in PHP for a business app (a cron script run from the CLI), and I am struggling with one part of the problem. The file follows this format:
HD: Some text here {text here too}
DC: A description here
DC: the description continues here
DC: and it ends here.
DT: 2012-08-01
HD: Next header here {supplemental text}
... this repeats over and over for a few hundred megabytes
I have to read each line, find the HD: lines, and extract the text from each of them. I then compare this text against data stored in a database. When a match is found, I want to record the DC: lines that follow the matched HD: line.
Pseudo code:
while (the_file_pointer_isnt_end_of_file) {
    line = getCurrentLineFromFile
    title = parseTitleFrom(line)
    matched = searchForMatchInDB(title)
    if (matched) {
        recordTheDCLines // <- Best way to do this?
    }
}
My problem: because I am reading line by line, what is the best way to trigger the script to start saving DC: lines and then, once they are finished, save them to the database?
I have a vague idea, but have yet to properly implement it. I would love to hear the community's ideas and suggestions!
Thank you.
Separate the problem into two scripts. The first plows through the file and stuffs the interesting records into some sort of data store. The second pulls from the data store and processes the records. I suspect this will be much faster than doing everything in one script, if for no other reason than that the two scripts can run as separate processes at the same time, which effectively multi-threads the app.
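As a rough illustration of that split, here is a minimal sketch. It assumes a staging file with one JSON object per line as the "data store", and the function names (extractToStaging, processStaging) and the $isMatch callback that stands in for the database lookup are made up for this example. In practice the two functions would live in two separate scripts; they share a file here only to keep the example short.

```php
<?php
// Stage 1: copy every HD: record and its DC: lines to a staging stream,
// one JSON object per line. No database work happens here.
function extractToStaging($in, $out)
{
    $record = null;
    while (($line = fgets($in)) !== false) {
        $tag = substr($line, 0, 3);
        $text = trim(substr($line, 3));
        if ($tag === 'HD:') {
            if ($record !== null) {
                fwrite($out, json_encode($record) . "\n");
            }
            $record = ['title' => $text, 'dc' => []];
        } elseif ($tag === 'DC:' && $record !== null) {
            $record['dc'][] = $text;
        }
    }
    if ($record !== null) {
        fwrite($out, json_encode($record) . "\n");   // last record in the file
    }
}

// Stage 2: read the staged records and keep those accepted by $isMatch
// (a hypothetical stand-in for the database comparison).
function processStaging($staging, callable $isMatch)
{
    $kept = [];
    while (($row = fgets($staging)) !== false) {
        $record = json_decode($row, true);
        if ($isMatch($record['title'])) {
            $kept[$record['title']] = implode("\n", $record['dc']);
        }
    }
    return $kept;
}

// Usage with in-memory streams instead of real files.
$in = fopen('php://memory', 'w+');
fwrite($in, "HD: One {a}\nDC: x\nDC: y\nDT: 2012-08-01\nHD: Two {b}\nDC: z\n");
rewind($in);
$staging = fopen('php://memory', 'w+');
extractToStaging($in, $staging);
rewind($staging);
$kept = processStaging($staging, function ($title) {
    return strpos($title, 'One') === 0;
});
fclose($in);
fclose($staging);
```

With a real staging file (or a queue or table), the second script can start consuming records while the first is still reading the input.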
Write two functions, or a class LineReader, with the following functions:
- string GetNextLine(): reads the next line from the file
- string PeekLine(): returns the next line from the file, but does not move the file pointer
(You can implement this easily with a line buffer: a string variable that holds one line in advance. GetNextLine has to use that buffer as well as PeekLine.)
Then the implementation of recordDCLines should be something like:
while (substr(PeekLine(), 0, 3) == "DC:")
{
    $line = GetNextLine();
    // process $line, append it to a buffer
}
// here, store the DC block that was found
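EDIT: Some pseudo code. I am not experienced in PHP, but I hope you get the general idea (global variables are used only to keep it short):
function OpenFile($path)
{
    global $handle, $nextline, $endoffile;
    $handle = fopen($path, 'r');
    $nextline = fgets($handle);   // read one line ahead; false if the file is empty
    $endoffile = false;
}
function GetNextLine()
{
    global $handle, $nextline, $endoffile;
    if ($nextline !== false)
    {
        $result = $nextline;
        $nextline = fgets($handle);   // false once no more lines are available
    }
    else
    {
        $endoffile = true;
        $result = "";
    }
    return $result;
}
function PeekLine()
{
    global $nextline;
    return $nextline;   // the buffered line, or false at end of file
}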
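To show how the look-ahead buffer fits into the loop from the question, here is a minimal sketch that wraps the buffer in a class and collects the DC: lines after each matching HD: line. The method names, the collectDescriptions helper, the $isMatch callback (standing in for the database lookup) and the in-memory sample data are all assumptions made for illustration.

```php
<?php
// Hypothetical look-ahead reader: buffers one line so it can be peeked at.
final class LineReader
{
    private $handle;
    private $next;   // buffered line, or false at end of file

    public function __construct($handle)
    {
        $this->handle = $handle;
        $this->next = fgets($handle);
    }

    // Return the next line without consuming it (false at end of file).
    public function peekLine()
    {
        return $this->next;
    }

    // Return the next line and refill the buffer (false at end of file).
    public function getNextLine()
    {
        $line = $this->next;
        if ($line !== false) {
            $this->next = fgets($this->handle);
        }
        return $line;
    }
}

// Collect the DC: text that follows every header accepted by $isMatch.
function collectDescriptions($handle, callable $isMatch)
{
    $reader = new LineReader($handle);
    $blocks = [];
    while (($line = $reader->getNextLine()) !== false) {
        if (strncmp($line, 'HD:', 3) !== 0) {
            continue;   // DT: lines and DC: lines of unmatched headers
        }
        $title = trim(substr($line, 3));
        if (!$isMatch($title)) {
            continue;
        }
        $dc = [];
        while (($peek = $reader->peekLine()) !== false
            && strncmp($peek, 'DC:', 3) === 0) {
            $dc[] = trim(substr($reader->getNextLine(), 3));
        }
        $blocks[$title] = implode(' ', $dc);
    }
    return $blocks;
}

// Usage with in-memory sample data instead of the real file.
$fh = fopen('php://memory', 'w+');
fwrite($fh, "HD: First {a}\nDC: one\nDC: two\nDT: 2012-08-01\nHD: Second {b}\nDC: skipped\n");
rewind($fh);
$result = collectDescriptions($fh, function ($title) {
    return strpos($title, 'First') === 0;   // stand-in for the database lookup
});
fclose($fh);
```

Because peekLine() never consumes the line, the outer loop still sees the DT: or HD: line that ended the DC block.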
Implement a basic state machine. As you read lines, note the last ‘command’ (DC, DT, etc.). When you get an ‘HD’, do your lookup. While you are in a DC state, accumulate the message until the next item isn't a DC entry, at which point you do the write.
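A minimal sketch of such a state machine follows. The function names (parseRecords, readLines) and the $isMatch and $save callbacks, which stand in for the database lookup and the database write, are assumptions for illustration. The state is simply "is there a matched header whose DC lines are being collected".

```php
<?php
// Walk the lines once, tracking whether we are inside a matched DC block.
function parseRecords(iterable $lines, callable $isMatch, callable $save)
{
    $title = null;   // header of the block being recorded, or null
    $buffer = [];    // DC: lines accumulated for that header

    // Write out the current block (if any) and reset the state.
    $flush = function () use (&$title, &$buffer, $save) {
        if ($title !== null) {
            $save($title, implode("\n", $buffer));
        }
        $title = null;
        $buffer = [];
    };

    foreach ($lines as $line) {
        $tag = substr($line, 0, 3);
        $text = trim(substr($line, 3));
        if ($tag === 'DC:') {
            if ($title !== null) {
                $buffer[] = $text;   // still inside a matched block
            }
            continue;
        }
        $flush();                    // any non-DC line ends the block
        if ($tag === 'HD:' && $isMatch($text)) {
            $title = $text;          // enter the recording state
        }
    }
    $flush();                        // the file may end inside a DC block
}

// Yield lines lazily so the whole file is never held in memory.
function readLines($handle)
{
    while (($line = fgets($handle)) !== false) {
        yield $line;
    }
}

// Usage with in-memory sample data instead of the real file.
$fh = fopen('php://memory', 'w+');
fwrite($fh, "HD: Alpha {x}\nDC: first\nDC: second\nDT: 2012-08-01\n"
    . "HD: Beta {y}\nDC: ignored\nHD: Alpha {z}\nDC: last\n");
rewind($fh);
$saved = [];
parseRecords(
    readLines($fh),
    function ($title) { return strpos($title, 'Alpha') === 0; },
    function ($title, $description) use (&$saved) {
        $saved[] = [$title, $description];
    }
);
fclose($fh);
```

Note that a matched header with no DC: lines is still passed to $save with an empty description; add a check in $flush if that is not wanted.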
You could consider writing a PHP extension in C or C++ for that purpose; you could then use low-level but efficient syscalls (e.g. mmap(2), read(2) into a large buffer, readahead(2), etc.).
You could also delegate to a helper program written in C.