I need to compare two texts and return the differences as JSON. The JSON should contain chunks of unchanged text as well as chunks that pair the original text with its edited version. Here is an example of the input and the desired output:
Example Desired Output:
{
    "chunks": [
        {"text": "Known as the “King of Pop,” Michael Jackson was a "},
        {"text": "bet-selling", "edit": "best-selling"},
        {"text": " American singer, songwriter, and dancer. As a child, Jackson "},
        {"text": "becamv", "edit": "became"},
        {"text": " the lead singer of his family’s popular Motown group, the Jackson 5. H..."}
    ]
}
Two Texts:
$text1 = "Known as the “King of Pop,” Michael Jackson was a bet-selling American singer, songwriter, and dancer. As a child, Jackson becamv the lead singer of his family’s popular Motown group, the Jackson 5. H...";
$text2 = "Known as the “King of Pop,” Michael Jackson was a best-selling American singer, songwriter, and dancer. As a child, Jackson became the lead singer of his family’s popular Motown group, the Jackson 5. H...";
What I have tried
- jfcherng/php-diff library (https://github.com/jfcherng/php-diff): I tried using this library with a custom renderer to achieve the desired output. However, it didn’t handle unchanged text chunks properly and didn’t split the changes at the word level as needed.
- Unix diff via shell_exec: I tried running Unix diff via shell_exec and parsing its output. The results were not in the expected format and included additional metadata not relevant to my use case.
- Manual text splitting: I tried manually splitting the texts into words and comparing them, but this approach lacked efficiency and accuracy.
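For reference, a word-level comparison of the kind described in the last item can be sketched as follows. This is a minimal sketch under stated assumptions, not the code that was actually tried: tokens are split on whitespace only (so "bet-selling" stays one token), the comparison uses a plain longest-common-subsequence table, and the function names `tokenize` and `diffChunks` are hypothetical.

```php
<?php
// Hypothetical helper: split text into words, keeping each whitespace run
// as its own token so the original text can be rebuilt exactly.
function tokenize(string $text): array
{
    return preg_split('/(\s+)/u', $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
}

// Hypothetical helper: word-level diff built on an LCS table.
// Returns an array shaped like the desired output: ['chunks' => [...]].
function diffChunks(string $old, string $new): array
{
    $a = tokenize($old);
    $b = tokenize($new);
    $n = count($a);
    $m = count($b);

    // $lcs[$i][$j] = LCS length of the token suffixes $a[$i..] and $b[$j..].
    $lcs = array_fill(0, $n + 1, array_fill(0, $m + 1, 0));
    for ($i = $n - 1; $i >= 0; --$i) {
        for ($j = $m - 1; $j >= 0; --$j) {
            $lcs[$i][$j] = $a[$i] === $b[$j]
                ? $lcs[$i + 1][$j + 1] + 1
                : max($lcs[$i + 1][$j], $lcs[$i][$j + 1]);
        }
    }

    $chunks = [];
    $same = $del = $ins = '';

    // Flush the pending changed run as one {"text", "edit"} chunk.
    $flushChange = function () use (&$chunks, &$del, &$ins): void {
        if ($del !== '' || $ins !== '') {
            $chunks[] = ['text' => $del, 'edit' => $ins];
            $del = $ins = '';
        }
    };
    // Flush the pending unchanged run as one {"text"} chunk.
    $flushSame = function () use (&$chunks, &$same): void {
        if ($same !== '') {
            $chunks[] = ['text' => $same];
            $same = '';
        }
    };

    $i = $j = 0;
    while ($i < $n || $j < $m) {
        if ($i < $n && $j < $m && $a[$i] === $b[$j]) {
            $flushChange();
            $same .= $a[$i];
            ++$i;
            ++$j;
        } elseif ($j < $m && ($i === $n || $lcs[$i][$j + 1] >= $lcs[$i + 1][$j])) {
            $flushSame();
            $ins .= $b[$j++]; // token exists only in the new text
        } else {
            $flushSame();
            $del .= $a[$i++]; // token exists only in the old text
        }
    }
    $flushSame();
    $flushChange();

    return ['chunks' => $chunks];
}
```

Calling `json_encode(diffChunks($text1, $text2), JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES)` would then produce the JSON. The table needs memory proportional to the product of the two token counts, so this sketch is only suitable for short texts.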
Current Code Implementation
Here is the current implementation using the jfcherng/php-diff library:
<?php

use Jfcherng\Diff\Differ;
use Jfcherng\Diff\Renderer\AbstractRenderer;
use Jfcherng\Diff\SequenceMatcher;

class CustomJsonTextRenderer extends AbstractRenderer
{
    public const INFO = [
        'desc' => 'Custom renderer',
        'type' => 'custom',
    ];

    public const IS_TEXT_RENDERER = true;

    public function getResultForIdenticalsDefault(): string
    {
        return '[]';
    }

    protected function renderWorker(Differ $differ): string
    {
        $diffs = $differ->getGroupedOpcodes();
        $result = [];

        foreach ($diffs as $hunk) {
            foreach ($hunk as [$tag, $i1, $i2, $j1, $j2]) {
                if ($tag === SequenceMatcher::OP_EQ) {
                    // Unchanged lines: emit a chunk without an "edit" key.
                    $result[] = [
                        'text' => implode('', array_slice($differ->getOld(), $i1, $i2 - $i1))
                    ];
                } elseif ($tag === SequenceMatcher::OP_REP) {
                    // Replaced lines: re-diff them at the word level.
                    $oldText = implode('', array_slice($differ->getOld(), $i1, $i2 - $i1));
                    $newText = implode('', array_slice($differ->getNew(), $j1, $j2 - $j1));
                    $result = array_merge($result, $this->splitIntoChunks($oldText, $newText));
                } elseif ($tag === SequenceMatcher::OP_DEL) {
                    $result[] = [
                        'text' => implode('', array_slice($differ->getOld(), $i1, $i2 - $i1)),
                        'edit' => ''
                    ];
                } elseif ($tag === SequenceMatcher::OP_INS) {
                    // Fixed: the slice length was "$j2 - $j2", which is always 0.
                    $result[] = [
                        'text' => '',
                        'edit' => implode('', array_slice($differ->getNew(), $j1, $j2 - $j1))
                    ];
                }
            }
        }

        return json_encode($result, JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
    }

    protected function splitIntoChunks(string $oldText, string $newText): array
    {
        $chunks = [];
        // Split at word boundaries and whitespace runs, keeping the whitespace tokens.
        $oldWords = preg_split('/(\b|\s+)/u', $oldText, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
        $newWords = preg_split('/(\b|\s+)/u', $newText, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
        $differ = new Differ($oldWords, $newWords);
        $diffs = $differ->getGroupedOpcodes();

        foreach ($diffs as $hunk) {
            foreach ($hunk as [$tag, $i1, $i2, $j1, $j2]) {
                if ($tag === SequenceMatcher::OP_EQ) {
                    $chunks[] = [
                        'text' => implode('', array_slice($differ->getOld(), $i1, $i2 - $i1))
                    ];
                } elseif ($tag === SequenceMatcher::OP_REP) {
                    // Emit the replaced run as one chunk so that no tokens are
                    // dropped when the old and new sides differ in length.
                    $chunks[] = [
                        'text' => implode('', array_slice($differ->getOld(), $i1, $i2 - $i1)),
                        'edit' => implode('', array_slice($differ->getNew(), $j1, $j2 - $j1))
                    ];
                } elseif ($tag === SequenceMatcher::OP_DEL) {
                    for ($i = $i1; $i < $i2; ++$i) {
                        $chunks[] = [
                            'text' => $differ->getOld()[$i],
                            'edit' => ''
                        ];
                    }
                } elseif ($tag === SequenceMatcher::OP_INS) {
                    for ($j = $j1; $j < $j2; ++$j) {
                        $chunks[] = [
                            'text' => '',
                            'edit' => $differ->getNew()[$j]
                        ];
                    }
                }
            }
        }

        return $chunks;
    }
}
Request for Help
Does anyone know of an existing library that provides this functionality out of the box?
Any suggestions on improving the current implementation to handle unchanged text correctly and split the changes at the word level?
Any help or guidance would be greatly appreciated. Thank you!