I am working on a legacy Java application in which Ghostscript is invoked from the command line to repair PDFs:

    private static File repairLinuxGhostScript(String filePath) throws IOException {
        PerfLog.snapShot();
        tempFile = createTempFileInTeradactorTempDirectory(
                filePath.substring(0, filePath.length() - 4) + "-repaired", ".pdf");
        CommandLine cmd = new CommandLine("gs");
        cmd.addArgument("-o");
        cmd.addArgument(tempFile.getAbsolutePath());
        // Selects output similar to the Acrobat Distiller "Prepress Optimized" (up to version X) setting.
        // https://ghostscript.com/docs/9.54.0/VectorDevices.htm
        cmd.addArgument("-dPDFSETTINGS=/prepress");
        // The pdfwrite, ps2write and eps2write devices create PDF or PostScript files whose visual
        // appearance should match, as closely as possible, the appearance of the original input file.
        // https://ghostscript.com/docs/9.54.0/VectorDevices.htm
        cmd.addArgument("-sDEVICE=pdfwrite");
        cmd.addArgument("-dSubsetFonts=false");
        cmd.addArgument(filePath);
        DefaultExecutor exec = new DefaultExecutor();
        ByteArrayOutputStream outputCapturer = new ByteArrayOutputStream();
        exec.setStreamHandler(new PumpStreamHandler(outputCapturer, System.err));
        System.out.println(cmd.toString());
        int exitValue = exec.execute(cmd);
        // Log the captured stdout as text rather than as a raw byte[] reference.
        LOG.info(outputCapturer.toString());
        if (exitValue != 0) {
            throw new ProcessorServiceException("Ghostscript gs exited with error code " + exitValue);
        }
        PerfLog.log(PerfLog.Phase.IMPORT, PerfLog.SubPhase.REPAIR, "PDF Repair", "GhostScript", "PdfRepairer");
        return tempFile;
    }

We want to get rid of temporary files, so I rewrote the method as follows:

    public static byte[] repairPDF(byte[] inputPDF) throws IOException {
        // Prepare Ghostscript command
        CommandLine cmd = new CommandLine("gs");
        cmd.addArgument("-dPDFSETTINGS=/prepress");
        cmd.addArgument("-sDEVICE=pdfwrite");
        cmd.addArgument("-dSubsetFonts=false");
        cmd.addArgument("-sOutputFile=%stdout"); // Output to stdout
        cmd.addArgument("-");

        // Set up input and output streams
        ByteArrayInputStream inputStream = new ByteArrayInputStream(inputPDF);
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        ByteArrayOutputStream errorStream = new ByteArrayOutputStream();
        PumpStreamHandler streamHandler = new PumpStreamHandler(outputStream, errorStream, inputStream);
        DefaultExecutor executor = new DefaultExecutor();
        executor.setStreamHandler(streamHandler);

        // Execute Ghostscript command
        try {
            executor.execute(cmd); // Send input PDF via stdin
        } catch (ExecuteException e) {
            System.err.println("Execution failed: " + e.getMessage());
            throw new IOException("Ghostscript execution failed", e);
        } catch (IOException e) {
            System.err.println("IO Error: " + e.getMessage());
            throw new IOException("Error processing PDF", e);
        }

        // Handle errors if any
        String errorOutput = errorStream.toString();
        if (!errorOutput.isEmpty()) {
            System.err.println("Ghostscript Error: " + errorOutput);
        }

        // Get the result as a byte array
        return outputStream.toByteArray();
    }

However, the two methods return different results: the PDF produced by the second method is not the same as the PDF produced by the first (some text disappears, as do some images, …).
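One difference between the two invocations that I have not ruled out yet: the first writes with -o, which according to the Ghostscript documentation also implies -dBATCH -dNOPAUSE, while the second writes to %stdout, where the documentation says -q is needed so that Ghostscript's own messages are not mixed into the output stream. Below is a minimal sketch of an adjusted argument list plus a quick sanity check on the captured bytes. The names GsStdoutSketch, stdoutArgs and looksLikePdf are made up for illustration, the sketch uses only the JDK rather than commons-exec, and I have not verified that it restores the missing text and images.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the application.
final class GsStdoutSketch {

    // Arguments for a run that reads the PDF from stdin and writes it to stdout.
    static List<String> stdoutArgs() {
        List<String> args = new ArrayList<String>();
        args.add("gs");
        args.add("-q");        // quiet: suppress Ghostscript's normal startup messages
        args.add("-dBATCH");   // exit after processing instead of waiting for input
        args.add("-dNOPAUSE"); // do not prompt between pages
        args.add("-sDEVICE=pdfwrite");
        args.add("-dPDFSETTINGS=/prepress");
        args.add("-dSubsetFonts=false");
        args.add("-sOutputFile=%stdout");
        args.add("-");         // read the input from stdin
        return args;
    }

    // True if the captured output begins with the PDF header "%PDF-".
    static boolean looksLikePdf(byte[] output) {
        byte[] header = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (output == null || output.length < header.length) {
            return false;
        }
        for (int i = 0; i < header.length; i++) {
            if (output[i] != header[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With commons-exec, each element after "gs" would be passed to cmd.addArgument(...) as in the method above. If looksLikePdf returns false for the captured bytes, something other than the PDF was written to stdout first.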
I also read here on Stack Overflow that Ghostscript needs to read PDF files from disk anyway, so removing the temporary files on my end would not make much difference, since Ghostscript would create them itself (source: /a/49880072/12171869).
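If temporary files turn out to be unavoidable, a fallback I am considering is to keep the byte[] in, byte[] out signature and confine the temporary files to one helper that always deletes them. A rough sketch using only java.nio.file; TempFileRepairSketch, RepairStep and withTempFiles are hypothetical names, and the step is where the existing file-based Ghostscript call would go:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of the application.
final class TempFileRepairSketch {

    // The actual repair, e.g. running Ghostscript with "in" as input and "out" as output.
    interface RepairStep {
        void run(Path in, Path out) throws IOException;
    }

    // Writes the input to a temp file, runs the step, reads the result back,
    // and deletes both temp files even if the step throws.
    static byte[] withTempFiles(byte[] inputPdf, RepairStep step) throws IOException {
        Path in = Files.createTempFile("repair-in-", ".pdf");
        Path out = null;
        try {
            out = Files.createTempFile("repair-out-", ".pdf");
            Files.write(in, inputPdf);
            step.run(in, out);
            return Files.readAllBytes(out);
        } finally {
            Files.deleteIfExists(in);
            if (out != null) {
                Files.deleteIfExists(out);
            }
        }
    }
}
```

This does not remove the temporary files, it only keeps them out of the calling code, so it is not really what I am after.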
My question: is there a way to achieve the same result as the original method without creating temporary files? I have looked at PDFBox but have not gotten the desired result (note that I am limited to PDFBox 1.8.11, since this is a legacy application).
Thank you.