I’m working on a C# project that involves parsing WordPerfect documents to identify and process nested forms. The program is called WP_Mapper, and it uses the WP_Reader library to parse WordPerfect documents in the WP6x format. The goal is to create a dependency map of parent/child links between different forms. I’ve attached Git links to both my project and the WP_Reader project I am using.
Program Overview
- The user selects an initial WordPerfect file to map.
- The program parses the document and looks for any nested forms using the <merge> tags.
- If nested forms are found, the program recursively parses those forms and builds a visual tree structure showing the dependencies.
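The recursive step above can be sketched on its own, separate from the WordPerfect parsing. This is only a minimal sketch under stated assumptions: FormNode, DependencyMap, and the getNestedPaths callback are hypothetical names (not part of WP_Mapper or WP_Reader), and the guard against a form that nests one of its own ancestors is an addition the program described here does not have.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a TreeNode: a form path plus its nested forms.
public sealed class FormNode
{
    public string Path { get; }
    public List<FormNode> Children { get; } = new List<FormNode>();
    public FormNode(string path) { Path = path; }
}

public static class DependencyMap
{
    // getNestedPaths is a hypothetical callback that returns the nested form
    // paths found in one document (for example, by parsing it with WP_Reader).
    public static FormNode Build(string rootPath, Func<string, IEnumerable<string>> getNestedPaths)
    {
        var ancestors = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        return BuildNode(rootPath, getNestedPaths, ancestors);
    }

    private static FormNode BuildNode(
        string path,
        Func<string, IEnumerable<string>> getNestedPaths,
        HashSet<string> ancestors)
    {
        var node = new FormNode(path);

        // If this form is already one of its own ancestors, record it as a
        // leaf and stop, so a circular reference cannot recurse forever.
        if (!ancestors.Add(path))
        {
            return node;
        }

        foreach (string childPath in getNestedPaths(path))
        {
            node.Children.Add(BuildNode(childPath, getNestedPaths, ancestors));
        }

        // Leaving this branch: the form may legitimately appear elsewhere in the tree.
        ancestors.Remove(path);
        return node;
    }
}
```

Keeping the tree building separate from the parsing means the recursion can be checked with a plain dictionary of links before any real documents are involved.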
My code (also on Git at the link above):
using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using System.Windows.Forms;
using WP_Reader;

namespace WP_Mapper
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void btnSelectFile_Click(object sender, EventArgs e)
        {
            using (OpenFileDialog openFileDialog = new OpenFileDialog())
            {
                openFileDialog.Filter = "WordPerfect files (*.wpd;*.frm)|*.wpd;*.frm|All files (*.*)|*.*";
                if (openFileDialog.ShowDialog() == DialogResult.OK)
                {
                    string filePath = openFileDialog.FileName;
                    WP6Document doc = new WP6Document(filePath);
                    TreeNode rootNode = new TreeNode(filePath);
                    treeViewDocuments.Nodes.Add(rootNode);
                    ParseDocument(doc, rootNode);
                }
            }
        }

        private void treeViewDocuments_AfterSelect(object sender, TreeViewEventArgs e)
        {
            // Handle tree view selection event if needed
        }

        private void ParseDocument(WP6Document doc, TreeNode parentNode)
        {
            // Accumulate the entire document content into a single string
            string documentContent = string.Empty;
            foreach (WPToken token in doc.documentArea.WPStream)
            {
                if (token is WP_Reader.Character character)
                {
                    documentContent += character.content;
                }
                else if (token is WP_Reader.Function function)
                {
                    documentContent += $"<{function.name}>";
                }
            }

            // Log the accumulated document content for debugging
            string logFilePath = @"C:\path\to\output\documentContent.txt";
            File.WriteAllText(logFilePath, documentContent);
            MessageBox.Show($"Document content logged to: {logFilePath}");

            // Use a flexible regex to find <merge> tags
            var mergeRegex = new Regex(@"<merge>(.*?)</merge>", RegexOptions.Singleline);
            var matches = mergeRegex.Matches(documentContent);
            if (matches.Count == 0)
            {
                MessageBox.Show("No <merge> tags found.");
                return;
            }

            foreach (Match match in matches)
            {
                string nestedFilePath = match.Groups[1].Value;

                // Debug output
                Console.WriteLine($"Found nested form: {nestedFilePath}");
                MessageBox.Show($"Found nested form: {nestedFilePath}");

                TreeNode childNode = new TreeNode(nestedFilePath);
                parentNode.Nodes.Add(childNode);

                // Recursively parse the nested form
                try
                {
                    if (File.Exists(nestedFilePath))
                    {
                        WP6Document nestedDoc = new WP6Document(nestedFilePath);
                        ParseDocument(nestedDoc, childNode);
                    }
                    else
                    {
                        MessageBox.Show($"Nested file not found: {nestedFilePath}");
                    }
                }
                catch (Exception ex)
                {
                    MessageBox.Show($"Error parsing nested document: {ex.Message}");
                }
            }
        }
    }
}
The raw parsed text that WP_Reader gets from the initial selected file is below. In it, you can clearly see the part that has [File Path of Nested Form That I Want To Map]. Ideally, if there were three levels of nested forms (document 1 has document 2 nested in it, and document 2 has document 3 nested in it), then it would parse each one and create the parent/child node graph. I just for the life of me cannot get it to find the second file using the parsed text:
<global_on><set_language><global_off><check_as_you_go>TEST<hard_eol><hard_eol><check_as_you_go><merge>C:UsersdylanDesktopTEST1A.frm<merge><check_as_you_go><hard_eol><hard_eol>TEST<left_tab>
If anyone can help, it would be much appreciated. I feel like I am missing something right in front of my face, and it is driving me up a wall.
What I've tried:
- Confirmed that WP_Reader successfully parses the document and that the tags with file paths are present in the raw parsed text.
- Logged the raw parsed text to a file and visually inspected it. The tag and file path don’t appear to have any obvious formatting problems.
- Simplified the regex pattern to <merge>([^<]+)</merge> to make it more flexible and less restrictive. Updated the pattern to handle potential variations in tag structure and whitespace: <merge\s*r=""([^""]+)""\s*/?>. Simplified it again to match the exact structure of the example output: <merge>(.*?)</merge>.
- Attempted to exclude irrelevant tags and only accumulate meaningful content.
- Identified issues with unexpected tags like <soft_space> and <left_tab>. Refined the accumulation logic to exclude these tags and focus on <merge> tags.
- Tried to get both Google Gemini and ChatGPT to figure it out, to no avail.
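One detail in the logged text may be worth checking against the patterns listed above. The accumulation loop writes every function token as <name>, so the file path ends up between two identical <merge> tokens and the text never contains a </merge>. A pattern that ends in </merge> therefore has nothing to match. The following is a minimal sketch, assuming the second <merge> token marks the end of the path; MergeFinder and FindMergePaths are hypothetical names, not part of WP_Reader.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MergeFinder
{
    // Matches text enclosed by two <merge> tokens. The closing token has no
    // slash because the accumulation loop emits <name> for every function token.
    private static readonly Regex MergePair =
        new Regex(@"<merge>(.*?)<merge>", RegexOptions.Singleline);

    // Returns every non-empty path found between a pair of <merge> tokens.
    public static List<string> FindMergePaths(string documentContent)
    {
        var paths = new List<string>();
        foreach (Match match in MergePair.Matches(documentContent))
        {
            string path = match.Groups[1].Value.Trim();
            if (path.Length > 0)
            {
                paths.Add(path);
            }
        }
        return paths;
    }
}
```

In ParseDocument, the list returned by FindMergePaths(documentContent) could take the place of the matches loop. If WP_Reader exposes a way to tell the start of a merge command from its end, matching on that would likely be more robust than relying on the tokens arriving in pairs, but that part of the library's API is not confirmed here.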