Contextual chunking of PDF content: having a problem replicating the logic to nest headings and subheadings while parsing PDFs

def format(json_data):
    """
    Extracts document title, headings, subheadings (if present), and content in a specific JSON format.

    Args:
        json_data: A dictionary containing the parsed JSON data of the Google Doc.

    Returns:
        A list containing a dictionary with the document title and another dictionary
        for each heading with its subheading (if any) and content.
    """
    extracted_data = []
    current_heading = None
    current_subheading = None
    current_heading_level = None
    stack = []

    for element in json_data["body"]["content"]:
        if "paragraph" in element:
            paragraph = element["paragraph"]
            paragraph_style = paragraph.get("paragraphStyle", {})
            element_type = paragraph_style.get("namedStyleType")

            if element_type == "TITLE":
                title = paragraph["elements"][0].get("textRun", {}).get("content", "")
                extracted_data.append({"title": title.strip()})
            elif element_type and element_type.startswith("HEADING"):
                heading_text = paragraph["elements"][0].get("textRun", {}).get("content", "").strip()
                heading_level = int(element_type.split("_")[-1])

                if current_heading is None:
                    # This is the first heading encountered, so it's the main heading
                    current_heading = heading_text
                    current_heading_level = heading_level
                    current_entry = {
                        "heading": current_heading,
                        "sub-headings": [],
                        "content": []
                    }
                    extracted_data.append(current_entry)
                    stack.append((current_entry, heading_level))
                else:
                    # Determine where to place the new heading
                    while stack and stack[-1][1] >= heading_level:
                        stack.pop()
                    
                    current_entry = {
                        "heading": heading_text,
                        "sub-headings": [],
                        "content": []
                    }

                    if stack:
                        stack[-1][0]["sub-headings"].append(current_entry)
                    else:
                        extracted_data.append(current_entry)

                    stack.append((current_entry, heading_level))
            else:
                content_text = paragraph["elements"][0].get("textRun", {}).get("content", "").strip()
                if content_text and stack:
                    stack[-1][0]["content"].append(content_text)

    return extracted_data

The main aim of this code is to format and structure a document (PDF/Google Doc) contextually, so that we can feed it to an LLM in a RAG framework for proper contextual grounding and, in return, expect precise answers to the questions asked.

By contextually, I mean I am trying to structure the document into a JSON according to its headings, subheadings, and their respective content.

The above function parses and formats a Google Document using the Google Docs API. The API can return a JSON representation for any Google document ID, and my function converts that into the format below:

  {
    "title": "Improving Performance of OpenCL on CPUs",
    "size": 0.05078125,
    "word_count": 0,
    "char_count": 0
  },
  {
    "heading": "Compiler Construction",
    "sub-headings": [],
    "content": [
      "21st International Conference, CC 2012",
      "(Summarized till Exploiting Uniform Computations of the article)"
    ],
    "size": 0.1748046875,
    "word_count": 2,
    "char_count": 2
  },
  {
    "heading": "Abstract",
    "sub-headings": [],
    "content": [
      "Data-parallel languages like OpenCL and CUDA are an important means to exploit the computational power of todayu2019s computing devices.",
      "In this paper, we deal with two aspects of implementing such languages on CPUs: First, we present a static analysis and an accompanying optimization to exclude code regions from control-flow to data flow conversion, which is the commonly used technique to leverage vector",
      "instruction sets.",
      "Second, we present a novel technique to implement barrier synchronization. We evaluate our techniques in a custom OpenCL CPU driver which is compared to itself in different configurations and to proprietary implementations by AMD and Intel. We achieve an average speedup factor of 1.21 compared to nau00a8u0131ve vectorization and additional factors of 1.15 -- 2.09 for suited kernels due to the optimizations enabled by our analysis. Our best configuration achieves an average speedup factor of over 2.5 against the Intel driver"
    ],
    "size": 1.00390625,
    "word_count": 4,
    "char_count": 4
  },
  {
    "heading": "Summary",
    "sub-headings": [
      {
        "heading": "Another Introduction Subheading-2",
        "sub-headings": [],
        "content": [
          "The introduction section of the article delves into the optimization of data-parallel languages, specifically focusing on achieving maximum performance of OpenCL on CPUs through whole-function vectorization of kernels. It emphasizes that while"
        ]
      },
      {
        "heading": "Another Introduction Subheading-3",
        "sub-headings": [
          {
            "heading": "Subheading-4",
            "sub-headings": [],
            "content": [
              "The article introduces key techniques aimed at reducing overhead based on the analysis of divergent control flow. By analyzing divergent control flow, the article aims to identify areas where control-flow to data-flow conversion can be minimized, thereby improving performance. The discussion revolves around the importance of retaining control flow and values where possible to optimize performance effectively.",
              "One of the primary techniques highlighted in the introduction is control-flow linearization. This technique involves building regions of divergent blocks using a depth-first search on the control-flow graph (CFG). By identifying varying branches and marking active regions, the algorithm aims to exclude non-divergent control-flow structures, thereby reducing unnecessary overhead. The article emphasizes the significance of this technique in optimizing data-parallel languages like OpenCL and CUDA on CPUs.",
              "Furthermore, the introduction touches upon the challenges associated with formal proofs of transformations in data-parallel languages. It acknowledges the absence of a formal semantics for languages like OpenCL, which hinders the ability to provide formal proofs of correctness for the proposed optimizations. Despite this limitation, the article asserts the success of the techniques through practical benchmarking and performance evaluations, showcasing their superiority over existing proprietary drivers by Intel and AMD.",
              "The introduction also sets the stage for discussing code generation techniques to address the inherent overhead in data-parallel languages. By integrating parts of the driver code into the kernel and utilizing a synchronization scheme based on continuations, the article aims to enable aggressive optimizations. These techniques are designed to enhance the performance of data-parallel languages on CPUs and have demonstrated success across a variety of benchmarks, outperforming proprietary drivers by leading industry players.",
              "In summary, the introduction provides a comprehensive overview of the challenges and strategies involved in optimizing data-parallel languages on CPUs. It underscores the importance of targeted optimizations, control-flow linearization, and divergence analysis in reducing overhead and maximizing performance. The article sets the foundation for exploring key techniques and methodologies that contribute to the successful optimization of OpenCL and CUDA on CPUs, ultimately leading to significant performance improvements in benchmarking scenarios."
            ]
          }
        ],
        "content": [
          "vectorizing code is a crucial technique for enhancing performance, a naive approach of vectorizing all code can lead to significant overhead due to the conversion of control-flow to data-flow. This overhead can limit the benefits of vectorization, highlighting the need for targeted optimizations to mitigate these challenges."
        ]
      }
    ],
    "content": [
      "Hello fbbvbcehgf - content content content content content content content content content content content content content content content content content"
    ],
    "size": 3.4716796875,
    "word_count": 15,
    "char_count": 15
  },
  {
    "heading": "OpenCL Driver Implementation:",
    "sub-headings": [],
    "content": [
      "The OpenCL Driver Implementation section delves into the strategies employed to enhance the efficiency of an OpenCL driver. The compilation scheme outlined in this section includes several key steps aimed at optimizing the driver's performance. These steps are detailed in the subsequent subsections."
    ],
    "size": 0.3720703125,
    "word_count": 1,
    "char_count": 1
  },

(snippet of the JSON returned)

You can refer to the document here: https://docs.google.com/document/d/1jweGSTkDI3lASzm2SVw7AmLeqUA6_JmVacpMFcZJh4Y/edit?usp=sharing

Here, we can distinguish between headings and subheadings in a Google Doc because HEADING_1 > HEADING_2 > HEADING_3 > … in terms of size.

Now the main logic here is quite simple:

As we parse the JSON returned by the API, we keep a marker that identifies headings. If the next heading encountered (HEADING_2) is smaller than the first heading (HEADING_1), it is appended as a sub-heading. There can be any number of sub-headings, as long as each is smaller than the previously marked heading (the marker keeps moving through the JSON). The next top-level heading starts only when the marker finds a heading of the same size as the first one encountered (HEADING_1), i.e. one that is not a sub-heading. This process continues page by page as the JSON is parsed and formatted. The code should also handle nested sub-headings according to the same logic.

We are using a stack to keep track of the chain of headings encountered so far.
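The stack-based nesting described above can be sketched in isolation. This is a minimal, self-contained toy (the `nest_headings` helper and the synthetic `(kind, level, text)` tuples are illustrative, not part of my actual code) showing how popping the stack down to a strictly higher-level heading yields arbitrarily deep nesting:

```python
def nest_headings(items):
    """items: list of ("H", level, text) heading tuples or ("P", text) paragraphs."""
    extracted, stack = [], []
    for item in items:
        if item[0] == "H":
            _, level, text = item
            # Pop until the stack top is a heading with a strictly smaller level
            while stack and stack[-1][1] >= level:
                stack.pop()
            entry = {"heading": text, "sub-headings": [], "content": []}
            if stack:
                stack[-1][0]["sub-headings"].append(entry)  # nest under parent
            else:
                extracted.append(entry)  # top-level heading
            stack.append((entry, level))
        elif stack:
            # Plain text attaches to the most recent heading on the stack
            stack[-1][0]["content"].append(item[1])
    return extracted

doc = [("H", 1, "Summary"), ("H", 2, "Sub-2"), ("P", "body"),
       ("H", 3, "Sub-3"), ("H", 1, "Next top-level")]
result = nest_headings(doc)
```

Here `Sub-3` ends up nested inside `Sub-2`, which sits inside `Summary`, and the second level-1 heading pops everything and starts a new top-level entry.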

You can refer to the Google Docs API JSON format here: https://developers.google.com/docs/api/samples/output-json

The Google Docs parsing has worked as I expected. Now I am trying to replicate the same logic for PDF processing using pdfplumber, where, to identify headings/subheadings, I assume that headings and subheadings are bold.

With pdfplumber we can identify bold characters, as well as each character's size and font name, which helps us distinguish headings from subheadings.
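For example, a line's dominant font can be read off pdfplumber's per-character metadata. This is only a sketch: the line dicts below are hand-made stand-ins for what `page.extract_text_lines()` returns (each char dict carries `fontname` and `size`), and the font names are made up:

```python
from collections import Counter

def dominant_font(line):
    """Return the (fontname, size) pair covering most characters in the line."""
    counts = Counter((c["fontname"], round(c["size"], 1)) for c in line["chars"])
    return counts.most_common(1)[0][0]

def is_bold(line):
    # Font-name heuristic only; embedded fonts usually carry "Bold"/"Black"
    # in their postscript name, but this is not guaranteed for every PDF.
    name, _ = dominant_font(line)
    return "Bold" in name or "Black" in name

# Hand-made stand-ins for pdfplumber's extract_text_lines() output
heading_line = {"text": "Aperture",
                "chars": [{"fontname": "ABCDEF+Lato-Bold", "size": 18.0}
                          for _ in "Aperture"]}
body_line = {"text": "This is the opening in the lens.",
             "chars": [{"fontname": "ABCDEF+Lato-Regular", "size": 10.0}
                       for _ in "This is the opening in the lens."]}
```

Using the dominant font rather than the first bold char avoids misclassifying a body line that merely contains one bold word.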

Now I am having a problem replicating the above logic in PDF processing with pdfplumber.

This is what I have tried so far:

def process_page(page, stack, extracted_data, marker_fontname, marker_fontsize):
    lines = page.extract_text_lines()
    current_paragraph = []

    for line in lines:
        line_text = line["text"].replace("-", "").strip()
        if not line_text:
            continue

        # Check for font sizes and names to identify headings and subheadings
        line_fonts = [(char["fontname"], char["size"]) for char in line["chars"]]
        bold_fonts = [(char["fontname"], char["size"]) for char in line["chars"] if "Bold" in char["fontname"]]

        if bold_fonts:
            fontname, fontsize = bold_fonts[0]
            is_bold = True
        else:
            fontname, fontsize = line_fonts[0]
            is_bold = False

        if is_bold:
            if marker_fontname is None or marker_fontsize is None:
                heading_level = 1
                marker_fontname = fontname
                marker_fontsize = fontsize
            else:
                if fontname == marker_fontname and fontsize == marker_fontsize:
                    heading_level = 1
                elif fontsize < stack[-1][1]:
                    heading_level = stack[-1][1] + 1
                else:
                    heading_level = stack[-1][1]

            if current_paragraph:
                stack[-1][0]["content"].append("n".join(current_paragraph))
                current_paragraph = []

            while stack and stack[-1][1] >= heading_level:
                stack.pop()

            current_entry = {
                "heading": line_text,
                "sub-headings": [],
                "content": []
            }

            if stack:
                # Nest under the last heading in the stack
                stack[-1][0]["sub-headings"].append(current_entry)
            else:
                # No stack, add as top-level entry
                extracted_data.append(current_entry)

            stack.append((current_entry, heading_level))

        else:
            if line_text:
                current_paragraph.append(line_text)

    if current_paragraph:
        stack[-1][0]["content"].append("n".join(current_paragraph))

    # Return the processed formatted extracted data
    return extracted_data

This is a snippet of the output JSON:

[
    {
        "heading": "V I T C I N T R A M U N ' 2 2",
        "sub-headings": [],
        "content": [
            "International Press  PhotojournalismnBackground Guide"
        ]
    },
    {
        "heading": "LETTER FROM EXECUTIVE BOARD",
        "sub-headings": [],
        "content": [
            "Hello and welcome to all the creative minds,nPhotojournalism is a vital branch of the PRESS which connects people to thenhappenings of their surroundings through art. In the realworld, this particularntask is extremely challenging because of several political factors. In thisnsimulation of the international press, the photojournalists will be given thenutmost freedom to express their ideas and opinions through the art form.nThis simulation aims to achieve the following goals at the end of thenconference:nEducate the photographers about the significance and nuances ofnphotojournalism.nTrigger the creative minds to achieve the bigger goal of producing a specificnkind of art where the photographs deliver strong messages.nTo help the photographer learn, improve and enjoy what they already lovendoing clicking pictures.nThis simulation will include extensive discussion sessions where thenphotographers can interact with one another and with the executive board, asnan attempt to share knowledge. These sessions will include individual feedbacknfor each photojournalist after each submission.nHope all the photographs take away an enormous amount of knowledge andnlearning from this simulation.nYours sincerely,nAbdul RaziqnHead of photographynIP | PAGE 2"
        ]
    },
    {
        "heading": "CAMERA BASICS FOR BEGINNERS",
        "sub-headings": [],
        "content": []
    },
    {
        "heading": "What is the Exposure Triangle?",
        "sub-headings": [],
        "content": [
            "Aperture, shutter speed, and ISO—otherwise known as the "Exposure Triangle"—nare a trio of camera functions that are responsible for the exposure (the level ofnlight and dark tones) in your image.nThese three functions play a large role in the overall look and effect of yournimage, and mastering their use is essential to becoming a better photographer,nespecially when shooting in manual mode.nThe exposure triangle consists of three fundamentals that can be controlled tonget the expected exposure. These three fundamentals are;"
        ]
    },
    {
        "heading": "Aperture",
        "sub-headings": [],
        "content": [
            "This is the opening in the lens that allows light to reach the sensor. The widernthe opening, the more light that enters. The size of the aperture is measured innfocal stops or "fstops," and is based on the diameter of the hole through whichnlight enters the camera. The smaller the hole, the higher the fstop number, andnconsequently, the darker the image.nAside from controlling the brightness of the photo, aperture is also used tondetermine depth of field.nWide Aperture (Low fstop number): If you want an image where the subject isnin sharp focus but the surrounding area is blurred, you'll need a wide aperture,nor a low fstop number. This is best for photography where you want thensubject to stand out, such as portrait, wildlife, or sports photography.nSmaller Aperture (High fstop number): Conversely, if you want more of thenarea around the subject to be brought to focus, you'll need to shoot with ansmaller aperture or a higher fstop number, which will expand the depth ofnfield. This is best for photography where you want to focus on backgrounds,nsuch as landscape and nature photography.nIP | PAGE 3"
        ]
    },
    {
        "heading": "Shutter Speed",
        "sub-headings": [],
        "content": [
            "Shutter Speed is the speed with which the camera’s shutter opens and closes.nFast shutter speeds allow photographers to freeze motion while a slowernshutter speed is used to blur motion. The former is helpful for situations wherenyou want the subject to be captured perfectly midmotion, while the latter isngreat for capturing the motion itself, such as the look of a speeding train, or thengradual shift of stars in the sky.nShutter Speed Avoiding Unwanted Blur: Below a certain shutter speed, a tripodnor stabilizer may be necessary to maintain a sharp image. Especially if you havenshaky hands, camera movement—no matter how minimal—can cause the imagento blur. Unless your lens has builtin stabilization, the minimum setting fornkeeping your handheld photos sharp varies depending on the focal length ofnyour lens. In short, the larger the focal length, the greater the risk of cameranshake."
        ]
    },
    {
        "heading": "ISO",
        "sub-headings": [],
        "content": [
            "Lastly, the final side of the triangle is the ISO. This is what affects thensensitivity of the sensor to light. Generally, a higher ISO is used during lowlightnsituations, allowing the sensor to absorb as much of the available light asnpossible. On the flip side, in order to avoid overexposure, a lower ISO is usednwhen photographing scenes with lots of light.nOne thing to be wary of—especially when shooting in lowlight situations—isnthat higher ISOs can result in a noisy image. What we mean by "noise" is anspeckled, grainy appearance on your photo. Unless grainy is the look you'rengoing for, noise reduces the quality of the photo. As such, it's best to keep yournISO as low as possible to keep your image sharp and crisp. If that's not possible,none solution could be to use accessories that create artificial light in lowlightnscenes, such as flash."
        ]
    },
    {
        "heading": "Managing Exposure",
        "sub-headings": [],
        "content": [
            "As with all things, balance is necessary when fiddling with the exposurensettings. OVEREXPOSURE can result in an image that is washed out or overlynbright. UNDEREXPOSURE can result in a dark photo with a lack of detail. Thenbest level of exposure is one in which the image looks natural, with a balance ofnboth light and dark areas. Most digital cameras come with the following toolsnto help you manage your exposure; Exposure or Light Meter, ExposurenCompensation and Histogram."
        ]
    },
    {
        "heading": "BASIC COMPOSITION RULES",
        "sub-headings": [],
        "content": []
    },

And this is the PDF I am trying to parse: https://drive.google.com/file/d/11ScASeg7gLiXSdaUdl3nOOO0qCKajVhm/view?usp=sharing

As you can see, the subheadings are not being nested properly.

I know there are many services that can do this job, but I am trying to solve it without any external service/API.

