Document Segmentation

In many situations, the documents you are processing contain far more information than the part you are interested in. Consider these examples:

  • a PDF produced by a flat-bed scanner; each PDF page is actually many documents strewn about the scanner glass
  • a 400-page report; each page contains a header and footer that need removing so that they aren't mistaken for the text body
  • a newspaper website whose main content is cluttered with advertising and navigational elements
  • a PDF containing a paystub for everyone in your department; each page is really a completely separate record

If your input documents are complex in these ways, you may need to perform segmentation. Segmentation is the process of cutting up a document into pieces so that you can use those pieces separately (or throw some away).

How you approach segmentation will fall into one of three buckets.

Case 1: It doesn't matter

Perhaps you are only trying to classify the document type, or to extract key names and fields mentioned in a document. In that case, you might be able to ignore the extra information in the document entirely.

You're in this category if the following conditions hold true:

  1. You're not concerned with "competing information"
    Competing information arises when your input contains multiple candidates that compete to become the output of your model. For example, a PDF might contain multiple pages with paystubs on them. How would your model know which one to use?
  2. You're not concerned with "junk information"
    Junk information is extraneous content that might be mixed in with your model output. For example, a page header in a multi-page PDF might fall right in the middle of a paragraph you are seeking to extract.
  3. You're not concerned with "model input overflow"
    Some models can only accept inputs of a certain size. That might be a single page image, or a character limit on text. If your documents are larger than the input capacity of your model, then you'll need some strategy for segmenting and extracting a smaller region of the document.
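
As a rough illustration of the "model input overflow" condition, here is a minimal sketch that splits text into pieces no longer than a character limit. The function name chunk_text and its parameters are hypothetical, chosen for this example; a real pipeline would more likely cut on page, paragraph, or token boundaries than on raw character counts.

```python
def chunk_text(text: str, max_chars: int, overlap: int = 0) -> list[str]:
    """Split text into pieces of at most max_chars characters.

    Consecutive pieces share `overlap` characters, so content that straddles
    a cut point can still appear whole in one piece if it is short enough.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be at least 0 and less than max_chars")

    step = max_chars - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this piece already reaches the end of the text
    return chunks


# Hypothetical limit of 4 characters with 1 character of overlap.
print(chunk_text("abcdefghij", max_chars=4, overlap=1))
```

Each piece can then be sent to the model separately. How the per-piece outputs are merged afterwards depends on the task and is not covered by this sketch.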

Case 2: It matters, but the model will handle it

Some models can essentially segment and extract at the same time. The team responsible for your model will know if this is the case, but in general such a model falls into one of two categories.

  1. The model is spatially aware
    There are ways to make a model spatially aware in both the image and text domains. This is akin to adding another dimension to the words on the page — a dimension that can help the model meaningfully group and filter words based on where they occur. For example, the text "John Doe," which might otherwise only be seen as a name, could be identified specifically as the recipient of an invoice because it was located in the region of the document the model interprets as the address block for the recipient.
  2. The model is discursively aware
    This is analogous to spatial awareness, but the "space" in this case is the semantic structure of the document's content, not the layout of the document's pages. For example, newer deep learning models use what are called "multi-head attention" mechanisms, which allow them to keep track of the fact that the text "John Doe" occurs not just in any paragraph, but specifically in a paragraph in the section of the document about board members.
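
To make the extra "dimension" of spatial awareness more concrete, here is a small hand-written sketch of words that carry page coordinates. It is only an illustration of the kind of data involved: a spatially aware model learns such regions rather than being given them, and the Word and Region types, the coordinates, and the region boundaries below are all invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    x: float  # horizontal center, as a fraction of page width (0 = left edge)
    y: float  # vertical center, as a fraction of page height (0 = top edge)


@dataclass(frozen=True)
class Region:
    name: str
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, word: Word) -> bool:
        """True if the word's center lies inside this rectangle."""
        return (self.left <= word.x <= self.right
                and self.top <= word.y <= self.bottom)


def words_in_region(words: list[Word], region: Region) -> list[str]:
    """Keep only the words located inside the given region, in input order."""
    return [w.text for w in words if region.contains(w)]


# Made-up page: four words and a made-up recipient address block.
page = [
    Word("Invoice", 0.50, 0.05),
    Word("John", 0.12, 0.20),
    Word("Doe", 0.18, 0.20),
    Word("Total", 0.80, 0.90),
]
recipient_block = Region("recipient_address", 0.05, 0.15, 0.45, 0.35)

print(words_in_region(page, recipient_block))  # ['John', 'Doe']
```

Without the coordinates, "John Doe" is just a name somewhere in the text. With them, the same words can be tied to the recipient's address block.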

Case 3: It matters, and the pipeline should handle it

If the first two cases don't apply, you will need your pipeline to perform some level of segmentation for you. And even if Case 2 applies, you may still want the pipeline to segment to some degree. Here are three scenarios that would place you clearly in Case 3:

  1. A single "input document" contains multiple "output documents"
    Perhaps each input file is an archival PDF containing the receipts for every business trip an employee went on last year, but your model is a receipt data extractor. You'll need the pipeline to split this input file by receipt and provide each to the model separately.
  2. Regions of the input document are mapped to different models
    Perhaps your pipeline accepts account opening requests, but each request is a combination of multiple documents concatenated into the same PDF. You may need your pipeline to split the input file for delivery across different document-specific models.
  3. You want to apply hard-coded rules/policies in addition to a model's learned segmentation
    Even if your model learns to segment and extract at the same time, you may still want to apply hard-coded segmentation rules for business reasons. Perhaps you want to redact every page that contains certain types of information, or route a very specific set of documents to a human-only team.
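
The three scenarios above can be sketched together in a few lines: split one input file into several documents, send each document to a destination, and let a hard-coded policy take priority. This is a minimal sketch under simplifying assumptions: pages are plain strings, a new document is detected by a fixed page prefix, and the names split_pages, route, receipt_model, and human_review are hypothetical.

```python
from typing import Callable


def split_pages(pages: list[str],
                starts_new_document: Callable[[str], bool]) -> list[list[str]]:
    """Group consecutive pages into documents.

    A new group begins at every page for which starts_new_document is true;
    the first page always begins a group.
    """
    documents: list[list[str]] = []
    for page in pages:
        if starts_new_document(page) or not documents:
            documents.append([])
        documents[-1].append(page)
    return documents


def route(document: list[str],
          rules: list[tuple[str, Callable[[str], bool]]],
          default: str) -> str:
    """Return the destination of the first rule that matches the document."""
    text = "\n".join(document)
    for destination, matches in rules:
        if matches(text):
            return destination
    return default


# Made-up input file: two receipts (one spanning two pages) and an application.
pages = [
    "RECEIPT Hotel Alpha total 120.00",
    "continued: minibar 8.00",
    "RECEIPT Taxi total 23.50",
    "ACCOUNT APPLICATION name: John Doe",
]
documents = split_pages(
    pages, lambda page: page.startswith(("RECEIPT", "ACCOUNT APPLICATION")))

# Rules are checked in order, so the hard-coded policy comes first.
rules = [
    ("human_review", lambda text: "John Doe" in text),
    ("receipt_model", lambda text: text.startswith("RECEIPT")),
]
for doc in documents:
    print(len(doc), "page(s) ->", route(doc, rules, default="unrouted"))
```

Putting the policy rule ahead of the model rules is one possible design choice. It ensures a business rule always wins, whatever a model-specific rule would have decided.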