Your data might be stored in MS Word files, Outlook mails, or PDFs. Maybe it's buried in specific columns of an Excel workbook that has to be scraped from a website. Whatever it is, it's unlikely to be in the format your extraction model needs. The Format Conversion step of a pipeline transforms the data given to it into a representation your model can handle.
Format conversion can be far more labor-intensive than you think. It ranges from the mundane (MS Word export) all the way to unsolved AI challenges (handwriting recognition). And even the mundane can require a lot of work: what if your model needs an image representation of that MS Word file so that it can do layout analysis?
Format Conversion also includes OCR (optical character recognition), the software that converts images of text into machine-readable text. OCR for printed business documents is nearing perfect accuracy, while OCR for handwriting and muddled images is still under active development. (See the section on the direction of OCR in the 2020s for more.)
How much should you care about conversion?
Here are some questions you can discuss with your team to determine how carefully you need to think about format conversion as a part of your overall project.
Questions about Variety
- Will the data always come in the same format?
User-uploaded files and cross-departmental collaborations often make the answer to this question "no," because you're not in control of the input data.
- Will your system be processing PDFs?
PDF is essentially an "anything goes" format, allowing for surprises in production: so-called "text PDFs" can embed images, while "image PDFs" can embed hidden text, and PDFs themselves can contain arcane image formats.
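When you can't control the input, one cheap safeguard is to check what a file actually is before choosing a converter, rather than trusting its extension. The sketch below is a minimal illustration, not a complete detector: the function name `sniff_format` and its labels are hypothetical, and it recognizes only a handful of well-known file signatures.

```python
# Hypothetical helper: guess a file's container format from its leading bytes.
# The byte signatures are standard "magic numbers"; the labels are our own.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),                          # DOCX and XLSX are ZIP containers
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy DOC, XLS, Outlook MSG
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
]


def sniff_format(head: bytes) -> str:
    """Return a coarse format label for the first bytes of a file."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first 16 bytes of a file and classify them."""
    with open(path, "rb") as f:
        return sniff_format(f.read(16))
```

Note that this identifies only the container. A file that sniffs as `pdf` can still be a text PDF, an image PDF, or a mixture, so the surprises described above have to be handled page by page further down the pipeline.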
Questions about needing more than just text
- Do you need to preserve the notion of pages?
Does your input format have an abstraction like "pages" that may need to be handled in a special way? For some types of documents, the notion of a "page" is completely arbitrary. For others, it's an essential and intrinsic conveyor of information.
- Is any critical information represented visually?
This might be check boxes, signatures, font, or tabular structure. Or it could be page-level layout and color, such as the division between an article's main content and a sidebar on the same page. If your documents contain this type of data, you will have to decide whether to process that data in this step (probably using OCR) or preserve it somehow so that the model step can process it later.
- Is any critical information represented with hyperlinks?
If so, you may need to extract that information before handing the data to your model.
- Is the format conversion guaranteed to work?
Conversions such as DOCX → Text are near-guaranteed to work, whereas conversions such as Handwriting → Text are likely to be filled with errors. Knowing where on this spectrum you are informs everything from pipeline design to business modeling.
- If the conversion contains errors, will the system be aware of them?
It's not always the case that a computer knows when it's wrong. OCR packages, for example, can often report a confidence value, but this confidence value is yet another automated judgement that can be in error.
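One common way to act on a reported confidence value is to route low-confidence output to a separate path, such as human review. The sketch below assumes a hypothetical `OcrWord` record with a confidence in the range 0.0 to 1.0; real OCR packages use their own data structures and scales, and the 0.80 threshold is an arbitrary placeholder.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class OcrWord:
    """Hypothetical OCR result: one recognized word and its reported confidence."""
    text: str
    confidence: float  # assumed to be in [0.0, 1.0]


def split_by_confidence(
    words: List[OcrWord], threshold: float = 0.80
) -> Tuple[List[OcrWord], List[OcrWord]]:
    """Separate words the OCR engine is confident about from those it is not."""
    accepted = [w for w in words if w.confidence >= threshold]
    flagged = [w for w in words if w.confidence < threshold]
    return accepted, flagged


words = [OcrWord("Invoice", 0.98), OcrWord("1O23", 0.41), OcrWord("Total", 0.80)]
accepted, flagged = split_by_confidence(words)
print([w.text for w in accepted])  # words at or above the threshold
print([w.text for w in flagged])   # words routed for review
```

Because the confidence is itself an automated judgement, a split like this catches only the errors the OCR engine suspects. A word can be wrong and still land in `accepted`, so any threshold is best checked against documents whose correct text you already know.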