Loan Think

For Best Results, Regularly Test and Validate Data Extraction Tech

By Brian Fitzpatrick, December 18, 2014, 12:09 p.m. EST

Mortgage lenders are turning to optical character recognition (OCR) software to automate the process of culling data from loan documents, but most don't attain the results they anticipated. OCR replaces a manual, error-prone process with an automated and efficient one, and that's the appeal of the technology.

A growing number of senior executives see OCR as the solution that enables them to recognize, collect and extract the data needed to verify mortgage loan quality. That data can be used to drive rules, creating a less expensive, more accurate mortgage audit process.

OCR sounds like a great way to replace manual processes, and, in theory, it is. But in practice, lenders are often disappointed with its performance and return to the manual approach, frequently after spending millions of dollars on implementation.

One reason the technology fails is that OCR software is only a small component of an overall quality management capability, and it can be doomed by a strategy of "set it up once and forget it." While OCR can recognize documents based on key phrases and anchor terms, it is not a foolproof approach.

OCR may work well for individual documents, but mortgage files are often 450 pages or more, containing dozens of documents and forms. Off-the-shelf OCR technology offers only limited capabilities: it cannot easily or effectively identify boundaries between documents. The task becomes even more complex when one or more pages from one document are inadvertently placed in the middle of another, or when there are multiple versions of the same document.

For example, a bank statement that has eight pages followed by the uniform residential loan application may be misidentified as a bank statement with nine pages that includes the first page of the loan application. The file would be flagged as missing the mortgage application, when in reality the technology simply failed to identify that page appropriately.
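The boundary problem can be illustrated with a minimal Python sketch. It assumes a deliberately naive splitter in which a page with no recognizable anchor phrase is attached to the document before it. The anchor phrases, the function names and the garbled page text are all hypothetical and are not taken from any particular OCR product.

```python
# Hypothetical anchor phrases; a real deployment would maintain many more.
ANCHORS = {
    "bank_statement": ["account summary", "statement period"],
    "loan_application": ["uniform residential loan application"],
}


def classify_page(text):
    """Return the document type whose anchor phrase appears on the page, else None."""
    lowered = text.lower()
    for doc_type, phrases in ANCHORS.items():
        if any(phrase in lowered for phrase in phrases):
            return doc_type
    return None


def split_documents(pages):
    """Group consecutive pages into documents.

    Assumption for illustration: a page with no anchor phrase inherits
    the type of the document that precedes it.
    """
    documents = []
    for number, text in enumerate(pages, start=1):
        doc_type = classify_page(text)
        if documents and doc_type in (None, documents[-1]["type"]):
            documents[-1]["pages"].append(number)
        else:
            documents.append({"type": doc_type or "unknown", "pages": [number]})
    return documents


# Eight statement pages, then an application page whose heading OCR garbled.
pages = ["Statement period: Nov 2014"] * 8 + ["Un1form Res1dential Loan Appl1cation"]
docs = split_documents(pages)
found_types = {doc["type"] for doc in docs}
print(docs)
print("loan application missing:", "loan_application" not in found_types)
```

Under these assumptions the splitter reports a single nine-page bank statement and no loan application, which mirrors the false "missing document" flag described above.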

User experience indicates that deploying standalone OCR hasn't worked well. Lenders suffer indexing error rates of 40% to 50%, with some rates as high as 70%. Lenders cannot operate profitably with error rates of that magnitude.

The benign-neglect approach to deploying OCR technology enjoys wide adoption in the industry, but most senior executives and technologists don't realize it is often the reason the technology fails.

Some forward-looking lenders are adopting a sophisticated automated workflow and data-integrity model that supports their processes and allows them to test results and to continually validate data on the back end.

They implement multiple OCR engines with sophisticated workflow capabilities based on voting or accuracy-scoring engines, build rules, test continuously, and make changes that ensure low-confidence documents or data elements are routed for validation so that errors are not repeated.
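One way to picture voting across engines is the Python sketch below. It is an illustration under stated assumptions, not a description of any vendor's scoring engine: each engine is assumed to return a value and a confidence between 0 and 1, the confidence-weighted vote is one possible scoring scheme among many, and the 0.80 review threshold is an arbitrary example value.

```python
from collections import defaultdict

REVIEW_THRESHOLD = 0.80  # assumed cutoff; a lender would tune this from test results


def vote(readings):
    """Confidence-weighted vote over (value, confidence) pairs, one per OCR engine.

    Returns the winning value and its share of the total confidence.
    """
    totals = defaultdict(float)
    for value, confidence in readings:
        totals[value] += confidence
    total = sum(totals.values())
    if total == 0:
        return None, 0.0  # nothing usable was read
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / total


def route(readings):
    """Accept high-agreement readings; send the rest to a person for validation."""
    value, score = vote(readings)
    queue = "auto_accept" if score >= REVIEW_THRESHOLD else "manual_review"
    return {"value": value, "score": round(score, 2), "queue": queue}


# Three hypothetical engines reading the loan amount from the same page.
print(route([("250000", 0.95), ("250000", 0.90), ("260000", 0.40)]))
print(route([("250000", 0.60), ("260000", 0.55), ("250000", 0.50)]))
```

In the first call two confident engines agree and the value is accepted automatically. In the second the engines are less sure and split, so the element goes to the manual review queue.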

The automated workflow and data-integrity approach provides the multitier, flexible documentation workflow that's required and the ability to verify and reverify data. Additionally, the data used downstream is based on order of priority or confidence. Later, that data is validated and compared side by side against the same data element captured from multiple documents in the loan file. This provides a high-definition view of the data and allows for quick and efficient inspection of key data elements and the ability to determine problems in the file.
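The side-by-side comparison can be sketched in a few lines of Python. The source priority order, the document names and the normalization rule here are hypothetical choices made for illustration; an actual platform would define its own.

```python
# Hypothetical order of priority: earlier sources are trusted first.
SOURCE_PRIORITY = ["note", "loan_application", "underwriting_summary"]


def normalize(value):
    """Strip currency formatting so "$250,000" and "250000" compare equal."""
    return value.replace("$", "").replace(",", "").strip().lower()


def rank(doc_type):
    """Position in the priority list; unlisted sources sort last."""
    if doc_type in SOURCE_PRIORITY:
        return SOURCE_PRIORITY.index(doc_type)
    return len(SOURCE_PRIORITY)


def reconcile(field, captured):
    """Pick the value from the highest-priority source and list disagreeing sources.

    captured maps a document type to the value extracted from that document.
    """
    if not captured:
        return {"field": field, "value": None, "source": None, "conflicts": {}}
    source = sorted(captured, key=rank)[0]
    chosen = normalize(captured[source])
    conflicts = {
        doc: value for doc, value in captured.items() if normalize(value) != chosen
    }
    return {"field": field, "value": chosen, "source": source, "conflicts": conflicts}


result = reconcile(
    "loan_amount",
    {
        "loan_application": "$250,000",
        "note": "250000",
        "underwriting_summary": "255,000",
    },
)
print(result)
```

Here the value from the note is used downstream, the loan application agrees with it once formatting is removed, and the underwriting summary is surfaced as a conflict for inspection.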

Additionally, automated rules run against the validated and verified data eliminate the error-prone manual approach, yet allow employees to monitor and analyze the results. Error trends are identified and the technology is "trained" and improved on an ongoing basis.
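Spotting error trends can start with something as simple as counting how often reviewers have to correct each document type. The Python sketch below assumes a hypothetical review log of (document type, was corrected) pairs; the log format and the sample entries are invented for illustration.

```python
from collections import Counter


def error_rates(review_log):
    """Share of reviewed items per document type that a reviewer had to correct.

    review_log is an iterable of (doc_type, was_corrected) pairs.
    """
    totals, errors = Counter(), Counter()
    for doc_type, was_corrected in review_log:
        totals[doc_type] += 1
        if was_corrected:
            errors[doc_type] += 1
    return {doc_type: errors[doc_type] / totals[doc_type] for doc_type in totals}


def worst_first(rates):
    """Order document types so the most error-prone come first."""
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Invented sample log for illustration only.
log = [
    ("gift_letter", True),
    ("gift_letter", True),
    ("gift_letter", False),
    ("bank_statement", False),
    ("bank_statement", True),
    ("bank_statement", False),
    ("bank_statement", False),
]
for doc_type, rate in worst_first(error_rates(log)):
    print(f"{doc_type}: {rate:.0%} corrected")
```

A ranking like this suggests where new rules or retraining would pay off first.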

When demonstrating OCR software, vendors will often use "perfect" loan files, scanned at high resolution, that contain very few unstructured, "surprise" documents. The results show how precisely the OCR platform can work under ideal conditions. In the real world, lenders neither create nor buy perfect, high-resolution, unsmudged, fully legible loan files, especially in the correspondent-lending channel.

Once lenders adopt the multitier approach, the technology operates at full potential, generating the efficiencies that have been promised, including accuracy rates better than 90%. That improvement is achieved because lenders can more accurately and efficiently identify both structured and unstructured documents. Nonstandard documents, like a rate-lock agreement, have almost as many versions as there are lenders.

Advanced technology can be trained to identify, with pinpoint accuracy, structured documents that include common terms. Nonstandard and unstructured documents, like gift letters, have inconsistent wording and no common structure, and so require different workflow processing and have lower confidence scoring rates. The platform therefore needs to identify and forward them, via workflow, to skilled employees who can identify the document.

The old OCR software deployment approach that many firms have taken has undermined the value of the technology and resulted in very expensive disappointments. However, when deployed as part of a multitier approach, OCR can play a significant role in the mortgage process, a role that reduces errors, cuts costs, speeds processes, improves accuracy and, above all, provides a competitive advantage.

Brian K. Fitzpatrick is president and CEO of LoanLogics and a member of its board of directors.
