Using the best data extraction solution improves your company’s document automation rate, leading to faster, more efficient processes with fewer manual errors and happier employees.
We have been benchmarking Hypatos against our competitors during client PoCs with satisfactory results, and we wanted to share a similar, comprehensive benchmarking exercise publicly so that potential customers are aware of the performance of our solution. We compared our solution against the competition in terms of accuracy and of features such as deployment options and ease of integration.
What is the difference between data extraction and OCR?
In short,
- OCR turns documents into text, a form of unstructured data that still needs to be processed by humans
- Data extraction solutions provide structured, machine-readable data
Therefore, data extraction solutions enable documents to be automatically processed. For more, feel free to read our OCR article where we explain the difference between OCR and data extraction.
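To make the distinction concrete, here is a small illustrative sketch in Python. The invoice text, field names and values are made up for illustration and are not the output of any product in this benchmark.

```python
import json

# Hypothetical OCR output: one block of text with no field boundaries.
ocr_text = (
    "ACME GmbH Rechnung Nr. 2020-001 Datum 01.02.2020 "
    "Netto 100,00 EUR MwSt 19% 19,00 EUR Brutto 119,00 EUR"
)

# Hypothetical data extraction output: the same invoice as named, typed fields.
extracted = {
    "sender_name": "ACME GmbH",
    "invoice_number": "2020-001",
    "invoice_date": "2020-02-01",
    "net_amount": 100.00,
    "vat_rate": 0.19,
    "total_vat_amount": 19.00,
    "total_gross_amount": 119.00,
    "currency": "EUR",
}

# A downstream system can read a field directly from the structured record...
gross = extracted["total_gross_amount"]

# ...and serialize the record for another system without a human reading it.
payload = json.dumps(extracted, sort_keys=True)
```

With the OCR text alone, someone (or some additional software) still has to work out which number is the gross amount. With the structured record, that question is already answered.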
How can we determine the best data extraction solution?
Any AI solution can be measured against its competitors by comparing its accuracy against manually labeled data. This approach forms the basis of most PoC projects by large companies. These companies ask several leading vendors to produce predictions on the companies’ own data, which has been manually labeled, and then compare the predictions against the labels. The accuracy of these solutions is an important input to the companies’ procurement decision.
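The comparison described above can be sketched in a few lines of Python. This is a minimal sketch under our own assumptions, not the scoring code used in the benchmark: the function names, the exact-match rule after simple normalization, and the sample values are all hypothetical.

```python
def normalize(value):
    """Compare values case-insensitively and ignore surrounding whitespace."""
    return str(value).strip().lower()


def field_accuracy(predicted, labeled):
    """Share of labeled fields whose predicted value matches the label.

    A field that is missing from the prediction counts as an error.
    """
    if not labeled:
        raise ValueError("labeled data must contain at least one field")
    correct = sum(
        1
        for name, truth in labeled.items()
        if name in predicted and normalize(predicted[name]) == normalize(truth)
    )
    return correct / len(labeled)


# Hypothetical manually labeled invoice and two hypothetical vendor outputs.
labeled = {"invoice_number": "2020-001", "currency": "EUR", "net_amount": "100.00"}
vendor_a = {"invoice_number": "2020-001", "currency": "eur", "net_amount": "100.00"}
vendor_b = {"invoice_number": "2020-001", "currency": "USD"}

scores = {
    "vendor_a": field_accuracy(vendor_a, labeled),
    "vendor_b": field_accuracy(vendor_b, labeled),
}
```

In a real PoC, the matching rule matters as much as the arithmetic. Whether "100.00" and "100,00" count as the same amount, for example, should be agreed on before vendors are compared.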
What is the most accurate data extraction solution?
As you can see below, Hypatos was by far the best solution for these documents in terms of both
- Number of all entities extracted
- Accuracy of all extracted entities
Another important metric is performance on crucial fields. Companies that are not interested in gaining insights into their spending can capture just the critical fields needed to make a payment and record key aspects of the transaction in SAP. Hypatos was again the best solution in terms of both
- Number of crucial fields extracted
- Accuracy of crucial extracted fields
For most clients, crucial fields include:
- Invoice Number
- Document Type
- Invoice Date
- Service time
- Net Amount
- VAT rate
- Total VAT amount
- Total Gross amount
- Currency
- Sender Name
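The two crucial-field metrics above could be computed roughly as follows. This is a sketch that assumes "accuracy of crucial extracted fields" means the share of extracted crucial fields that match the manual label; the snake_case field names and the sample invoice are hypothetical.

```python
# The crucial fields listed above, written as hypothetical snake_case keys.
CRUCIAL_FIELDS = [
    "invoice_number", "document_type", "invoice_date", "service_time",
    "net_amount", "vat_rate", "total_vat_amount", "total_gross_amount",
    "currency", "sender_name",
]


def crucial_field_metrics(predicted, labeled):
    """Return (number of crucial fields extracted, accuracy of those fields)."""
    # A crucial field counts as extracted if the prediction has a non-empty value.
    extracted = [f for f in CRUCIAL_FIELDS if predicted.get(f) not in (None, "")]
    # An extracted field counts as correct if it equals the manual label.
    correct = [f for f in extracted if predicted[f] == labeled.get(f)]
    accuracy = len(correct) / len(extracted) if extracted else 0.0
    return len(extracted), accuracy


# Hypothetical label and prediction for one invoice.
labeled = {
    "invoice_number": "2020-001", "document_type": "invoice",
    "invoice_date": "2020-02-01", "service_time": "2020-01",
    "net_amount": "100.00", "vat_rate": "0.19",
    "total_vat_amount": "19.00", "total_gross_amount": "119.00",
    "currency": "EUR", "sender_name": "ACME GmbH",
}
predicted = {
    "invoice_number": "2020-001", "invoice_date": "2020-02-01",
    "currency": "EUR", "sender_name": "ACME AG",
    "recipient_name": "Example Ltd",  # not a crucial field, so it is ignored
}

count, accuracy = crucial_field_metrics(predicted, labeled)
```

Reporting both numbers separately is useful because a solution can look accurate simply by extracting very few fields.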
Sample used in benchmarking
We used a relatively small set of 10 invoices from Germany in this initial benchmarking exercise. The main constraint on the sample size was that we could only use documents that can be shared publicly, because we wanted to be able to share the data set with the tech press and potential customers so that they can reproduce our results. Therefore, we relied on invoices that we received ourselves and could not use any of our customers’ documents.
Methodology
We could only benchmark Hypatos against other solutions that offered trial products, but we believe we covered all modern data extraction solutions that deal with semi-structured documents, including offers, orders, invoices, receipts and payslips. We excluded solutions that focus on a single type of document, as we have seen our clients use our services for multiple types of documents and we have not seen demand for document-specific solutions from enterprise clients.
Of course, feel free to add a comment here if you think other products should be listed. The products we used in the benchmarking are:
- Textract available on AWS
- Sypht: Free account available on sypht.com
Are there other criteria that could affect a company’s procurement decision?
Accuracy is not the only factor in the decision. Deployment options, ease of integration and advanced processing options are also important criteria, and we compared the benchmarked companies against them:
Company | Deployment options | Integration options | Advanced processing options |
---|---|---|---|
Hypatos | On-prem; private cloud (AWS/MS Azure); public cloud | API; integration to document workflow tools such as Kofax | VAT compliance check; account prediction |
Sypht | N/A | | |
Textract | Public cloud; private cloud (AWS) | API | N/A |
This table is based on public data, and we are happy to update it if the benchmarked companies share more details with us. Continue below for a more detailed explanation of these criteria:
Deployment options
Most European Fortune 500 companies prefer on-prem or private cloud solutions due to their security and data privacy policies. This can become a deciding factor in procurement.
Ease of integration
All of these solutions provide APIs which are easy to integrate into most applications. However, having existing integrations to enterprise software makes integration even easier.
Advanced processing options
Extraction is only the first step. In almost all cases, companies do additional manual processing on the extracted data. For example, invoices need to be assigned to accounts if they are not matched with a purchase order. In such cases, your service provider’s support is important to further automate the process.
This is not a requirement; companies can also work with software companies to build customized solutions that increase their level of automation. However, in areas such as back-office automation, most companies in the same industry have similar data, so their data does not give them a competitive advantage. In such cases, companies should strive to get the best solution on the best terms, and only vendors with experience in the topic can offer such terms.
Support
Most companies in the benchmark set publicly claim that they offer extensive support options. Even where they do not claim this publicly, we expect all companies in the field to offer support, especially to large companies, so we do not deep dive into this area.
Vendor track record
Similar to support, we have seen that all benchmark companies have Fortune 500 customers. We could get into more details here as we believe we have the strongest network of partners and customers in this space. However, given that Amazon is one of the benchmark companies, this is a hard exercise as it is difficult to split their AWS customers from their Textract customers just based on public data. Therefore, for now, we do not deep dive into this area.
Price
Finally, price is also a factor in decision making. However, since almost none of the companies in the benchmark set disclose their enterprise prices, we could not compare them by price.
What are the areas where data extraction solutions fail?
While items like sender and recipient are relatively straightforward, others like line item extraction and multiple VAT rates proved challenging for our competitors.
Line item extraction
Line items, located near the bottom of invoices in a table format, list all the items that make up the purchase. They are hard to extract because these table-like structures are not clearly formatted as tables. Some extraction failures from our competitors and Hypatos’ successful results for the same document are below:
Multiple VAT rates
Multiple VAT rates are possible when an invoice contains multiple line items (multiple services or products) with different VAT rates. This was not handled successfully by most competitors. However, Hypatos’ deep learning technology is able to extract multiple VAT rates correctly.
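To show why one VAT rate per invoice is not enough, here is a small worked example in Python. It is an illustration under our own assumptions, not a description of how Hypatos or any competitor works: the line items are made up, 19% and 7% are used only as example rates, and rounding VAT once per rate (rather than per line) is a simplifying choice.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP


def vat_breakdown(line_items):
    """Group net amounts by VAT rate and compute the VAT due per rate."""
    net_by_rate = defaultdict(Decimal)  # Decimal() starts each rate at 0
    for item in line_items:
        net_by_rate[item["vat_rate"]] += item["net_amount"]
    cent = Decimal("0.01")
    return {
        rate: {
            "net": net,
            "vat": (net * rate).quantize(cent, rounding=ROUND_HALF_UP),
        }
        for rate, net in net_by_rate.items()
    }


# Hypothetical invoice with three line items and two different VAT rates.
line_items = [
    {"description": "Consulting", "net_amount": Decimal("100.00"), "vat_rate": Decimal("0.19")},
    {"description": "Book", "net_amount": Decimal("50.00"), "vat_rate": Decimal("0.07")},
    {"description": "Software licence", "net_amount": Decimal("200.00"), "vat_rate": Decimal("0.19")},
]

breakdown = vat_breakdown(line_items)
total_net = sum(entry["net"] for entry in breakdown.values())
total_vat = sum(entry["vat"] for entry in breakdown.values())
total_gross = total_net + total_vat
```

An extraction result that reports a single rate for this invoice cannot reproduce the total VAT amount, which is why each rate has to be captured together with the net amount it applies to.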
We hope you find the benchmark useful. Choosing a supplier is hard, and hopefully our approach helps you formulate your own. If you need support in document automation, we would love to help.