Smart OCR – Dietary constraints detection


Hello friends,

In this tutorial we will start getting our hands dirty – that is, we will begin developing real AI software that solves a genuine business problem, so that you feel you are both learning and building something with a real value proposition.

Let me start with a frequent problem I face: as a Muslim, I have dietary constraints regarding certain food components and E numbers. I spend considerable time in the supermarket translating ingredient names and looking them up. It would be convenient for me (and for anyone with dietary constraints due to religious, medical, or ethical reasons) to have a mobile app that scans a product and immediately tells you whether it contains any restricted components.

Scoping the business problem

To set expectations clearly, I am not planning to implement a fully working end-to-end (E2E) solution; the main focus here will be implementing the cognitive service integration that enables the core functionality. I will share the source code on GitHub, and anyone is more than welcome to expand and develop it further.

Therefore, our solution will rely on the following cognitive API:

OCR detection API: This API is responsible for returning the text extracted from an image. We will use it to extract the food component text so that we can match it against an existing database of restricted components.

Vendor choice

As I discussed in my last blog post (Democratized AI – AI as a Service for all!), there are many platforms providing AIaaS, such as GCP, AWS, and Azure. For this tutorial I am choosing the Microsoft Azure offering; feel free to select Amazon or Google instead. Make sure to compare prices, supported languages, text orientation support, and any other parameter that could be of significance to you.

Preparing your Azure Account

First and foremost, make sure you have a valid Azure account. Follow this tutorial from Microsoft to get your Azure account ready and your cognitive services resource created.

Preparing your project

I will be using Visual Studio 2017 Enterprise edition as the IDE to develop the code sample. The code samples are written in C# and ASP.NET Core 2.2. Kindly note that you are entirely free to use your preferred programming language/framework; the APIs are platform agnostic, as discussed previously. Our implementation will rely on the .NET SDK; other SDKs can be found here. I will provide the URL to the GitHub project containing the code sample. The core logic of the functionality relies on a code sample from Microsoft MSDN, with adaptations to meet our requirements.

Developing the solution

  1. Preparing the restriction list:

First, I created a simple static class that has a property containing all restricted food components:
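The actual class is in the GitHub repository; here is a minimal sketch of what such a static restriction-list class might look like (the class and property names are illustrative assumptions, not necessarily the ones used in the repo):

```csharp
using System.Collections.Generic;

// Minimal sketch of a static restriction list (names are illustrative).
public static class RestrictedComponents
{
    // All restricted food components, lower-cased to simplify matching later on.
    public static IReadOnlyList<string> Components { get; } = new List<string>
    {
        "thiamine"
    };
}
```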

I have chosen thiamine as a test sample; feel free to add any other components if you wish.

  2. Defining the OCR Service:

Even though the focus of this blog post is developing a dietary constraints detection solution in the fastest way possible, I am trying to keep things a little cleaner.

  • OCR Service interface:

The interface is pretty simple: it takes an image stream (an image of the ingredients section of the product) and returns a task that yields a string array containing all detected text lines.
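As a sketch, the contract could look like this (DetectTextInImageAsync is the operation name referenced later in this post; the exact signature in the repo may differ slightly):

```csharp
using System.IO;
using System.Threading.Tasks;

// Sketch of the OCR service contract described above.
public interface IOCRService
{
    // Accepts an image stream (the ingredients section of the product)
    // and returns all detected text lines.
    Task<string[]> DetectTextInImageAsync(Stream imageStream);
}
```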

  • The following NuGet package has to be added to the project: Microsoft.Azure.CognitiveServices.Vision.ComputerVision.

This is Microsoft's Computer Vision SDK, which encapsulates the HTTP calls to the cognitive services.
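If you prefer the command line to the Visual Studio package manager UI, the same package can be added with the .NET CLI:

```
dotnet add package Microsoft.Azure.CognitiveServices.Vision.ComputerVision
```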

  • Azure OCR Service implementation:

The AzureOCRService class is the concrete implementation of IOCRService. The class defines the following properties (a skeleton sketch follows this list):

numberOfCharsInOperationId: A fixed value of 36; it refers to the length of the operation Id returned by the Azure cognitive services resource as part of the operation location.

subscriptionKey: The API key used to authenticate requests to the cognitive services resource. You can find it in the Keys section of your cognitive services resource in the Azure portal.

cognitiveServiceEndPoint: The URL of the endpoint where your cognitive service is hosted. You can find it in the Overview section of your cognitive service in the Azure portal.

textRecognitionMode: An enum that accepts Handwritten or Printed; it tells the service whether we want to detect printed or handwritten text.

computerVision: The Computer Vision client used to perform the API calls; it conveniently wraps all the HTTP requests needed to perform the Computer Vision cognitive operations.
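Putting those properties together, the skeleton of the class might look roughly like this (field names follow the descriptions above; the exact visibility, naming, and initialization in the repo may differ):

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.CognitiveServices.Vision.ComputerVision;
using Microsoft.Azure.CognitiveServices.Vision.ComputerVision.Models;

// Sketch of the AzureOCRService skeleton, based on the property descriptions above.
public class AzureOCRService : IOCRService
{
    // Length of the operation Id returned as part of the operation location.
    private const int numberOfCharsInOperationId = 36;

    // API key from the Keys section of the cognitive services resource.
    private readonly string subscriptionKey;

    // Endpoint URL from the Overview section of the cognitive services resource.
    private readonly string cognitiveServiceEndPoint;

    // Whether we want to detect printed or handwritten text.
    private readonly TextRecognitionMode textRecognitionMode = TextRecognitionMode.Printed;

    // Client wrapping the HTTP calls to the Computer Vision API.
    private readonly ComputerVisionClient computerVision;

    // The constructor and methods are shown in the sketches below.
}
```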

After defining the major class properties, let's examine the class body and discuss the functions it uses.

Our constructor prepares the Computer Vision client with the subscription key and endpoint details.
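A sketch of what that constructor might look like, assuming the key and endpoint are passed in directly (in the repository they could just as well come from configuration):

```csharp
// Sketch: wire up the Computer Vision client with the subscription key and endpoint.
public AzureOCRService(string subscriptionKey, string cognitiveServiceEndPoint)
{
    this.subscriptionKey = subscriptionKey;
    this.cognitiveServiceEndPoint = cognitiveServiceEndPoint;

    computerVision = new ComputerVisionClient(
        new ApiKeyServiceClientCredentials(subscriptionKey))
    {
        Endpoint = cognitiveServiceEndPoint
    };
}
```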

The interface implementation of IOCRService simply wraps a call to ExtractLocalTextAsync. The function accepts a stream containing the image from which text is to be detected.
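In code, that wrapper can be as simple as the following sketch (method names as referenced in this post):

```csharp
// Sketch: the IOCRService implementation simply delegates to ExtractLocalTextAsync.
public Task<string[]> DetectTextInImageAsync(Stream imageStream)
{
    return ExtractLocalTextAsync(imageStream);
}
```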

The ExtractLocalTextAsync function performs the following (a code sketch follows this list):

  1. Calls RecognizeTextInStreamAsync on computerVision. RecognizeTextInStreamAsync is an async function that accepts the image stream and a TextRecognitionMode. It returns RecognizeTextInStreamHeaders, an object that wraps a unique identifier for the operation (valid for 48 hours) that can be used to query the operation status. The identifier is stored in the OperationLocation property; the operation Id itself is 36 characters long.

Sample format, where the operation Id is the final segment of the URL:

https://francecentral.api.cognitive.microsoft.com/vision/v2.0/textOperations/7db64783-013a-4e57-9db1-f5ef17ed517a

  2. Calls GetTextAsync, passing computerVision and the operation location. GetTextAsync then:
  • Extracts the operation Id from the operation location.
  • Uses the computer vision client to query the status of the extracted operation Id by calling GetTextOperationResultAsync, which returns a TextOperationResult whose status property can be 'Not Started,' 'Running,' 'Failed,' or 'Succeeded.'
  • Waits for the operation to finish with a certain number of retries; when it succeeds, it returns all detected lines from the TextOperationResult. The TextOperationResult also has word and bounding-box properties, which can be used to extract individual words.
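Putting those steps together, a minimal sketch of ExtractLocalTextAsync and GetTextAsync could look like this (the retry count and delay are illustrative values, and error handling is omitted for brevity):

```csharp
// Sketch: start the text recognition operation and wait for its result.
private async Task<string[]> ExtractLocalTextAsync(Stream imageStream)
{
    // Start the asynchronous text recognition operation on the service.
    RecognizeTextInStreamHeaders headers =
        await computerVision.RecognizeTextInStreamAsync(imageStream, textRecognitionMode);

    // Poll the operation until it finishes and collect the detected lines.
    return await GetTextAsync(computerVision, headers.OperationLocation);
}

private async Task<string[]> GetTextAsync(ComputerVisionClient client, string operationLocation)
{
    // The operation Id is the last 36 characters of the operation location URL.
    string operationId = operationLocation.Substring(
        operationLocation.Length - numberOfCharsInOperationId);

    // Query the operation status until it is no longer queued or running
    // (bounded by an illustrative number of retries).
    TextOperationResult result = await client.GetTextOperationResultAsync(operationId);
    int retries = 0;
    while ((result.Status == TextOperationStatusCodes.Running ||
            result.Status == TextOperationStatusCodes.NotStarted) && retries++ < 10)
    {
        await Task.Delay(1000);
        result = await client.GetTextOperationResultAsync(operationId);
    }

    // Return the text of every detected line.
    return result.RecognitionResult.Lines.Select(line => line.Text).ToArray();
}
```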

Finally, our ASP.NET Core 2.2 controller injects the OCR service, calls the DetectTextInImageAsync operation, and matches the result against the restriction list we defined in order to render the restricted words detected in the image. I am not covering the ASP.NET Core details here, as they are not the focus. However, you can find everything in the code sample at the following URL: https://github.com/cognitiveosman/DietrayRestrictionsDetector 😊
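As an illustration only (the actual controller lives in the repository linked above), the matching logic could look roughly like this; the controller and action names here are assumptions:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;

// Illustrative controller sketch; names and view handling are assumptions.
public class DetectionController : Controller
{
    private readonly IOCRService ocrService;

    public DetectionController(IOCRService ocrService)
    {
        this.ocrService = ocrService;
    }

    [HttpPost]
    public async Task<IActionResult> Detect(IFormFile image)
    {
        using (var stream = image.OpenReadStream())
        {
            // Extract all text lines from the uploaded image.
            string[] lines = await ocrService.DetectTextInImageAsync(stream);

            // Keep only the lines that mention a restricted component.
            var restricted = lines
                .Where(line => RestrictedComponents.Components.Any(
                    c => line.IndexOf(c, StringComparison.OrdinalIgnoreCase) >= 0))
                .ToArray();

            return View(restricted);
        }
    }
}
```

For the injection to work, the OCR service also needs to be registered in Startup.ConfigureServices, for example with services.AddSingleton<IOCRService>(new AzureOCRService(key, endpoint)).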

Here is a screenshot of how our application looks:

Feel free to expand upon it and innovate, and submit a GitHub PR if you wish. Here are some ideas I have:

  1. Improve the application UI, for example by highlighting the restricted words in the image.
  2. Expose the application functionality as an API.
  3. Develop a mobile front-end.
  4. Redesign the restriction list to support multiple languages and explain the cause of each restriction.

Discussion Questions:

  1. Do you face a similar recurring problem?
  2. Is there something you feel is too mindless to keep doing manually every time?
  3. Can you think of other applications where we could utilize text extraction services?

And thanks for reading. Please feel free to ask questions or share feedback.