Tutorial: Use Azure Functions and Python to process stored documents

Document Intelligence can be used as part of an automated data processing pipeline built with Azure Functions. This guide shows you how to use Azure Functions to process documents that are uploaded to an Azure Blob Storage container. The workflow extracts table data from stored documents with the Document Intelligence layout model and saves the table data in a .csv file in Azure. You can then display the data by using Microsoft Power BI (not covered here).

Screenshot of Azure Service workflow diagram

In this tutorial, you learn how to:

  • Create an Azure Storage account.
  • Create an Azure Functions project.
  • Extract layout data from uploaded forms.
  • Upload extracted layout data to Azure Storage.

Prerequisites

To complete this tutorial, you need:

  • An Azure subscription.
  • A Document Intelligence resource, along with its endpoint and key (used in the Add document processing code section).
  • Python 3.6.x, 3.7.x, 3.8.x, or 3.9.x installed locally.
  • Visual Studio Code with the Azure Functions and Python extensions installed.
  • Azure Storage Explorer installed.
  • A sample PDF document to upload to the input container.

Create an Azure Storage account

  1. Create a general-purpose v2 Azure Storage account in the Azure portal. If you don't know how to create an Azure storage account with a storage container, follow these quickstarts:

    • Create a storage account. When you create your storage account, select Standard performance in the Instance details > Performance field.
    • Create a container. When you create your container, set Public access level to Container (anonymous read access for containers and blobs) in the New Container window.
  2. On the left pane, select the Resource sharing (CORS) tab, and remove the existing CORS policy if any exists.

  3. Once your storage account has deployed, create two empty blob storage containers, named input and output.
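
If you'd rather script this step, the same containers can be created with the Azure Storage SDK for Python (the azure-storage-blob package, which this tutorial also uses later). A minimal sketch, assuming you replace the placeholder connection string with your storage account's connection string from the Access keys tab:

    # Create the input and output containers with the Azure SDK instead of the portal.
    # The connection string below is a placeholder; copy yours from the Azure portal.
    from azure.storage.blob import BlobServiceClient
    from azure.core.exceptions import ResourceExistsError

    blob_service_client = BlobServiceClient.from_connection_string(
        "<your storage account connection string>")

    for container_name in ("input", "output"):
        try:
            blob_service_client.create_container(container_name)
        except ResourceExistsError:
            # Ignore the error if the container already exists.
            pass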

Create an Azure Functions project

  1. Create a new local folder named functions-app to contain the project.

  2. Open Visual Studio Code and open the Command Palette (Ctrl+Shift+P). Search for and choose Python: Select Interpreter → choose an installed Python interpreter that is version 3.6.x, 3.7.x, 3.8.x, or 3.9.x. This selection adds the path of the interpreter you selected to your project.

  3. Select the Azure logo from the left-navigation pane.

    • You'll see your existing Azure resources in the Resources view.

    • Select the Azure subscription that you're using for this project. Below it, you should see the Function App node.

      Screenshot of a list showing your Azure resources in a single, unified view.

  4. Select the Workspace (Local) section located below your listed resources. Select the plus symbol and choose the Create Function button.

    Screenshot showing where to begin creating an Azure function.

  5. When prompted, choose Create new project and navigate to the functions-app folder. Choose Select.

  6. You'll be prompted to configure several settings:

    • Select a language → choose Python.

    • Select a Python interpreter to create a virtual environment → select the interpreter you set as the default earlier.

    • Select a template → choose Azure Blob Storage trigger and give the trigger a name or accept the default name. Press Enter to confirm.

    • Select setting → choose ➕Create new local app setting from the dropdown menu.

    • Select subscription → choose your Azure subscription with the storage account you created → select your storage account → then select the name of the storage input container (in this case, input/{name}). Press Enter to confirm.

    • Select how you would like to open your project → choose Open the project in the current window from the dropdown menu.

  7. Once you've completed these steps, VS Code will add a new Azure Functions project with an __init__.py Python script. This script will be triggered when a file is uploaded to the input storage container:

import logging

import azure.functions as func


def main(myblob: func.InputStream):
    logging.info(f"Python blob trigger function processed blob \n"
                 f"Name: {myblob.name}\n"
                 f"Blob Size: {myblob.length} bytes")

Test the function

  1. Press F5 to run the basic function. VS Code will prompt you to select a storage account to interface with.

  2. Select the storage account you created and continue.

  3. Open Azure Storage Explorer and upload the sample PDF document to the input container. Then check the VS Code terminal. The script should log that it was triggered by the PDF upload.

    Screenshot of the VS Code terminal after uploading a new document.

  4. Stop the script before continuing.

Add document processing code

Next, you'll add your own code to the Python script to call the Document Intelligence service and parse the uploaded documents using its layout model.

  1. In VS Code, navigate to the function's requirements.txt file. This file defines the dependencies for your script. Add the following Python packages to the file:

    cryptography
    azure-functions
    azure-storage-blob
    azure-identity
    requests
    pandas
    numpy
    
  2. Then, open the __init__.py script. Add the following import statements:

    import logging
    from azure.storage.blob import BlobServiceClient
    import azure.functions as func
    import json
    import time
    from requests import get, post
    import os
    import requests
    from collections import OrderedDict
    import numpy as np
    import pandas as pd
    
  3. You can leave the generated main function as-is. You'll add your custom code inside this function.

    # This part is automatically generated
    def main(myblob: func.InputStream):
        logging.info(f"Python blob trigger function processed blob \n"
                     f"Name: {myblob.name}\n"
                     f"Blob Size: {myblob.length} bytes")
    
  4. The following code block calls the Document Intelligence Analyze Layout API on the uploaded document. Fill in your endpoint and key values.

        # This is the call to the Document Intelligence endpoint
        endpoint = r"Your Document Intelligence Endpoint"
        apim_key = "Your Document Intelligence Key"
        post_url = endpoint + "/formrecognizer/v2.1/layout/analyze"
        source = myblob.read()

        headers = {
            # Request headers
            'Content-Type': 'application/pdf',
            'Ocp-Apim-Subscription-Key': apim_key,
        }

        text1 = os.path.basename(myblob.name)
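
    The request header above assumes every uploaded file is a PDF. If your input container may also receive images, one option is to guess the content type from the blob name, as in this sketch using Python's built-in mimetypes module (the v2.1 Layout API also accepts common image formats such as JPEG, PNG, and TIFF):

        # Choose a Content-Type based on the uploaded file's extension,
        # falling back to application/pdf when it can't be guessed.
        import mimetypes
        content_type = mimetypes.guess_type(myblob.name)[0] or "application/pdf"
        headers['Content-Type'] = content_type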
    

    Important

    Remember to remove the key from your code when you're done, and never post it publicly. For production, use a secure way of storing and accessing your credentials like Azure Key Vault. For more information, see Azure AI services security.
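
    One way to do that during development is to read the values from application settings, which Azure Functions exposes to your code as environment variables. A minimal sketch, assuming you've added hypothetical settings named FORM_RECOGNIZER_ENDPOINT and FORM_RECOGNIZER_KEY to local.settings.json (and later to the function app's configuration in Azure):

        # Read the endpoint and key from app settings instead of hard-coding them.
        # FORM_RECOGNIZER_ENDPOINT and FORM_RECOGNIZER_KEY are example setting names.
        endpoint = os.environ["FORM_RECOGNIZER_ENDPOINT"]
        apim_key = os.environ["FORM_RECOGNIZER_KEY"]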

  5. Next, add code to query the service and get the returned data.

        resp = requests.post(url=post_url, data=source, headers=headers)

        if resp.status_code != 202:
            print("POST analyze failed:\n%s" % resp.text)
            quit()
        print("POST analyze succeeded:\n%s" % resp.headers)
        get_url = resp.headers["operation-location"]

        # The Layout API is asynchronous, so wait before requesting the results
        wait_sec = 25
        time.sleep(wait_sec)

        resp = requests.get(url=get_url, headers={"Ocp-Apim-Subscription-Key": apim_key})
        resp_json = json.loads(resp.text)
        status = resp_json["status"]

        if status == "succeeded":
            print("Layout analysis succeeded")
            results = resp_json
        else:
            print("GET layout results failed:\n%s" % resp.text)
            quit()
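
    The fixed 25-second wait is enough for small documents, but because the Layout API is asynchronous, a polling loop is more robust for larger files. A sketch of that variation (the attempt count and interval are arbitrary example values):

        # Poll the operation-location URL until the analysis finishes or the retries run out.
        results = None
        for attempt in range(30):
            resp = requests.get(url=get_url, headers={"Ocp-Apim-Subscription-Key": apim_key})
            resp_json = json.loads(resp.text)
            status = resp_json["status"]
            if status == "succeeded":
                results = resp_json
                break
            if status == "failed":
                logging.error("Layout analysis failed:\n%s", resp.text)
                break
            time.sleep(10)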
    
    
  6. Add the following code to connect to the Azure Storage output container. Fill in your own values for the storage account name and key. You can get the key on the Access keys tab of your storage resource in the Azure portal.

        # This is the connection to the blob storage, with the Azure Python SDK
        blob_service_client = BlobServiceClient.from_connection_string("DefaultEndpointsProtocol=https;AccountName=<storage account name>;AccountKey=<storage account key>;EndpointSuffix=core.chinacloudapi.cn")
        container_client = blob_service_client.get_container_client("output")
    

    The following code parses the returned Document Intelligence response, constructs a .csv file, and uploads it to the output container.

    Important

    You will likely need to edit this code to match the structure of your own documents.

        # The code below extracts the json format into tabular data.
        # Please note that you need to adjust the code below to your form structure.
        # It probably won't work out-of-the-box for your specific form.
        pages = results["analyzeResult"]["pageResults"]
    
        def make_page(p):
            res=[]
            res_table=[]
            y=0
            page = pages[p]
            for tab in page["tables"]:
                for cell in tab["cells"]:
                    res.append(cell)
                    res_table.append(y)
                y=y+1
    
            res_table=pd.DataFrame(res_table)
            res=pd.DataFrame(res)
            res["table_num"]=res_table[0]
            h=res.drop(columns=["boundingBox","elements"])
            h.loc[:,"rownum"]=range(0,len(h))
            num_table=max(h["table_num"])
            return h, num_table, p
    
        h, num_table, p= make_page(0)
    
        for k in range(num_table+1):
            # reset_index keeps positional lookups valid for every table, not just the first
            new_table = h[h.table_num == k].reset_index(drop=True)
            new_table.loc[:, "rownum"] = range(0, len(new_table))
            row_table = pages[p]["tables"][k]["rows"]
            col_table = pages[p]["tables"][k]["columns"]
            b = np.zeros((row_table, col_table))
            b = pd.DataFrame(b)
            s = 0
            for i, j in zip(new_table["rowIndex"], new_table["columnIndex"]):
                b.loc[i, j] = new_table.loc[s, "text"]
                s = s + 1
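
    As written, the loop keeps only the last table in b, and make_page(0) reads only the first page, which matches the single-table sample form. If your documents contain several tables or pages, one possible variation is to build a DataFrame for every table directly from the response, for example:

        # Possible variation: collect every table from every page.
        # Each entry in all_tables is a (page index, table index, DataFrame) tuple.
        all_tables = []
        for p_num, page in enumerate(pages):
            for k, tab in enumerate(page["tables"]):
                df = pd.DataFrame(np.zeros((tab["rows"], tab["columns"])))
                for cell in tab["cells"]:
                    df.loc[cell["rowIndex"], cell["columnIndex"]] = cell["text"]
                all_tables.append((p_num, k, df))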
    
    
  7. Finally, the last block of code uploads the extracted table data to the output container in your blob storage account.

        # Here is the upload to the blob storage
        tab1_csv=b.to_csv(header=False,index=False,mode='w')
        name1=(os.path.splitext(text1)[0]) +'.csv'
        container_client.upload_blob(name=name1,data=tab1_csv)
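
    If you want to confirm the upload from the function itself, you can log the blobs that are now in the output container, for example:

        # Optional check: log the blobs currently stored in the output container.
        for blob in container_client.list_blobs():
            logging.info("output container blob: %s", blob.name)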
    

Run the function

  1. Press F5 to run the function again.

  2. Use Azure Storage Explorer to upload a sample PDF form to the input storage container. This action should trigger the script to run, and you should then see the resulting .csv file (displayed as a table) in the output container.

You can connect this container to Power BI to create rich visualizations of the data it contains.

Next steps

In this tutorial, you learned how to use an Azure Function written in Python to automatically process uploaded PDF documents and output their contents in a more data-friendly format. Next, learn how to use Power BI to display the data.