What is Personally Identifiable Information (PII) detection in Azure AI Language?
PII detection is one of the features offered by Azure AI Language, a collection of machine learning and AI algorithms in the cloud for developing intelligent applications that involve written language. The PII detection feature can identify, categorize, and redact sensitive information in unstructured text, such as phone numbers, email addresses, and forms of identification. Azure AI Language supports general text PII redaction, as well as Conversational PII, a specialized model for handling speech transcriptions and the more informal, conversational tone of meeting and call transcripts. The service also supports Native Document PII redaction, where the input and output are structured document files.
- Quickstarts are getting-started instructions to guide you through making requests to the service.
- How-to guides contain instructions for using the service in more specific or customized ways.
- The conceptual articles provide in-depth explanations of the service's functionality and features.
Typical workflow
To use this feature, you submit data for analysis and handle the API output in your application. Analysis is performed as-is, with no added customization to the model used on your data.
1. Create an Azure AI Language resource, which grants you access to the features offered by Azure AI Language. It generates a password (called a key) and an endpoint URL that you use to authenticate API requests.
2. Create a request using either the REST API or the client library for C#, Java, JavaScript, and Python. You can also send asynchronous calls with a batch request to combine API requests for multiple features into a single call.
3. Send the request containing your text data. Your key and endpoint are used for authentication.
4. Stream or store the response locally.
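The workflow steps above can be sketched in Python using only the standard library. This is a minimal illustration rather than the official client: the helper names `build_pii_request` and `summarize_response` are hypothetical, the API version `2022-05-01` and the `/language/:analyze-text` route are assumptions to verify against the REST API documentation, and the endpoint and key shown are placeholders.

```python
import json
import urllib.request

# Assumption: a generally available REST API version; confirm it in the REST reference.
API_VERSION = "2022-05-01"


def build_pii_request(endpoint, key, texts, language="en"):
    """Build (but do not send) a PII detection request for a list of strings."""
    body = {
        "kind": "PiiEntityRecognition",
        "parameters": {"modelVersion": "latest"},
        "analysisInput": {
            "documents": [
                {"id": str(i), "language": language, "text": text}
                for i, text in enumerate(texts, start=1)
            ]
        },
    }
    url = f"{endpoint.rstrip('/')}/language/:analyze-text?api-version={API_VERSION}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            # The key from your Language resource authenticates the call.
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def summarize_response(response):
    """Map each document id to (redacted text, [(category, text), ...])."""
    summary = {}
    for doc in response["results"]["documents"]:
        entities = [(e["category"], e["text"]) for e in doc["entities"]]
        summary[doc["id"]] = (doc["redactedText"], entities)
    return summary


# Placeholder endpoint and key; replace them with the values from your resource.
request = build_pii_request(
    "https://example-resource.cognitiveservices.azure.com",
    "your-key",
    ["Call me at 555-0100."],
)
# Sending needs a real resource, so the network call is left commented out:
# with urllib.request.urlopen(request) as reply:
#     print(summarize_response(json.load(reply)))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any data leaves your application. The response fields read by `summarize_response` (`redactedText`, `entities`, `category`, `text`) are assumed from the synchronous response shape and should be checked against the reference documentation.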
Native document support
A native document refers to the file format used to create the original document, such as Microsoft Word (.docx) or Portable Document Format (.pdf). Native document support eliminates the need for text preprocessing prior to using Azure AI Language resource capabilities. Currently, native document support is available for the PiiEntityRecognition capability.
Currently, PII detection supports the following native document formats:
File type | File extension | Description |
---|---|---|
Text | .txt | An unformatted text document. |
Adobe PDF | .pdf | A Portable Document Format (PDF) document. |
Microsoft Word | .docx | A Microsoft Word document file. |
For more information, see Use native documents for language processing.
Get started with PII detection
To use PII detection, you submit text for analysis and handle the API output in your application. Analysis is performed as-is, with no customization to the model used on your data. There are two ways to use PII detection:
Development option | Description |
---|---|
Language Studio | Language Studio is a web-based platform that lets you try PII detection with text examples without an Azure account, and with your own data when you sign up. For more information, see the Language Studio website or the Language Studio quickstart. |
REST API or client library (Azure SDK) | Integrate PII detection into your applications using the REST API, or the client library available in various languages. For more information, see the PII detection quickstart. |
Reference documentation and code samples
As you use this feature in your applications, see the following reference documentation and samples for Azure AI Language:
Development option / language | Reference documentation | Samples |
---|---|---|
REST API | REST API documentation | |
C# | C# documentation | C# samples |
Java | Java documentation | Java samples |
JavaScript | JavaScript documentation | JavaScript samples |
Python | Python documentation | Python samples |
Example scenarios
- Apply sensitivity labels - For example, based on the results from the PII service, a public sensitivity label might be applied to documents where no PII entities are detected. For documents where US addresses and phone numbers are recognized, a confidential label might be applied. A highly confidential label might be used for documents where bank routing numbers are recognized.
- Redact some categories of personal information from documents that get wider circulation - For example, if customer contact records are accessible to frontline support representatives, the company can redact all of the customer's personal information other than their name from that version of the customer history to preserve the customer's privacy.
- Redact personal information in order to reduce unconscious bias - For example, during its resume review process, a company can block names, addresses, and phone numbers to help reduce unconscious gender or other biases.
- Replace personal information in source data for machine learning to reduce unfairness - For example, if you want to remove names that might reveal gender when training a machine learning model, you could use the service to identify them and then replace them with generic placeholders for model training.
- Remove personal information from call center transcription - For example, if you want to remove names or other PII that come up between the agent and the customer in a call center scenario, you could use the service to identify and remove them.
- Data cleaning for data science - PII detection can be used to prepare data for data scientists and engineers who train machine learning models, by redacting the data so that customer data isn't exposed.
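Two of the scenarios above, applying sensitivity labels and replacing personal information with generic placeholders, can be sketched as follows. This is an illustrative sketch, not a recommended policy: the category names (`ABARoutingNumber`, `Address`, `PhoneNumber`, `Person`), the label names, the "needs review" fallback, and the helper functions are all assumptions, and the entity offsets are assumed to be expressed in Unicode code points so that they line up with Python string indexing.

```python
# Hypothetical policy: category and label names are illustrative assumptions.
HIGHLY_CONFIDENTIAL = {"ABARoutingNumber"}
CONFIDENTIAL = {"Address", "PhoneNumber"}


def choose_label(categories):
    """Pick a sensitivity label from the PII categories found in a document."""
    found = set(categories)
    if found & HIGHLY_CONFIDENTIAL:
        return "highly confidential"
    if found & CONFIDENTIAL:
        return "confidential"
    if not found:
        return "public"
    # Assumption: any other PII is routed to a person instead of auto-labeled.
    return "needs review"


def replace_with_placeholders(text, entities):
    """Replace each detected entity span with a placeholder such as [Person]."""
    result = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for entity in sorted(entities, key=lambda e: e["offset"], reverse=True):
        start = entity["offset"]
        end = start + entity["length"]
        result = result[:start] + f"[{entity['category']}]" + result[end:]
    return result


# Hand-written sample entities in the assumed response shape.
sample_text = "Ana called 555-0100."
sample_entities = [
    {"category": "Person", "offset": 0, "length": 3},
    {"category": "PhoneNumber", "offset": 11, "length": 8},
]
print(choose_label(e["category"] for e in sample_entities))
print(replace_with_placeholders(sample_text, sample_entities))
```

For the sample above, the label is "confidential" and the rewritten text is "[Person] called [PhoneNumber]." Placeholders that keep the category name preserve some structure for model training while removing the identifying value itself.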
Next steps
There are two ways to get started using the PII detection feature:
- Language Studio, which is a web-based platform that enables you to try several Language service features without needing to write code.
- The quickstart article, which provides instructions for making requests to the service using the REST API or a client library.