Harnessing the Power of SLMs for Context-Aware PII Detection

7th Jan 2025 | Aamir Faaiz

In the age of data-driven innovation, safeguarding Personally Identifiable Information (PII) is a critical responsibility for organisations leveraging AI tools. The challenge of ensuring privacy compliance becomes even more complex when sharing data with external large language models (LLMs). One promising solution to this problem is the use of Small Language Models (SLMs) equipped with context-aware PII detection capabilities. This article explores how SLMs can serve as efficient data anonymisation assistants through prompt engineering techniques, ensuring data privacy without sacrificing performance.

1. What Are Small Language Models (SLMs)?

SLMs are lightweight language models designed to operate efficiently with limited computational resources. Unlike LLMs, which require extensive GPU clusters, SLMs can often run on edge devices or local servers, offering lower latency and enhanced data security. Their smaller size makes them ideal for tasks like PII detection and redaction, especially where real-time processing is required or compute resources are constrained.

2. Context-Aware PII Detection with SLMs

PII detection often requires contextual understanding to distinguish between sensitive and non-sensitive information. For instance, the name “John” might be PII in a sentence like "John's phone number is 123-456-7890," but not in "John is a common name." SLMs can be fine-tuned or configured with prompt engineering to provide this nuanced detection.

A sample prompt engineering approach for PII detection might look like:

You are a data anonymization assistant. Identify and redact all PII from the following text:
"John Doe, born on July 4, 1980, 
lives at 123 Main Street, Springfield, 
and can be contacted at john.doe@example.com or 555-123-4567."

The SLM’s output:

"[REDACTED NAME], born on [REDACTED DATE], 
lives at [REDACTED ADDRESS], 
and can be contacted at [REDACTED EMAIL] or [REDACTED PHONE NUMBER]."

3. Implementing SLMs for PII Anonymisation

Here’s a Python-based implementation of an SLM pipeline for context-aware PII detection using the Hugging Face Transformers library:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def anonymize_text(input_text, model_name="google/flan-t5-small"):
    """
    Anonymize PII using a small language model.
    
    Args:
        input_text (str): The text to be anonymized.
        model_name (str): Hugging Face model identifier.

    Returns:
        str: Anonymized text.
    """
    # Load the tokenizer and model. An instruction-tuned checkpoint such as
    # google/flan-t5-small follows prompts far more reliably than base T5.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    prompt = (
        "You are a data anonymization assistant. Anonymize the following text:\n"
        f"{input_text}"
    )
    inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(**inputs, max_length=512, num_beams=4, early_stopping=True)
    anonymized_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return anonymized_text

# Example usage
text = (
    "John Doe, born on July 4, 1980, lives at 123 Main Street, Springfield, and can be contacted at "
    "john.doe@example.com or 555-123-4567."
)

anonymized_result = anonymize_text(text)
print("Anonymized Text:", anonymized_result)

This simplified implementation demonstrates how prompt engineering with SLMs can anonymise sensitive data efficiently; in practice, redaction quality depends heavily on the chosen model and prompt.
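Because a small model can still miss well-structured identifiers, one common mitigation (an assumption here, not part of the implementation above) is to layer a deterministic pass over the model's output. The sketch below is a hypothetical safety net using Python's `re` module; the patterns are illustrative only, and a production system would rely on a vetted library such as Microsoft Presidio rather than ad-hoc regexes.

```python
import re

# Hypothetical safety-net patterns, for illustration only.
PII_PATTERNS = {
    "[REDACTED EMAIL]": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "[REDACTED PHONE NUMBER]": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def regex_safety_net(text):
    """Redact well-structured PII the model may have missed."""
    for placeholder, pattern in PII_PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text

print(regex_safety_net("Reach me at john.doe@example.com or 555-123-4567."))
# → Reach me at [REDACTED EMAIL] or [REDACTED PHONE NUMBER].
```

Running the SLM first and the regex pass second gives a simple hybrid: the model handles context-dependent PII such as names, while the deterministic pass backstops formats that are easy to match mechanically.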

4. Use Case: Healthcare Data Sharing

Consider a healthcare organisation sharing patient records with an external AI vendor to analyse treatment effectiveness. Patient data often contains PII like names, dates of birth, and addresses, which need anonymisation before being processed externally.

An anonymisation pipeline using SLMs could:

  1. Ingest Data: Parse patient records from structured or unstructured sources.
  2. Detect and Redact PII: Use SLMs to identify and replace PII with placeholders.
  3. Output Anonymised Data: Forward the anonymised data to the external AI system for analysis.

For example:

Original Record:
"Patient: Jane Smith, DOB: 1990-05-12, Address: 456 Elm St, Contact: 555-987-6543."

Anonymised Output:
"Patient: [REDACTED NAME], DOB: [REDACTED DATE], Address: [REDACTED ADDRESS], Contact: [REDACTED PHONE NUMBER]."

This approach ensures compliance with regulations like GDPR and HIPAA while enabling data-driven insights.

5. Challenges and Future Directions

While SLMs offer several advantages, they are not without limitations:

  1. Accuracy: SLMs may struggle with ambiguous contexts or novel PII formats.
  2. Integration Complexity: Building seamless anonymisation pipelines requires robust orchestration.
  3. Limited Knowledge: SLMs may lack the extensive training data and knowledge base of LLMs.
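The accuracy limitation in particular can be measured rather than guessed at. A minimal sketch, using invented data, that scores what fraction of hand-labelled PII strings no longer appear in the anonymised output:

```python
def redaction_recall(outputs, gold_pii):
    """Fraction of labelled PII strings absent from the anonymised outputs."""
    total = caught = 0
    for out, pii_items in zip(outputs, gold_pii):
        for item in pii_items:
            total += 1
            if item not in out:  # PII string was successfully removed
                caught += 1
    return caught / total if total else 1.0

# Invented example: the name was redacted, but the address slipped through.
outputs = ["[REDACTED NAME] lives at 123 Main Street."]
gold = [["John Doe", "123 Main Street"]]
print(redaction_recall(outputs, gold))  # → 0.5
```

Tracking a metric like this on a held-out labelled set gives an early warning when an SLM starts missing ambiguous contexts or novel PII formats.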

Small Language Models represent a promising avenue for context-aware PII detection, offering a balance between performance, efficiency, and privacy. By leveraging prompt engineering, organisations can deploy SLMs as data anonymisation assistants, ensuring compliance with privacy regulations while unlocking the value of sensitive data.

Whether in healthcare, finance, or customer support, SLMs provide a scalable and secure solution for safeguarding PII in an increasingly AI-driven world.

Interested in implementing AI agents for your business?

Get in touch with Bayseian for tailored solutions