Protecting RAG Pipelines

RAG systems often copy raw documents into chunk stores, vector indexes, logs, and prompt history. The safest pattern is to detect sensitive values before embedding and store only a redacted or masked version of the content.

Core pattern

from fastpii import PrivacyGuard

guard = PrivacyGuard(regions=["cz"])

document = """
Student: Jan Novak
Rodné číslo: 850101/1234
Address: Vodičkova 10, Praha
"""

result = guard.detect(document)
safe_for_embeddings = guard.redact(document)
safe_for_search = guard.mask(document)

print(result.findings)
print(safe_for_embeddings)
print(safe_for_search)

Use redact() when the index should never expose the original value. Use mask() when users still need partial context during retrieval or debugging.
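To make the difference between the two strategies concrete, here is a minimal sketch using plain regular expressions on a Czech birth-number pattern. The functions `redact_birth_numbers` and `mask_birth_numbers` are hypothetical stand-ins written for illustration; they are not FastPII's implementation, and the library's actual placeholder text and masking rules may differ.

```python
import re

# Hypothetical stand-ins that illustrate the two strategies on a Czech
# birth-number pattern; they are NOT FastPII's implementation, and the
# library's actual placeholder text and masking rules may differ.
BIRTH_NUMBER = re.compile(r"\b(\d{6})/(\d{3,4})\b")


def redact_birth_numbers(text: str) -> str:
    """Replace the whole value so nothing of the original survives."""
    return BIRTH_NUMBER.sub("[REDACTED]", text)


def mask_birth_numbers(text: str) -> str:
    """Keep the last two digits as partial context and hide the rest."""
    def _mask(match: re.Match) -> str:
        suffix = match.group(2)
        return "*" * 6 + "/" + "*" * (len(suffix) - 2) + suffix[-2:]
    return BIRTH_NUMBER.sub(_mask, text)


line = "Rodné číslo: 850101/1234"
print(redact_birth_numbers(line))  # Rodné číslo: [REDACTED]
print(mask_birth_numbers(line))    # Rodné číslo: ******/**34
```

A redacted value cannot be told apart from any other redacted value, while a masked value leaves enough of a suffix to match a record by eye during debugging.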

Pre-index sanitization workflow

1. Load the raw document.
2. Run detect() and validate() on suspicious values during ingestion.
3. Save the original only in a restricted source system.
4. Store redact() or mask() output in the embedding pipeline.
5. Retrieve sanitized chunks for downstream prompting.

from fastpii import PrivacyGuard

guard = PrivacyGuard(regions=["cz"])

def prepare_chunk_for_embedding(chunk: str) -> str:
    result = guard.detect(chunk)
    if result.findings:
        return guard.redact(chunk)
    return chunk
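The storage steps of the workflow above can be sketched as follows. This is an illustration under stated assumptions: `IngestionStores` and `ingest` are hypothetical names, the two dictionaries stand in for a restricted source system and an embedding queue, and the sanitizer is passed in as a plain callable so the sketch does not depend on any particular FastPII call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class IngestionStores:
    """Hypothetical in-memory stand-ins for the two storage tiers."""
    restricted_originals: dict[str, str] = field(default_factory=dict)
    embedding_queue: dict[str, str] = field(default_factory=dict)


def ingest(
    document_id: str,
    raw_text: str,
    sanitize: Callable[[str], str],
    stores: IngestionStores,
) -> None:
    # The original goes only to the restricted source system.
    stores.restricted_originals[document_id] = raw_text
    # Only the sanitized text is handed to the embedding pipeline.
    stores.embedding_queue[document_id] = sanitize(raw_text)
```

A function such as prepare_chunk_for_embedding can be passed as the `sanitize` argument. Keeping the two writes in one place makes it harder for a later code path to push raw text into the embedding queue.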

FastAPI ingestion endpoint

If you already use the FastPII FastAPI integration, initialize your app with create_app and keep the sanitization logic inside the document ingestion route.

from fastapi import FastAPI
from pydantic import BaseModel

from fastpii import PrivacyGuard
from fastpii.integrations.fastapi import create_app

guard = PrivacyGuard(regions=["cz"])
app: FastAPI = create_app()


class IngestRequest(BaseModel):
    document_id: str
    content: str


@app.post("/ingest")
def ingest_document(payload: IngestRequest):
    # Count findings on the raw content, then keep only the redacted text.
    result = guard.detect(payload.content)
    sanitized = guard.redact(payload.content)

    # Persist sanitized to your chunk store / vector pipeline.
    return {
        "document_id": payload.document_id,
        "findings_count": len(result.findings),
        "sanitized_content": sanitized,
    }

LangChain indexing pattern

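FastPII can sit between your loader and your embedder: sanitize each document's text after loading and pass only the sanitized text to the embedding step. The function here works on plain strings for brevity.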

from fastpii import PrivacyGuard

guard = PrivacyGuard(regions=["cz"])


def sanitize_documents(documents: list[str]) -> list[str]:
    return [guard.redact(doc) for doc in documents]


raw_documents = [
    "Customer Jan Novak, rodné číslo 850101/1234, requested support.",
    "Billing contact: jana.novak@example.com",
]

documents_for_embeddings = sanitize_documents(raw_documents)
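LangChain loaders return document objects rather than bare strings, so the same idea can be applied to the text field while leaving metadata in place. The sketch below is an assumption-laden illustration: the `Document` dataclass is a stand-in that mirrors the `page_content` and `metadata` fields of LangChain's document class, `sanitize_loaded_documents` is a hypothetical helper, and the redaction function is passed in as a callable (for example `guard.redact`).

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Document:
    """Stand-in with the same two fields as LangChain's Document class."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def sanitize_loaded_documents(
    documents: list[Document],
    redact: Callable[[str], str],
) -> list[Document]:
    # Build new objects so the loader's raw output is never mutated
    # and cannot be embedded by accident.
    return [
        Document(page_content=redact(doc.page_content), metadata=dict(doc.metadata))
        for doc in documents
    ]
```

Only `page_content` is sanitized here. If loader metadata can carry sensitive values such as author names or file paths, it would need the same treatment before indexing.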

Recommended operating model

Keep raw documents outside the vector store.
Embed only sanitized text.
Log the number of findings from detect(), not the matched values, for auditability.
Prefer redact() for shared indexes and mask() for internal retrieval systems.
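The audit-logging recommendation can be sketched with the standard library logger. `log_findings_count` and the logger name `rag.ingestion` are hypothetical, and the sketch assumes only that the detection result exposes a `findings` sequence, as in the examples on this page.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("rag.ingestion")


def log_findings_count(
    document_id: str,
    text: str,
    detect: Callable[[str], Any],
) -> int:
    # Assumes the detection result exposes a `findings` sequence,
    # as in the examples on this page.
    count = len(detect(text).findings)
    # Log only the identifier and the count, never the matched values.
    logger.info("document_id=%s findings_count=%d", document_id, count)
    return count
```

Passing `guard.detect` as the `detect` argument keeps the audit trail free of the sensitive values themselves, since logs are one of the places raw content tends to leak into.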
