Control 1.13: Sensitive Information Types (SITs) and Pattern Recognition
Control ID: 1.13
Pillar: Security
Regulatory Reference: FINRA 4511, FINRA 25-07, SEC 17a-3, SEC 17a-4, SEC Reg S-P, GLBA 501(b), SOX 404, PCI-DSS, CFTC 1.31 (swap dealers), OCC 2011-12 / Fed SR 11-7 (model risk for trainable classifiers)
Last UI Verified: April 2026
Governance Levels: Baseline / Recommended / Regulated
Objective
Configure Sensitive Information Types (SITs) to automatically detect and classify financial data, including customer NPI (SSN, account numbers), regulatory identifiers (CRD, CUSIP), and material non-public information (MNPI), so that DLP policies and AI data governance controls can act on it.
Why This Matters for FSI
- GLBA 501(b): Enables automatic detection and protection of customer NPI
- SEC Reg S-P (2024 Amendments): Identifies customer information (broadened to include NPI from other financial institutions) requiring safeguards, incident response programs, and 30-day breach notification
- FINRA 4511: Classifies records containing customer information for retention
- FINRA 25-07: SIT-based classification supports required governance policies for privacy, integrity, and accuracy of data processed by AI tools
- PCI-DSS: Detects credit card numbers in in-scope systems
- SOX 404: Identifies financial reporting data for integrity controls
No companion solution by design
Not all controls have a companion solution in FSI-AgentGov-Solutions; solution mapping is selective by design. This control is operated via native Microsoft admin surfaces and verified by the framework's assessment-engine collectors. See the Solutions Index for the catalog and coverage scope.
Control Description
This control establishes Sensitive Information Types and adjacent classification primitives as the foundation of data classification in Microsoft 365. Microsoft Purview treats these as distinct detection paths, not a single feature, and FSI deployments must reason about them separately:
- Built-in pattern SITs — Microsoft-provided regex/checksum/proximity detectors (e.g., U.S. Social Security Number, Credit Card Number with Luhn, U.S. Bank Account Number, ABA Routing Number, ITIN, EIN, CUSIP). Available with base Purview entitlement.
- Custom pattern SITs — Organization-specific detectors built from primary elements, supporting elements, regex (with optional validators / checksums), keyword lists or dictionaries, proximity, and Low / Medium / High confidence levels. Examples relevant to FSI include internal account-number formats, FINRA CRD Number (custom — no checksum exists), and MNPI indicator keyword-dictionary SITs.
- Named Entity Recognition (NER) SITs — Bundled and unbundled named-entity detectors (e.g., "All full names", "All physical addresses", medical conditions). NER is used inside DLP, sensitivity labels, retention, eDiscovery, and DSPM-for-AI signals. Requires Microsoft 365 E5 / E5 Compliance entitlement.
- Trainable classifiers — ML-based complement to pattern SITs. Useful where regex cannot reasonably express intent (e.g., MNPI, source code, customer complaints). See the FSI governance gate below before relying on a trainable classifier as a primary control.
- Exact Data Match (EDM) — Detection against a one-way salted hash of customer master data uploaded via the EDM Upload Agent. Required where deterministic match against actual customer records is needed (e.g., specific account numbers, beneficiary records).
- Keyword dictionaries — Reusable term lists referenced by SITs (e.g., competitor names, restricted entities, project codenames).
These primitives are then consumed by DLP (including DLP for Microsoft 365 Copilot), sensitivity labels, retention policies, Communication Compliance, eDiscovery, and DSPM for AI to deliver the protections required under FINRA 4511, SEC Reg S-P, GLBA 501(b), and FINRA 25-07.
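To make the pattern-SIT mechanics concrete, the following is a minimal Python sketch of how a regex primary element, a checksum validator, and keyword proximity can combine into a confidence level. It illustrates the concept only and is not Purview's implementation: the regex, the `KEYWORDS` list, the `PROXIMITY` window, and the High / Low mapping are all assumptions chosen for illustration.

```python
import re

# Hypothetical supporting keywords; a real SIT would use a curated keyword list.
KEYWORDS = ("card", "visa", "mastercard", "credit", "expires")
# Primary element: 16 digits with optional space or hyphen separators.
CARD_RE = re.compile(r"\b(?:\d[ -]?){15}\d\b")
# Illustrative proximity window in characters (an assumption, not a Purview default).
PROXIMITY = 300


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def detect_cards(text: str) -> list[tuple[str, str]]:
    """Return (digits, confidence label) for each checksum-valid candidate."""
    results = []
    lowered = text.lower()
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if not luhn_ok(digits):
            continue  # checksum failure: discard the candidate entirely
        window = lowered[max(0, m.start() - PROXIMITY): m.end() + PROXIMITY]
        has_keyword = any(k in window for k in KEYWORDS)
        results.append((digits, "High" if has_keyword else "Low"))
    return results
```

The design point is that the checksum removes random digit strings outright, while supporting evidence only raises confidence. A detector with no checksum, such as a custom CRD pattern, has nothing but the regex and the supporting evidence to lean on.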
License & Entitlement
| Capability | Minimum License (verify per release) |
|---|---|
| Built-in & custom pattern SITs in DLP / labels / retention | Microsoft Purview (base) |
| Custom SIT keyword dictionaries | Microsoft Purview (base) |
| Credential-scanning SITs | Microsoft 365 E5 / E5 Compliance |
| Named Entity Recognition (NER) — bundled & unbundled | Microsoft 365 E5 / E5 Compliance |
| Exact Data Match (EDM) — production use | Microsoft 365 E5 / E5 Compliance |
| Trainable classifiers (custom) | Microsoft 365 E5 / E5 Compliance |
| DSPM for AI consumption of SITs | DSPM for AI entitlement (per current Microsoft licensing) |
| DLP for Microsoft 365 Copilot using SIT conditions | Microsoft 365 E5 / E5 Compliance + Copilot entitlement (verify SIT-based prompt-restriction GA vs preview at publication time) |
| Communication Compliance use of SITs | Communication Compliance (E5 / add-on) |
| eDiscovery Premium use of SITs in conditions | E5 Compliance / eDiscovery Premium add-on |
Re-verify against the Microsoft 365 security & compliance licensing guidance before signing customer entitlements.
Sovereign Cloud Availability
| Capability | Commercial | GCC | GCC High | DoD |
|---|---|---|---|---|
| Built-in & custom pattern SITs | GA | GA | GA | GA |
| Exact Data Match (EDM) | GA | Verify per release | Verify per release | Verify per release |
| Named Entity Recognition (NER) | GA | Verify per release | Verify per release | Verify per release |
| Trainable classifiers (built-in / custom) | GA / Preview varies by classifier | Often lagging | Often lagging | Often lagging |
| DLP for Microsoft 365 Copilot | GA / Preview varies by feature | Verify per release | Verify per release | Verify per release |
| DSPM for AI | GA | Verify per release | Verify per release | Verify per release |
FSI broker-dealer and federal-adjacent advisory tenants on GCC / GCC High / DoD must re-verify against the Microsoft 365 government service description before relying on EDM, NER, trainable classifiers, DLP-for-Copilot, or DSPM for AI in production. Treat any cross-cloud parity gap as a compensating-control conversation, not an assumption of feature parity.
Trainable Classifier — FSI Governance Gate
Before relying on a trainable classifier (built-in or custom) as a primary detection control:
- Confirm the classifier is GA, not Preview, for the targeted workload.
- Sample and document training-data provenance, quality, and review per OCC 2011-12 / Fed SR 11-7 model-risk expectations.
- Do not use a trainable classifier as the sole evidence for record-classification under SEC 17a-4 / FINRA 4511 — pair with a deterministic SIT or sensitivity label so the auditable record has a reproducible match.
- Re-validate accuracy on a defined cadence (recommended quarterly for Zone 3) and capture the validation in the SIT/classifier evidence file.
- Custom trainable classifiers are English-only at GA — confirm regional language coverage before scoping multilingual customer datasets.
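The periodic re-validation in the gate above can be captured as a small, repeatable calculation. The sketch below assumes a reviewer-labeled sample in which each item has a classifier verdict and a human verdict; the `ValidationResult` type and `validate` function are hypothetical names, and the acceptance thresholds remain a decision for the firm's model-risk function.

```python
from dataclasses import dataclass


@dataclass
class ValidationResult:
    precision: float
    recall: float
    false_positives: int
    false_negatives: int


def validate(predicted: list[bool], actual: list[bool]) -> ValidationResult:
    """Compare classifier output with reviewer labels on the same sample."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must be the same length")
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if a and not p)
    # Guard against empty denominators when the sample has no positives.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return ValidationResult(precision, recall, fp, fn)
```

Recording the false-positive and false-negative counts alongside the ratios keeps the evidence file reproducible when sample sizes change between validation cycles.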
Copilot / Agent Integration
SITs do not protect Microsoft 365 Copilot or agent grounding directly — they are the detection primitive that feeds the policies that do:
- DLP for Microsoft 365 Copilot — DLP policies scoped to the Copilot workload use SIT conditions to exclude items from Copilot processing or restrict prompts that contain matched SITs. Required for FINRA 25-07 governance of AI data inputs. Confirm SIT-based prompt restriction GA vs preview status at publication time.
- Sensitivity labels — Auto-labeling rules driven by SITs can mark items as excluded from Copilot processing, which is the GA-level integration today.
- DSPM for AI — Surfaces "sensitive data shared with Copilot" telemetry using the same SITs. Tuning a SIT in this control directly affects DSPM-for-AI signal quality. Note that Microsoft is consolidating the classic AI Hub experience into a unified DSPM experience — reference the capability generically rather than hard-coding a portal name.
- Agent grounding (Copilot Studio, Microsoft 365 Agents) — Agents inherit DLP-for-Copilot exclusions when configured under a governed environment (see Control 2.x — Environment Governance) and when the agent's grounding sources are SharePoint / Exchange / OneDrive content that DLP can evaluate.
A SIT false negative (e.g., SSN not detected) can result in customer NPI surfaced through Copilot output. Treat such failures as reportable events — see the Incident Handling cross-reference in Related Controls and the §1 FSI Incident Handling section of the troubleshooting playbook.
Key Configuration Points
- Review and enable the built-in financial / PII SITs in Purview > Data classification > Sensitive info types: U.S. Social Security Number, Credit Card Number (Luhn-validated), U.S. Bank Account Number, ABA Routing Number, U.S. Individual Taxpayer Identification Number (ITIN), U.S. Employer Identification Number (EIN), CUSIP, and the bundled NER SITs ("All full names", "All physical addresses") where E5 entitlement permits.
- Create custom pattern SITs for internal account-number formats: use anchored, scoped regex (avoid bare `^` / `$` and unbounded wildcards), rely on supporting evidence (keywords, proximity, dictionaries), and pin Low / Medium / High confidence to the SIT's actual numeric thresholds (which vary per SIT).
- Create the FINRA CRD Number SIT as a custom SIT: CRD numbers have no published checksum, so confidence tuning relies entirely on regex precision and keyword proximity. Set initial confidence to Medium, monitor false positives weekly during pilot, and never rely on CRD detection alone for blocking actions.
- Build the MNPI indicator SIT from a curated keyword dictionary plus deal-codename / project-name supporting elements; pair it with a trainable classifier only after the FSI Governance Gate above is met.
- Configure EDM for high-value customer master data: define the schema (≤32 columns per data source; the number of columns that can be marked searchable is lower), salt and hash the source dataset with the EDM Upload Agent, schedule daily refresh, and verify the rule package shows Active before relying on it for a DLP block action. Re-verify limits in Learn about EDM-based SITs at publication.
- Set DLP / auto-labeling confidence by level (Low / Medium / High) per use case: High for block actions, Medium for alert / educate, Low only for diagnostic policies. Underlying numeric thresholds (e.g., 85 / 75 / 65) are per-SIT, so open the SIT's entity definition before tuning.
- Test every SIT in Test pane → Content Explorer → DLP simulation → DLP-for-Copilot simulation before promoting to enforce. Capture seeded synthetic test data and the resulting matches as evidence (do not seed real customer NPI into test corpora).
- Maintain a false-positive / false-negative review cadence (recommended monthly for Zone 3) and route findings into the SIT evidence file.
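The EDM step above depends on matching hashes rather than raw customer values. The following Python sketch illustrates that idea with a keyed one-way hash over normalized values. It is a conceptual illustration only: the normalization rule, the use of HMAC-SHA256, and all function names are assumptions and do not describe the algorithm used by the EDM Upload Agent.

```python
import hashlib
import hmac


def normalize(value: str) -> str:
    """Drop separators and case so formatting differences do not change the hash."""
    return "".join(ch for ch in value if ch.isalnum()).upper()


def hash_value(value: str, salt: bytes) -> str:
    """One-way keyed hash (HMAC-SHA256) of the normalized value."""
    return hmac.new(salt, normalize(value).encode("utf-8"), hashlib.sha256).hexdigest()


def build_index(records: list[str], salt: bytes) -> set[str]:
    """Hash every source value; only hashes are stored, never the raw values."""
    return {hash_value(r, salt) for r in records}


def is_exact_match(candidate: str, index: set[str], salt: bytes) -> bool:
    """Deterministic lookup: the candidate matches only if its hash is in the index."""
    return hash_value(candidate, salt) in index
```

Because the lookup is deterministic, the index is only as current as its last refresh, which is why the refresh cadence and the Active status of the rule package both belong in the evidence file.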
Zone-Specific Requirements
| Zone | Requirement | Rationale |
|---|---|---|
| Zone 1 (Personal) | Built-in SITs only; alert-only DLP; High confidence threshold | Low risk, awareness-focused |
| Zone 2 (Team) | Built-in + foundational custom SITs; alert and educate; Medium–High confidence | Team data requires classification |
| Zone 3 (Enterprise) | Full SIT library + EDM + NER (where licensed) + governed trainable classifiers; block on High confidence; alert on Medium; DLP-for-Copilot and DSPM-for-AI integration required; sovereign-cloud parity confirmed for every primitive in scope | Customer-facing data requires comprehensive detection and AI-grounding controls |
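One way to keep the zone requirements above reviewable is to express them as data. The sketch below encodes the table as a lookup from zone and confidence level to a DLP action. The action names and the `no_action` default are hypothetical labels for illustration, not Purview settings.

```python
# Zone -> confidence level -> action, transcribed from the table above.
# Action names are illustrative labels, not Purview policy values.
ZONE_POLICY: dict[int, dict[str, str]] = {
    1: {"High": "alert"},
    2: {"High": "alert_and_educate", "Medium": "alert_and_educate"},
    3: {"High": "block", "Medium": "alert"},
}


def action_for(zone: int, confidence: str) -> str:
    """Return the action for a match, or 'no_action' if the zone ignores that level."""
    if zone not in ZONE_POLICY:
        raise ValueError(f"unknown zone: {zone}")
    return ZONE_POLICY[zone].get(confidence, "no_action")
```

Keeping the mapping in one structure makes it straightforward to diff against the deployed DLP rules during a control review.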
Roles & Responsibilities
| Role | Responsibility |
|---|---|
| Purview Info Protection Admin | Create and manage SITs, tune confidence levels, deploy EDM rule packages |
| Purview Compliance Admin | Approve custom SIT definitions, review detections in Content Explorer / Activity Explorer |
| Purview Data Security AI Admin | Validate SIT coverage feeding DSPM for AI signals and DLP-for-Copilot policies |
| AI Administrator | Coordinate Copilot feature controls that depend on SIT-driven sensitivity labels |
| Compliance Officer | Validate regulatory mapping (FINRA 4511 / SEC 17a-3/4 / Reg S-P / GLBA) and accept residual risk |
| Legal (organizational function) | Review MNPI and privileged-information SIT definitions and keyword dictionaries |
Related Controls
| Control | Relationship |
|---|---|
| 1.5 - DLP and Sensitivity Labels | SITs are used in DLP policy conditions and as auto-labeling triggers |
| 1.6 - DSPM for AI | SIT detections drive AI data exposure monitoring; SIT false negatives degrade DSPM-for-AI signal quality |
| 1.7 - Audit Logging | SIT detection events surface in Unified Audit Log and Activity Explorer; required for SEC 17a-4 / FINRA 4511 evidence |
| 1.10 - Communication Compliance | SITs used in communication review policies for MNPI, suitability, and customer-NPI conditions |
| Incident Response Playbook (AI) | SIT false negatives that surface customer NPI through Copilot / agent grounding may trigger Reg S-P 30-day breach notification — triage as reportable events |
Implementation Playbooks
Step-by-Step Implementation
This control has detailed playbooks for implementation, automation, testing, and troubleshooting:
- Portal Walkthrough — Step-by-step portal configuration
- PowerShell Setup — Automation scripts
- Verification & Testing — Test cases and evidence collection
- Troubleshooting — Common issues and resolutions
Verification Criteria
Confirm control effectiveness by verifying:
- Built-in financial & PII SITs (SSN, Credit Card Number, U.S. Bank Account Number, ABA Routing Number, ITIN, EIN, CUSIP) appear in Purview > Data classification > Sensitive info types with the expected publisher.
- Custom SITs (e.g., FINRA CRD, internal account formats, MNPI keyword-dictionary SIT) appear with the documented publisher and confidence levels.
- Seeded synthetic test document is detected in Content Explorer within the documented telemetry window (typically up to 24 hours; verify against your tenant baseline rather than assuming a fixed SLA).
- Test DLP rule fires on the seeded SIT match across Exchange, SharePoint, OneDrive, and Teams workloads.
- DLP-for-Copilot policy containing the SIT correctly excludes / redacts the seeded item from Copilot grounding (verify with a controlled prompt and capture both input and output as evidence).
- DSPM for AI dashboard registers the SIT detection under "Sensitive data shared with Copilot" telemetry within the documented window.
- EDM rule package shows Active; hash refresh executed within the agreed cadence (typically daily); seeded exact-match record detected.
- NER-backed SITs (e.g., "All full names") return matches on a seeded NER test document (E5 license required).
- Sovereign-cloud features in use (EDM / NER / trainable classifier / DLP-for-Copilot / DSPM for AI) are confirmed available in the tenant's cloud (Commercial / GCC / GCC High / DoD) — capture the verification evidence per release.
- False-positive / false-negative review log maintained on a defined cadence (recommended monthly for Zone 3); findings routed back into SIT tuning and DLP policy adjustments.
- Trainable classifiers in use have current model-risk validation evidence per OCC 2011-12 / Fed SR 11-7 if they drive a primary control.
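Several of the criteria above are cadence checks (daily EDM refresh, monthly false-positive / false-negative review, quarterly classifier validation). A minimal sketch of an overdue check follows. The activity names and the day counts used to approximate "daily", "monthly", and "quarterly" are assumptions to adjust to the agreed cadence.

```python
from datetime import date

# Maximum allowed age in days per activity (approximations of the stated cadences).
CADENCE_DAYS = {
    "edm_refresh": 1,
    "fp_fn_review": 31,
    "classifier_validation": 92,
}


def overdue(last_run: dict[str, date], today: date) -> list[str]:
    """Return activities that were never run or are older than their cadence allows."""
    return sorted(
        name
        for name, limit in CADENCE_DAYS.items()
        if name not in last_run or (today - last_run[name]).days > limit
    )
```

An activity with no recorded run is reported as overdue, so a missing evidence entry cannot pass silently.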
Additional Resources
- Microsoft Learn: Sensitive Information Types Overview
- Microsoft Learn: SIT entity definitions (full catalog of built-in SITs)
- Microsoft Learn: Create Custom SITs
- Microsoft Learn: Keyword Dictionaries
- Microsoft Learn: Named Entities
- Microsoft Learn: Trainable classifiers — learn about
- Microsoft Learn: Trainable classifiers — built-in definitions
- Microsoft Learn: Exact Data Match (EDM) — overview
- Microsoft Learn: Exact Data Match — unified UX schema & rule package
- Microsoft Learn: DLP for Microsoft 365 Copilot location
- Microsoft Learn: DSPM for AI overview
Updated: April 2026 | Version: v1.4.0 | UI Verification Status: Current