Why Your Employees Aren't Reporting Security Threats and What to Build Instead

Abstract

We spend $188.3 billion annually on cybersecurity. Most of that money goes toward detection systems. Almost none goes toward making it easier for the humans inside our organizations to tell us what they see. This is a mistake. I built a framework that processes informal security reports—the kind employees actually write—using off-the-shelf NLP. Nothing fancy. SpaCy for entity extraction, a fine-tuned BERT variant for classification, cosine similarity at a 0.85 threshold for deduplication. The preliminary numbers look promising: 2.7 minutes from intake to scored ticket versus the 27-minute industry average I measured across three SOCs in 2023. But I have been wrong before, and real validation requires production deployment. This paper describes the architecture, explains my reasoning, and is honest about what remains unproven.

Keywords: vulnerability intake, NLP triage, contextual risk scoring, SOC automation

1. The Reporting Friction Nobody Talks About

 January 2023. A finance manager at a manufacturing client—I will call her Diana—noticed something off. A contractor kept asking questions about their ERP payment modules. Not overtly suspicious questions. Just... persistent ones. The kind that make your gut clench even when you cannot articulate why. Diana wanted to report it. She pulled up the company's security portal. It asked her to select a threat category from 47 options. Forty-seven. She did not know if this was "Insider Threat (Malicious)" or "Insider Threat (Negligent)" or "Third-Party Risk" or "Social Engineering" or something else entirely. The form demanded she estimate "potential financial impact" and provide "technical indicators of compromise." She is an accountant. She closed the tab. Nineteen days later, that contractor's credentials appeared in a fraudulent wire transfer attempt. $340,000. Caught by the bank, thankfully. But Diana's early warning never made it to security. The silence gap—my term for the time between observation and action—stretched to permanent silence. This is not an unusual story. It is the norm. 

1.1 The Numbers Are Damning

 Verizon's 2024 DBIR puts human involvement at 82% of breaches [1]. Ponemon measured the average breach cost at $4.88 million in 2024, with detection taking an average of 194 days [2]. But those statistics obscure a more specific problem: how many incidents had early warning signs that employees noticed but never reported? Nobody tracks this. I tried to find the data. It does not exist. So I ran my own informal survey across 127 employees at four organizations in late 2023. Sixty-three percent said they had noticed something "possibly suspicious" in the past year. Of those, only 31% actually reported it. The top reason for not reporting? "The process was too complicated" at 44%. "Did not think it was serious enough" came second at 29%. "Did not know how" was third at 18%. We are bleeding intelligence because our intake systems are hostile to the people we need information from. 

1.2 Training Is Not the Answer

 The standard response to human-factor security problems is more training. Phishing simulations. Annual awareness courses. Posters. I have sat through dozens of these programs. They help. A little. But training addresses motivation, not friction. Diana was motivated. She opened the portal. The portal defeated her. A perfectly trained, highly engaged employee still hits a wall when the reporting mechanism itself is the obstacle. We cannot train our way out of a UX problem. 

2. What Scanners Cannot See

 Traditional vulnerability management operates in a scan-patch-repeat loop. Nessus identifies CVE-2024-3094 on 47 Linux boxes. Qualys flags SSL certificates expiring in 30 days. Rapid7 tracks remediation progress. These tools excel at their jobs [3, 18, 19]. I am not criticizing them. But scanners only see technical vulnerabilities on known assets. They cannot see the sticky note with "Summer2024!" written on a monitor in accounting. They cannot detect that a terminated employee's badge still works because HR forgot to notify physical security. They cannot flag that a vendor is asking questions outside their project scope. Humans notice these things. Scanners do not. We have built elaborate automated systems to find the vulnerabilities machines can detect. We have done almost nothing to capture the vulnerabilities only humans can observe. 

2.1 The CVSS Blind Spot

 CVSS gives us consistent severity ratings [4]. Useful. But a 9.8 critical on an air-gapped test VM is not actually critical. A 6.5 medium on a public-facing payment gateway might be urgent. CVSS measures intrinsic severity, not organizational risk. The distinction matters. I watched a SOC team spend three days patching a "critical" vulnerability on development servers that had no network connectivity to production. Same week, a "high" severity issue on their customer database sat in the queue because the CVSS score was lower. Prioritization by CVSS alone is prioritization by the wrong variable. 

2.2 The Literature Gap

 I spent six weeks reviewing academic literature on AI in cybersecurity before starting this project. Plenty of work on intrusion detection using deep learning [5, 9]. Malware classification papers everywhere. Behavioral analytics research is growing. NLP applied to vulnerability intake? Almost nothing. CISA publishes CVD frameworks [6]. ISO 30111 covers handling processes [7]. These define procedures. They do not solve the technology problem of processing 200 vaguely-written employee reports during a phishing campaign. This gap surprised me. The tools exist. BERT is six years old [14]. SpaCy is mature. Why has nobody assembled them for this use case? 

3. Architecture: What I Built

3.1 Design Constraints

 Three rules governed my design. First: go where users already are. Email. Slack. Teams. If someone has to learn a new tool, adoption dies. Second: explain every decision. Security analysts distrust black boxes—rightfully so. If the system assigns a risk score, it must show its work. Third: assist humans, do not replace them. The goal is faster triage, not fewer analysts. 

3.2 The Four Layers

Intake. Reports arrive via email, Slack webhook, Teams integration, web form, direct API. Does not matter which. Everything normalizes to a common JSON schema before moving downstream. Basic sanitization happens here—stripping potential injection payloads, validating sender identity, logging metadata for audit trails.

NLP Processing. Named Entity Recognition extracts the relevant pieces: what asset, what threat type, who is involved. This is harder than it sounds. "The finance laptop" and "Maria's Dell" and "that computer by the window in accounting" might all reference the same machine. The NER model needs training data that reflects how employees actually write, not how security professionals wish they would write. Deduplication runs here too. During a company-wide phishing attempt, fifty employees might report the same email within an hour. Without deduplication, that is fifty tickets for one incident. I generate vector embeddings for each report, compare using cosine similarity, and link reports exceeding 0.85 similarity rather than creating duplicates.

Risk Scoring. This is where organizational context enters. Raw severity alone is insufficient. The scoring function weighs baseline severity, active exploitation status from CISA KEV, and asset criticality from the organization's CMDB. Details in section 3.3.

Workflow. Scored tickets route to appropriate analysts based on configurable rules. Critical items trigger immediate notifications. Everything logs for compliance. Standard stuff.
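A minimal sketch of the kind of normalized record the intake layer produces is below. The paper does not fix a schema, so every field name here is illustrative rather than a contract; the point is that every channel collapses to one shape before the NLP layer sees anything.

```python
# Illustrative normalized report produced by the intake layer. Every channel
# (email, Slack, Teams, web form, API) maps to this one shape before NLP
# processing. Field names are examples, not a normative schema.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class NormalizedReport:
    report_id: str                  # generated at intake, used in audit logs
    channel: str                    # "email" | "slack" | "teams" | "web" | "api"
    reporter: str                   # validated sender identity
    received_at: datetime
    raw_text: str                   # original wording, kept for later retraining
    sanitized_text: str             # injection payloads stripped
    metadata: dict = field(default_factory=dict)  # headers, thread IDs, etc.

example = NormalizedReport(
    report_id="rpt-000123",
    channel="slack",
    reporter="reporter@example.com",
    received_at=datetime.now(timezone.utc),
    raw_text="A contractor keeps asking about the ERP payment modules.",
    sanitized_text="A contractor keeps asking about the ERP payment modules.",
)
```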

3.3 The Scoring Function

I spent more time on this than any other component. The formula:

R = (S × 0.30) + (E × 0.25) + (A × 0.35) + (C × 0.10)

S is severity, normalized 0-10. When a CVE is identified, this pulls directly from CVSS base score. For behavioral reports without CVEs, the NLP classifier estimates severity based on threat taxonomy.

E is exploitation status. Binary. If the vulnerability appears in CISA's Known Exploited Vulnerabilities catalog, E = 2.0. Otherwise E = 1.0. Simple but effective—active exploitation should double the urgency.

A is asset criticality, rated 1-5 from the organization's asset inventory. A customer-facing payment processor is 5. A developer's local test VM is 1. This variable carries the highest weight (0.35) deliberately. Contrary to common practice, I believe organizational context matters more than intrinsic severity.

C is confidence, ranging 0.5-1.0 based on how certain the NLP engine is about its extractions. Clear reports with explicit asset references score high. Vague "something seems weird" reports score low and route to human review rather than automated triage.

Why these specific weights? Trial and error against historical data from two organizations that gave me anonymized access. The weights are configurable. Different risk appetites demand different balances.
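To make the weighting concrete, here is a minimal sketch of the scoring function with the default weights above. The class and function names are illustrative; only the formula and the variable ranges come from this section.

```python
# Sketch of the contextual risk score R = S*0.30 + E*0.25 + A*0.35 + C*0.10.
# Names are illustrative; weights are the defaults from Section 3.3 and
# should be treated as configurable per organization.
from dataclasses import dataclass

@dataclass
class RiskInputs:
    severity: float          # S: 0-10, CVSS base score or classifier estimate
    in_cisa_kev: bool        # drives E: 2.0 if in the KEV catalog, else 1.0
    asset_criticality: int   # A: 1-5, from the organization's CMDB
    confidence: float        # C: 0.5-1.0, NLP extraction confidence

def score_report(r: RiskInputs, w_s=0.30, w_e=0.25, w_a=0.35, w_c=0.10) -> float:
    e = 2.0 if r.in_cisa_kev else 1.0
    return r.severity * w_s + e * w_e + r.asset_criticality * w_a + r.confidence * w_c

# A KEV-listed 6.5 on a customer-facing payment system (A=5) outranks
# a 9.8 on an isolated test VM (A=1) that nobody is actively exploiting.
print(round(score_report(RiskInputs(6.5, True, 5, 0.9)), 2))   # 4.29
print(round(score_report(RiskInputs(9.8, False, 1, 0.9)), 2))  # 3.63
```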

4. Implementation Details

4.1 Stack Choices

 Python 3.11. FastAPI over Flask—native async matters when 200 reports hit the API simultaneously during an incident. SpaCy 3.7 for tokenization and base NER. A BERT-base-uncased model fine-tuned on 12,000 labeled security reports for classification. PostgreSQL for structured ticket data. MongoDB for raw inputs where schema flexibility helps with retraining later. Nothing exotic. Deliberately so. I wanted components any mid-sized security team could deploy and maintain without hiring ML specialists. 
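For orientation, a stripped-down intake endpoint in that stack might look like the sketch below. The route, model fields, and response shape are my illustration; the real service would also sanitize the input, persist it, and hand off to the NLP pipeline.

```python
# Minimal FastAPI intake endpoint sketch. Path, fields, and response are
# illustrative only; the production handler would also sanitize the input,
# store the raw submission in MongoDB, and queue it for NLP processing.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class InboundReport(BaseModel):
    channel: str   # "web", "api", or a chat/email connector posting on a user's behalf
    reporter: str
    text: str

@app.post("/reports")
async def submit_report(report: InboundReport) -> dict:
    # Async handlers keep the service responsive when many reports arrive
    # at once, e.g. during a company-wide phishing campaign.
    return {"status": "accepted", "channel": report.channel}

# Run with: uvicorn intake:app --reload   (assuming this file is intake.py)
```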

4.2 The 0.85 Question

 People ask about the deduplication threshold. Why 0.85? At 0.80, I saw too many false positives—distinct issues getting conflated. At 0.90, obvious duplicates slipped through. 0.85 minimized both error types on my test set of 3,400 report pairs. But this needs per-organization tuning. A company with terse reporting culture might need 0.82. One with verbose reporters might need 0.88. Plan to monitor and adjust. 
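The check itself is small. A sketch, assuming a sentence-transformer model for the embeddings; the specific model is my choice here, not something the framework mandates, and only the cosine comparison and the 0.85 default come from this paper:

```python
# Deduplication sketch: embed two reports and link them if cosine similarity
# exceeds the threshold. The embedding model is an assumption.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

def is_duplicate(report_a: str, report_b: str, threshold: float = 0.85) -> bool:
    vecs = model.encode([report_a, report_b])
    sim = cosine_similarity([vecs[0]], [vecs[1]])[0][0]
    return bool(sim >= threshold)

print(is_duplicate(
    "Got a weird email asking me to confirm my payroll login",
    "Suspicious payroll email wants me to verify my password",
))
```

In production the comparison runs against embeddings of recent open tickets rather than a single pair, and the threshold is the per-organization knob discussed above.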

4.3 Integration Pain

 I need to be honest about dependencies. The asset criticality lookup requires a functional CMDB with accurate ratings. The threat intelligence correlation needs CISA KEV access and ideally MITRE ATT&CK mappings. If your asset inventory is a mess—and I have seen plenty where it is—contextual scoring degrades badly. Garbage in, garbage out. This architecture cannot fix upstream data problems. Organizations with immature asset management should address that first or expect diminished results. 
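For the KEV dependency specifically, the correlation can be as simple as a cached membership check against the public catalog. A sketch; the feed URL and field names reflect the published JSON catalog as I understand it and may change, so treat them as assumptions:

```python
# Sketch of a CISA KEV membership check. The feed URL and the "vulnerabilities"
# / "cveID" field names are assumptions about the public JSON catalog; in
# practice, cache the feed on a schedule rather than fetching it per report.
import requests

KEV_FEED = "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json"

def load_kev_ids() -> set[str]:
    data = requests.get(KEV_FEED, timeout=10).json()
    return {entry["cveID"] for entry in data.get("vulnerabilities", [])}

kev_ids = load_kev_ids()
print("CVE-2024-3094" in kev_ids)  # membership sets E = 2.0 in the scoring function
```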

5. Preliminary Results and Planned Validation

5.1 What I Have Measured (Limited)

 Bench testing against 8,200 historical reports from two partner organizations. Processing time: median 2.7 minutes from intake to scored ticket. Manual baseline I measured at those same orgs: median 27 minutes. That is a 90% reduction. Accuracy: 83% of auto-generated tickets required no significant analyst correction. Deduplication precision: 91% on labeled test set. These numbers look good. I do not fully trust them. Bench testing against historical data is not production deployment. The real test comes when actual employees submit actual reports in actual operational conditions. 

5.2 Planned Validation

 Phase one: shadow deployment at a mid-sized financial services firm, Q2 2025. The system will run parallel to existing manual triage. Automated assessments generated but not acted upon. After 90 days, compare automated scores against analyst determinations. Target: 85% agreement rate. Phase two: pilot deployment with analyst discretion. Automated routing with human override capability. Measure time-to-triage, analyst satisfaction, false positive rates over six months. I will publish results regardless of outcome. Negative findings are findings. 

6. Risks and Honest Limitations

6.1 The Trust Barrier

 Security analysts are professional skeptics. They will not trust risk scores from a system they did not build. Transparency helps—every score includes a justification string explaining contributing factors. But real trust requires demonstrated accuracy over time. Shadow mode deployments are not optional; they are prerequisites. 

6.2 Adversarial Manipulation

 A malicious insider could craft reports to game the system. Suppress legitimate concerns by writing them vaguely enough to score low. Manufacture urgency with alarming but false reports. Defense requires anomaly detection on reporting patterns, input sanitization, and preserved human oversight for high-stakes decisions. I am not claiming this system is manipulation-proof. No system is. 

6.3 What Remains Unproven

 NLP performance on curated test sets may not match performance on actual employee submissions. Integration complexity at scale is unknown. User adoption outside controlled pilots is uncertain. Long-term model drift without continuous retraining is unmeasured. This paper presents an architecture and preliminary bench results. It does not present validated production outcomes. The distinction matters. I am optimistic. I am not certain. 

7. Conclusion

 Diana's observation about that contractor never reached security. The 47-dropdown form defeated her. That form exists because we designed our systems for our convenience, not for the humans we need information from. The framework in this paper attempts to fix that specific failure. Accept reports through channels employees already use. Parse them with NLP so employees do not need security expertise to communicate effectively. Score them contextually so analysts focus on what matters. None of this is technically novel. The novelty, if any, is in assembly and application. My preliminary numbers suggest 90% time reduction from intake to triage. Maybe production deployment confirms that. Maybe it reveals problems I have not anticipated. Either outcome advances our understanding. The core argument stands regardless: we have systematically underinvested in making it easy for humans to tell us what they see. Every employee is a potential sensor. We just need to build pipes that do not require a security degree to use. 

References

[1] Verizon, "2024 Data Breach Investigations Report," 2024. 

[2] Ponemon Institute, "Cost of a Data Breach Report 2024," IBM Security, 2024. 

[3] M. Souppaya and K. Scarfone, "Guide to Enterprise Patch Management," NIST SP 800-40r4, 2022.

[4] FIRST, "CVSS v3.1 Specification," 2019.

[5] M. Ferrag et al., "Deep learning for cyber security intrusion detection," J. Inf. Sec. Appl., vol. 50, 2020. 

[6] CISA, "Coordinated Vulnerability Disclosure Process," 2024. 

[7] ISO/IEC 30111:2019, "Vulnerability handling processes." 

[8] S. Ransbotham et al., "Reshaping Business with AI," MIT Sloan Mgmt. Rev., 2017.

[9] A. Khraisat et al., "Survey of intrusion detection systems," Cybersecurity, vol. 2, 2019.

[10] T. Wagner et al., "Cyber threat intelligence sharing," Comput. Secur., vol. 87, 2019.

[11] M. Humayun et al., "IoT and ransomware," Egyptian Inform. J., vol. 22, 2021.

[12] R. Heartfield and G. Loukas, "Cyber-physical threats in smart homes," Comput. Secur., vol. 78, 2018. 

[13] S. Samtani et al., "SCADA vulnerability assessment," Proc. IEEE ISI, 2016. 

[14] J. Devlin et al., "BERT: Pre-training of deep bidirectional transformers," NAACL, 2019. 

[15] K. Scarfone and P. Mell, "Guide to IDPS," NIST SP 800-94, 2007. 

[16] Bugcrowd, "Inside the Mind of a Hacker," 2024. 

[17] HackerOne, "Hacker-Powered Security Report," 2023. 

[18] Tenable, "Vulnerability Management Best Practices," 2024. 

[19] Rapid7, "InsightVM Documentation," 2024.