myHR — Enterprise RAG Chatbot for Internal HR Policy
A production-grade Retrieval-Augmented Generation (RAG) system that gives 14,000+ employees instant, grounded answers to HR policy questions. It replaces 1–2 day email wait times with sub-second responses, achieves >90% answer accuracy on an internal evaluation set, and enforces business-unit and country-level access control at the retrieval layer.
1. The Problem
A large Sri Lankan conglomerate with 14,000+ employees across multiple business verticals had a persistent but invisible productivity drain: HR policy knowledge was locked inside dozens of policy PDFs, benefits documents, holiday calendars, and insurance plan files spread across S3 and Microsoft SharePoint. Employees asking routine questions — 'What are our paid holidays?', 'How does parental leave work?', 'What are my health insurance options?' — waited 1–2 days for email replies from HR staff who spent the majority of their time re-answering the same questions. There was no searchable, access-controlled, conversational layer over these documents. Standard keyword search failed because questions were natural language and documents were unstructured. The HR team could not scale policy communication as headcount and business units grew, and sensitive policy documents needed to remain visible only to employees in the relevant country and business unit.
2. The Approach
We designed and implemented a two-pipeline RAG architecture on AWS.
An offline batch ingestion pipeline (ECS Fargate) pulls HR documents from S3 and SharePoint, extracts and cleans text, chunks documents into 200–300 token windows with overlap, generates 1536-dimensional embeddings via Amazon Titan Embeddings v2, and bulk-indexes chunks with metadata (business unit, country, ACL groups) into Amazon OpenSearch's k-NN vector engine.
A real-time query pipeline handles each employee request. The authenticated query passes through API Gateway and AWS WAF to an AWS Lambda orchestrator, which embeds the question with Titan, runs a k-NN search on OpenSearch with RBAC metadata filters that restrict results to the user's business unit and country, constructs a grounded prompt from the top-K retrieved chunks, and calls Claude 3.7 Sonnet on Amazon Bedrock for answer generation.
The two pipelines share the same OpenSearch index, which keeps infrastructure lean while ingestion stays fully decoupled from query-time logic. The implementation on GitHub uses FastAPI, GPT-4o, and Qdrant as a portable local stack that mirrors the production AWS design.
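The chunking step of the ingestion pipeline can be sketched as a sliding window over tokens. This is a minimal illustration, not the project's actual code: the function names are hypothetical, whitespace splitting stands in for a real tokenizer, and the 50-token overlap is an assumed value (the write-up only says "with overlap").

```python
from typing import List


def chunk_tokens(tokens: List[str], window: int = 250, overlap: int = 50) -> List[List[str]]:
    """Split a token list into windows of `window` tokens sharing `overlap` tokens."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    step = window - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        # Stop once a window has reached the end, so no tail-only duplicate is emitted.
        if start + window >= len(tokens):
            break
    return chunks


def chunk_text(text: str, window: int = 250, overlap: int = 50) -> List[str]:
    """Chunk raw text; whitespace splitting is an assumed stand-in for a tokenizer."""
    return [" ".join(chunk) for chunk in chunk_tokens(text.split(), window, overlap)]
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of indexing some tokens twice.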
Technical Architecture
Data Sources: HR policy PDFs, benefits documents, holiday calendars, and SOPs stored in Amazon S3; internal HR content and documents stored in Microsoft SharePoint across business units
Offline Ingestion Pipeline (ECS Fargate): Batch job that pulls documents from S3 and SharePoint, extracts and cleans text (PDF/DOCX/HTML → plain text, strips boilerplate), chunks into 200–300 token windows with overlap, generates 1536-dim embeddings via Amazon Titan Embeddings v2, and bulk-indexes into OpenSearch with full metadata (BU, country, ACL groups, doc_id, chunk_id)
Vector Index (Amazon OpenSearch): Shared k-NN vector index used by both pipelines; stores chunk_text, embedding vectors, and metadata; supports hybrid search (semantic + lexical) and fine-grained RBAC/ACL metadata filtering per query
Real-Time Query Pipeline (AWS Lambda): On each employee query — embed question with Titan, run k-NN search on OpenSearch with RBAC metadata filters, select top-K chunks, construct grounded prompt, call Claude 3.7 Sonnet on Bedrock, return answer to UI; all latency and token usage logged
Security Layer: AWS WAF + API Gateway for rate limiting and API protection; Company SSO (SAML 2.0) for employee authentication; RBAC/ACL metadata filters at retrieval time ensure users only see policies relevant to their business unit and country
Frontend: Internal employee portal web UI with React chat interface; authenticated via company SSO / employee ID; thumbs up/down feedback per answer feeds the evaluation data lake
Observability & Evaluation: CloudWatch for live logs, metrics, and alarms; S3 Data Lake for evaluation logs using LLM-as-Judge pattern; offline golden-set evaluation measuring MRR, Recall@k, correctness, faithfulness, and coverage; online metrics for bot answer rate, HR ticket volume, and user feedback
Local / Portable Implementation Stack: FastAPI RAG backend, GPT-4o for generation, Qdrant as vector DB, React + Vite frontend — folder-based metadata (BU/country) mirrors production RBAC design
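The retrieval-time access filter described in the architecture above could look roughly like the following OpenSearch request body. This is a sketch under assumptions: the field names (`embedding`, `business_unit`, `country`, `acl_groups`) and the helper name are hypothetical, and the `filter` clause inside `knn` relies on OpenSearch's efficient k-NN filtering, whose availability depends on the engine and version in use.

```python
from typing import Any, Dict, List


def build_knn_query(
    query_vector: List[float],
    business_unit: str,
    country: str,
    acl_groups: List[str],
    k: int = 5,
) -> Dict[str, Any]:
    """Build a k-NN search body that only considers chunks the user may see."""
    # All three conditions must hold; an empty acl_groups list matches nothing,
    # so a user with no groups gets no results rather than unfiltered results.
    access_filter = {
        "bool": {
            "filter": [
                {"term": {"business_unit": business_unit}},
                {"term": {"country": country}},
                {"terms": {"acl_groups": acl_groups}},
            ]
        }
    }
    return {
        "size": k,
        "_source": ["chunk_text", "doc_id", "chunk_id"],
        "query": {
            "knn": {
                "embedding": {  # assumed name of the vector field
                    "vector": query_vector,
                    "k": k,
                    "filter": access_filter,
                }
            }
        },
    }
```

With the `opensearch-py` client, a body like this would be passed as `client.search(index=..., body=body)`. Because the filter is applied during the vector search, restricted chunks are never candidates for the top-K and so never reach the prompt.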
Results
Achieved >90% answer accuracy on an internal HR golden evaluation set validated by HR reviewers
Reduced employee wait time from 1–2 days (email) to sub-second (chatbot) for all routine HR policy queries
HR team's repetitive workload dropped significantly, freeing staff for higher-value tasks: case handling, policy design, and employee engagement
Retrieval pipeline tuning lifted MRR from 0.62 to 0.74 and Recall@5 from 0.80 to 0.91 on the golden evaluation set
RBAC/ACL metadata filtering at the OpenSearch layer ensures zero cross-BU or cross-country policy leakage
Reusable RAG stack: the same OpenSearch index architecture can be extended to IT helpdesk, finance FAQs, or any new knowledge domain by adding indices and metadata — no pipeline rebuild required
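For readers unfamiliar with the retrieval metrics quoted above, the sketch below shows one common way to compute MRR and Recall@k over a golden set. The function names and the toy data are illustrative assumptions; the project may define recall differently (for example as a per-query hit rate).

```python
from typing import List, Sequence, Set


def reciprocal_rank(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant chunk, or 0.0 if none is retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Return the fraction of relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def mean(values: List[float]) -> float:
    """Average a list of per-query scores; an empty list scores 0.0."""
    return sum(values) / len(values) if values else 0.0


# Toy golden set: (retrieved chunk ids in rank order, ids judged relevant).
golden = [
    (["c1", "c2", "c3"], {"c2"}),
    (["c9", "c8", "c7"], {"c7", "c4"}),
]
mrr = mean([reciprocal_rank(ranked, rel) for ranked, rel in golden])
recall_2 = mean([recall_at_k(ranked, rel, k=2) for ranked, rel in golden])
```

In this toy set the first relevant chunk appears at rank 2 and rank 3, so MRR is (1/2 + 1/3) / 2, and only the first query has a relevant chunk in its top two.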
Key Insights
“Decoupling the ingestion pipeline from the query pipeline is not just a nice architectural principle — it's operationally essential. Being able to re-chunk, re-embed, and re-index without touching the live chatbot means you can safely tune retrieval quality without any downtime or risk to production.”
“RBAC at the retrieval layer (metadata filters on OpenSearch) is the correct place to enforce access control in a RAG system — not in the LLM prompt. Prompt-level 'don't reveal this' instructions are not a security boundary; metadata filters that prevent documents from being retrieved in the first place are.”
“Building a golden evaluation set before going live is what separates a production RAG system from a demo. Tracking MRR and Recall@k independently of generation quality lets you isolate whether a quality regression is a retrieval problem or a prompt/LLM problem — two completely different fixes.”
“LLM-as-Judge evaluation (Bedrock scoring correctness, faithfulness, and coverage against golden answers) scales QA to hundreds of test cases without requiring HR reviewer time on every pipeline change — a force multiplier for iterating safely on a live system.”
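A minimal sketch of the LLM-as-Judge step might look like the following. Everything here is an assumption made for illustration: the prompt wording, the 1 to 5 scale, the JSON reply format, and the function names are not taken from the project. The model call itself is left out, so the code only builds the grading prompt and validates the judge's reply.

```python
import json
from typing import Dict

CRITERIA = ("correctness", "faithfulness", "coverage")


def build_judge_prompt(question: str, golden_answer: str, candidate_answer: str) -> str:
    """Assemble a grading prompt comparing a candidate answer with the golden answer."""
    return (
        "You are grading an HR policy chatbot answer against a reference answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {golden_answer}\n"
        f"Candidate answer: {candidate_answer}\n"
        "Score the candidate from 1 to 5 on correctness, faithfulness, and coverage.\n"
        'Reply with JSON only, for example {"correctness": 4, "faithfulness": 5, "coverage": 3}.'
    )


def parse_judge_scores(raw: str) -> Dict[str, int]:
    """Parse the judge's JSON reply and reject missing or out-of-range scores."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    scores: Dict[str, int] = {}
    for name in CRITERIA:
        value = data.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool) or not 1 <= value <= 5:
            raise ValueError(f"invalid score for {name}: {value!r}")
        scores[name] = value
    return scores
```

Validating the reply strictly matters because a judge model can return prose or partial JSON; failing loudly on those cases keeps bad scores out of the evaluation logs.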