sift 0.3.3 (Python) / 0.2.0 (Rust) — intelligent structured extraction from resumes. Runs entirely offline, no API keys required. MIT licensed.
The 30-year-old problem nobody fixed
Resume screening is one of those processes that should have been fully automated a long time ago — and somehow never was.
Every applicant tracking system, every recruiting tool, every "automated screening" pipeline has the same dirty secret underneath: a fragile collection of pattern-matching rules and heuristics held together by years of patches and workarounds. "If this line starts with capitalized words and doesn't contain an @ sign, it's probably a name. Unless it's a section header. Unless the candidate wrote it in all caps. Unless—"
Anyone who has ever tried to extract name, email, work experience, and skills from a real-world resume knows the drill. Resumes come in a hundred different layouts. Some put dates on the left, some on the right, some omit them entirely. Skills appear as bullet points, tables, paragraphs, or sidebars. Education sometimes comes before experience, sometimes after. International phone formats. Creative typography. Two-column layouts that turn into gibberish when processed by standard tools.
Recruiting teams have been dealing with this for decades. And the tools keep failing the moment a candidate submits a resume that looks slightly different.
Why traditional resume parsers fail:
- Rigid pattern matching can't handle the enormous variety of resume layouts
- Section detection is unreliable and breaks on unfamiliar formats
- International resumes multiply the edge cases, starting with phone formats
- Every "unusual" resume requires another manual patch
- Bullet points, paragraphs, and tables contain the same data but confuse different parsers
- The document conversion process itself loses information and scrambles the order
Resumes are inherently unstructured. We were trying to solve an unstructured problem with rigid tools. That mismatch is the bug.
Why we built Sift
We needed reliable resume data for an internal hiring tool. We tried the existing open-source parsers. We tried commercial resume parsing services. The results were either inaccurate or expensive, and, crucially, none of them could run entirely on our own infrastructure for privacy-sensitive hiring workloads.
So we asked a different question: what if we stopped trying to write rules and instead taught a system to read resumes the way a human does?
Modern AI language models — even small ones — can read a document and extract structured information with surprisingly good accuracy. They handle layout variation naturally because they read like a person does: top to bottom, using context, ignoring decorative elements. They don't care whether dates are on the left or right. They don't break when someone uses a sidebar.
The constraint we set for ourselves: no external service required by default. If you can't run it offline, on a laptop, with no internet connection, it doesn't qualify. Privacy isn't a premium feature — it's the baseline.
So we built Sift.
A high-performance core in Rust (for speed and efficiency), a Python interface (because that's where most hiring tools are built), and a built-in AI model that downloads automatically the first time you use it.
The name captures the idea: sifting through documents to find what matters.
What Sift actually is
Sift is an intelligent, rule-free document extractor — a focused tool that uses a small AI model to turn resumes into structured, typed data.
The process is intentionally simple:
- Load — accepts PDF, DOCX, HTML, or plain text resumes
- Analyze — the AI model reads the entire document contextually, guided by a schema that defines what fields to extract
- Extract — the model produces structured data (name, email, experience, skills, education, etc.)
- Return — a clean, typed record ready for your hiring pipeline
That's the whole process. No pattern matching. No section detection heuristics. No format-specific logic. No brittle rules that break on the next unusual resume.
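To make the four steps concrete, here is a minimal sketch of a schema-guided pipeline. It is not Sift's internal code: `RESUME_SCHEMA`, `load`, `build_prompt`, `extract`, and `fake_model` are all hypothetical names, and the model is replaced by a stub so the example runs without downloading anything.

```python
import json
from typing import Callable

# Hypothetical schema: field name -> hint telling the model what to look for.
RESUME_SCHEMA = {
    "name": "the candidate's full name",
    "email": "primary email address",
    "skills": "list of skills as strings",
}


def load(raw_text: str) -> str:
    """Load: a real loader would convert PDF/DOCX/HTML to text first.
    This stand-in only drops blank lines and surrounding whitespace."""
    return "\n".join(line.strip() for line in raw_text.splitlines() if line.strip())


def build_prompt(document: str, schema: dict) -> str:
    """Analyze: describe the wanted fields so the model reads with a goal."""
    fields = "\n".join(f"- {key}: {hint}" for key, hint in schema.items())
    return f"Return a JSON object with these fields:\n{fields}\n\nDocument:\n{document}"


def extract(raw_text: str, schema: dict, model: Callable[[str], str]) -> dict:
    """Extract and return: call the model, parse its JSON, keep schema fields only."""
    reply = model(build_prompt(load(raw_text), schema))
    data = json.loads(reply)
    # Fields the model did not return become None instead of raising KeyError.
    return {key: data.get(key) for key in schema}


def fake_model(prompt: str) -> str:
    """Stand-in for a local language model so the sketch runs without one."""
    return json.dumps({"name": "Ada Lovelace", "email": "ada@example.com", "phone": "n/a"})


record = extract("  Ada Lovelace \n\n ada@example.com ", RESUME_SCHEMA, fake_model)
print(record)
```

The stub returns a `phone` field the schema never asked for and omits `skills`. The result drops the first and fills the second with `None`, which is the practical meaning of "guided by a schema".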
Runs entirely offline, by default
The headline feature: once the model is on disk, Sift works without any internet connection. It uses a compact AI model that downloads once, on first use, and then runs entirely on your machine. No API keys. No cloud services. No data leaving your infrastructure.
Why does this matter? Because resumes contain personal information: names, addresses, phone numbers, employment history. Sending that data to a third-party service for parsing creates privacy and compliance risks. By default, Sift keeps everything local.
When you're ready to scale, you can optionally connect it to cloud-based AI models for higher throughput:
```python
from sift import Extractor

# Use OpenAI's models instead of local
extractor = Extractor(model="openai:gpt-4o-mini")

# Use your own private AI endpoint
extractor = Extractor(model="openai-compatible:https://api.my-provider.com/v1")
```
The same interface, whether you're running on a laptop or scaling across a cluster.
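One way to keep local-first as the default while letting a deployment opt in to a remote model is to read the model string from the environment. The sketch below is an assumption, not part of Sift: the `SIFT_MODEL` variable and the `extractor_kwargs` helper are hypothetical, and only the `model=` argument comes from the examples above.

```python
import os
from typing import Mapping, Optional


def extractor_kwargs(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build keyword arguments for Extractor from the environment.

    SIFT_MODEL is a hypothetical variable name. When it is unset or blank,
    no arguments are returned, so Extractor() falls back to its local default.
    """
    env = os.environ if env is None else env
    spec = env.get("SIFT_MODEL", "").strip()
    return {"model": spec} if spec else {}


# Intended use (requires the sift package):
#   from sift import Extractor
#   extractor = Extractor(**extractor_kwargs())
print(extractor_kwargs({}))
print(extractor_kwargs({"SIFT_MODEL": "openai:gpt-4o-mini"}))
```

With this arrangement, sending resume data off the machine requires someone to set the variable on purpose, which fits the privacy-by-default goal.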
How it works
Get started in seconds (Python)
Install the package and hand it a resume file:
```shell
pip install sift
```

```python
from sift import Extractor

# Works immediately; the AI model downloads automatically on first use
extractor = Extractor()
resume = extractor.extract_resume("resume.pdf")

print(resume["name"])
print(resume["email"])
for job in resume["experience"]:
    print(f"{job['role']} at {job['company']}")
```
That's a complete, intelligent resume parser. No configuration. No training. No fragile rules.
What it extracts
Sift produces a comprehensive structured record from every resume:
| Field | What it captures |
|---|---|
| Personal info | Name, email, phone, location, LinkedIn, GitHub, website |
| Professional summary | The candidate's self-description |
| Work experience | Company, role, dates, key accomplishments |
| Education | Institution, degree, field of study, dates |
| Skills | Organized by category or listed individually |
| Projects | Name, technologies used, links |
| Certifications | Professional certifications and credentials |
| Languages | Languages spoken with proficiency levels |
Every field is optional — if a resume doesn't include certifications, that section is simply empty. The AI model understands what to look for and what to skip.
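Because any field can be missing, downstream code is safer reading the record defensively. Here is a small sketch, assuming the record behaves like a plain dict and that an absent field may appear as a missing key, `None`, or an empty list. The exact representation is an assumption, and `summarize` is a hypothetical helper.

```python
def summarize(resume: dict) -> str:
    """Build a short text summary without assuming any field is present."""
    name = resume.get("name") or "(no name found)"
    email = resume.get("email") or "(no email found)"
    lines = [f"{name} <{email}>"]
    # `or []` covers a missing key, None, and an empty list alike.
    for job in resume.get("experience") or []:
        role = job.get("role") or "unknown role"
        company = job.get("company") or "unknown company"
        lines.append(f"- {role} at {company}")
    return "\n".join(lines)


print(summarize({"name": "Ada Lovelace", "experience": [{"role": "Engineer"}]}))
```

The same pattern extends to the other sections in the table, whatever key names the record uses for them.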
Extract anything, not just resumes
While resume parsing is the default use case, Sift's extraction engine works with any document structure. Need to pull just contact information from a pile of business cards? Extract publication lists from academic CVs? Pull invoice data from vendor documents? Define what you need and the same engine handles it.
What's next
Sift is small, focused, and intentionally narrow — but the natural extensions are clear:
- Confidence scores — indicate how certain the extraction is for each field, so your team knows what to verify manually
- Domain-specific templates — pre-configured extraction profiles for academic CVs, technical resumes, government applications
- Batch processing — extract data from an entire folder of resumes in one run
- Streaming extraction — start seeing results as the model processes each section
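Until built-in batch processing exists, a plain loop covers a folder. The sketch below is one possible shape, not a planned API: `extract_folder` and the suffix list are assumptions, and the extraction function is passed in so the same loop works with `Extractor().extract_resume` or with a stub.

```python
from pathlib import Path
from typing import Callable, Dict, Tuple

# Assumed file suffixes for the supported formats (PDF, DOCX, HTML, plain text).
SUFFIXES = {".pdf", ".docx", ".html", ".htm", ".txt"}


def extract_folder(
    folder: str, extract_one: Callable[[str], dict]
) -> Tuple[Dict[str, dict], Dict[str, str]]:
    """Run extract_one on every supported file directly inside the folder.

    Returns (results, errors), both keyed by file name, so a single
    unreadable resume does not abort the whole run.
    """
    results: Dict[str, dict] = {}
    errors: Dict[str, str] = {}
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.suffix.lower() not in SUFFIXES:
            continue
        try:
            results[path.name] = extract_one(str(path))
        except Exception as exc:  # record the failure and keep going
            errors[path.name] = str(exc)
    return results, errors


# Intended use (requires the sift package):
#   from sift import Extractor
#   results, errors = extract_folder("resumes/", Extractor().extract_resume)
```

Collecting errors separately gives the hiring team a list of files to check by hand instead of a crashed job.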
If you've ever watched your hiring team lose good candidates because their resume was "too unusual" for the parser, or spent hours manually entering resume data into your ATS, Sift is for you.
Get started in one line — pick your language:
```shell
# Python
pip install sift

# Rust
cargo add sift
```
No API key. No cloud service. The AI model downloads automatically on first use. The source is on GitHub — issues and PRs welcome.