Key Features
Multi-Format Support
PDF (PyMuPDF + PyPDF fallback), DOCX, TXT, CSV, and 50+ programming language file types.
Stream-First API
Accepts file paths, BytesIO, SpooledTemporaryFile, or raw bytes. No temporary files required.
Optional Logging
Built-in loguru support with automatic fallback to stdlib logging. Install the textspitter[logging] extra to enable loguru.
CLI Included
Run uv tool install textspitter to get the textspitter command-line tool, which is handy for quick extractions.
CI/CD Ready
Automated testing across Python 3.12–3.14 with GitHub Actions. Published to PyPI automatically.
Well Tested
~80 pytest tests covering all readers, input types, and edge cases. 89%+ code coverage.
Supported Formats
| Format | Method | Notes |
|---|---|---|
| PDF | pdf_file_read() | PyMuPDF → PyPDF fallback for compatibility |
| DOCX | docx_file_read() | python-docx paragraph extraction |
| TXT | text_file_read() | UTF-8 → latin-1 → UTF-8-replace encoding cascade |
| CSV | csv_file_read() | Same encoding cascade as TXT |
| Source Code | code_file_read() | 50+ extensions (py, js, ts, go, rs, java, ruby, php, …) |
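The encoding cascade listed for TXT and CSV can be pictured as a series of decode attempts. The sketch below is an illustration under assumptions, not the library's actual code: `decode_with_cascade` is a hypothetical helper name, and the real reader may order or handle errors differently.

```python
def decode_with_cascade(data: bytes) -> str:
    """Decode bytes by trying encodings in order (hypothetical helper)."""
    # Step 1: strict UTF-8 covers most modern text files.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # Step 2: latin-1 maps every byte to a code point, so it accepts any input.
    try:
        return data.decode("latin-1")
    except UnicodeDecodeError:
        pass
    # Step 3: safety net; invalid bytes become U+FFFD instead of raising.
    return data.decode("utf-8", errors="replace")


print(decode_with_cascade("café".encode("utf-8")))    # decoded as UTF-8
print(decode_with_cascade("café".encode("latin-1")))  # falls through to latin-1
```

Because latin-1 accepts every byte sequence, the final step in this sketch acts only as a safety net. Text in other legacy encodings will decode without an error but may contain wrong characters.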
Quick Start
Install
pip install textspitter
# With optional loguru logging
pip install "textspitter[logging]"
Extract Text
from TextSpitter import TextSpitter
# From a file
text = TextSpitter(filename="report.pdf")
# From a stream
from io import BytesIO
text = TextSpitter(file_obj=BytesIO(pdf_bytes), filename="report.pdf")
# From raw bytes
text = TextSpitter(file_obj=docx_bytes, filename="contract.docx")
Use the CLI
# Single file to stdout
textspitter report.pdf
# Multiple files to combined output
textspitter chapter1.pdf chapter2.pdf -o book.txt
Documentation
Choose your path based on what you need:
Quick Start
Install and run your first extraction in under 2 minutes.
Technical Overview
Architecture, module design, and implementation details.
Tutorial
Format-by-format walkthrough with real examples.
Use Cases
FastAPI, S3, LangChain, batch processing patterns.
Recipes
Copy-paste snippets for common tasks.
API Reference
Complete API documentation with all classes and methods.
FAQ
Why TextSpitter?
TextSpitter normalizes diverse input types (file paths, streams, bytes) into plain strings. It's ideal for LLM pipelines, search engines, and data-processing workflows that need clean text extraction without the complexity of framework-specific adapters.
What's the difference between TextSpitter and python-docx / PyMuPDF?
TextSpitter is a format dispatcher that handles PDF, DOCX, TXT, CSV, and source code files with a single API. It wraps libraries like PyMuPDF and python-docx, adding automatic format detection, fallback chains, and stream support. You get consistency across all formats.
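To make the "format dispatcher" idea concrete, here is a minimal sketch of extension-based dispatch. The reader names are taken from the Supported Formats table. The `READERS` mapping, the `CODE_EXTENSIONS` set, and the `pick_reader` helper are assumptions made for illustration and are not TextSpitter internals.

```python
from pathlib import Path

# Reader names come from the Supported Formats table; the mapping is illustrative.
READERS = {
    ".pdf": "pdf_file_read",
    ".docx": "docx_file_read",
    ".txt": "text_file_read",
    ".csv": "csv_file_read",
}
# A small, assumed subset of the 50+ source code extensions.
CODE_EXTENSIONS = {".py", ".js", ".ts", ".go", ".rs", ".java", ".php"}


def pick_reader(filename: str) -> str:
    """Return the name of the reader that would handle this file (hypothetical)."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match on the extension
    if suffix in READERS:
        return READERS[suffix]
    if suffix in CODE_EXTENSIONS:
        return "code_file_read"
    raise ValueError(f"Unsupported file type: {suffix or filename}")


print(pick_reader("Report.PDF"))
print(pick_reader("main.rs"))
```

This is also why the stream and bytes examples pass `filename` alongside `file_obj`: a dispatcher of this kind needs some hint about the format when there is no path to inspect.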
Does it support scanned PDFs?
No. TextSpitter extracts embedded text layers only, so a scanned PDF with no text layer returns an empty string. If you need text from scans, run OCR first (e.g., pytesseract) and use the OCR output directly.
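One way to handle scans is to check for an empty result and hand the file to OCR only in that case. The sketch below injects both steps as callables so it runs without TextSpitter or an OCR engine installed. `extract_or_ocr`, `fake_extract`, and `fake_ocr` are hypothetical names, and treating whitespace-only output as empty is an assumption.

```python
from typing import Callable


def extract_or_ocr(
    path: str,
    extract: Callable[[str], str],
    ocr: Callable[[str], str],
) -> str:
    """Return extracted text, falling back to OCR when nothing was extracted."""
    text = extract(path)
    if text.strip():  # assumption: whitespace-only output counts as "no text layer"
        return text
    return ocr(path)


# Stand-ins for demonstration; swap in a real extractor and OCR step.
def fake_extract(path: str) -> str:
    return ""  # simulates a scanned PDF with no text layer


def fake_ocr(path: str) -> str:
    return "text recovered by OCR"


print(extract_or_ocr("scan.pdf", fake_extract, fake_ocr))
```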
How do I enable logging?
Install with pip install "textspitter[logging]" to enable loguru. Without it, TextSpitter falls back to stdlib logging automatically. See the Recipes section for configuration examples.
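The optional-dependency fallback described here follows a common Python pattern, sketched below. This shows the general technique and is not necessarily how TextSpitter implements it. The logger name `textspitter_demo` is a made-up placeholder.

```python
import logging

try:
    # Available when the optional extra is installed.
    from loguru import logger
except ImportError:
    # Fall back to the standard library; the logger name is a placeholder.
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("textspitter_demo")

# Both loguru's logger and a stdlib Logger expose .info(), .warning(), etc.
logger.info("extraction started")
```

Calling code can use the same `logger.info(...)` style either way, as long as it sticks to the methods both loggers share.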
Can I use TextSpitter in async code?
TextSpitter's main API is synchronous. For async workflows, wrap extraction in asyncio.to_thread() or use a thread pool executor.
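A minimal sketch of the `asyncio.to_thread()` approach follows. `blocking_extract` is a hypothetical stand-in for a synchronous call such as `TextSpitter(filename=name)`, so the example runs without the library installed.

```python
import asyncio
import time


def blocking_extract(name: str) -> str:
    # Hypothetical stand-in for a synchronous extraction call.
    time.sleep(0.05)
    return f"text from {name}"


async def extract_many(names: list[str]) -> list[str]:
    # to_thread runs each blocking call in a worker thread;
    # gather returns results in the same order as the inputs.
    tasks = (asyncio.to_thread(blocking_extract, n) for n in names)
    return await asyncio.gather(*tasks)


results = asyncio.run(extract_many(["a.pdf", "b.docx"]))
print(results)
```

The event loop stays responsive while extraction runs in threads. For CPU-heavy workloads, a process pool may be worth considering instead.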