TextSpitter

✨ Key Features

📄 Multi-Format Support

PDF (PyMuPDF + PyPDF fallback), DOCX, TXT, CSV, and 50+ programming language file types.

🔌 Stream-First API

Accepts file paths, BytesIO, SpooledTemporaryFile, or raw bytes. No temporary files required.

🛠️ Optional Logging

Optional loguru support with automatic fallback to stdlib logging. Install the `textspitter[logging]` extra to enable loguru.

🖥️ CLI Included

Run `uv tool install textspitter` to install the command-line tool. Perfect for quick extractions.

🚀 CI/CD Ready

Automated testing across Python 3.12–3.14 with GitHub Actions. Published to PyPI automatically.

🧪 Well Tested

~80 pytest tests covering all readers, input types, and edge cases. 89%+ code coverage.

📋 Supported Formats

Format	Method	Notes
PDF	`pdf_file_read()`	PyMuPDF → PyPDF fallback for compatibility
DOCX	`docx_file_read()`	python-docx paragraph extraction
TXT	`text_file_read()`	UTF-8 → latin-1 → UTF-8-replace encoding cascade
CSV	`csv_file_read()`	Same encoding cascade as TXT
Source Code	`code_file_read()`	50+ extensions (py, js, ts, go, rs, java, ruby, php, …)
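The encoding cascade listed for TXT and CSV can be pictured with a small standalone sketch. This is an illustration of the general technique, not TextSpitter's actual implementation; the function name `decode_with_cascade` is hypothetical.

```python
def decode_with_cascade(data: bytes) -> str:
    """Try strict UTF-8, then latin-1, then UTF-8 with replacement characters."""
    for encoding in ("utf-8", "latin-1"):
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Safety net. latin-1 maps every possible byte value, so this line is only
    # reached if the encodings above are swapped for ones that can fail.
    return data.decode("utf-8", errors="replace")


# Valid UTF-8 is decoded on the first attempt.
print(decode_with_cascade("café".encode("utf-8")))

# These bytes are invalid UTF-8, so the latin-1 step handles them.
print(decode_with_cascade("café".encode("latin-1")))
```

Trying strict UTF-8 first matters: latin-1 never raises an error, so placing it first would silently mis-decode valid UTF-8 input.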

⚡ Quick Start

Install

pip install textspitter

# With optional loguru logging
pip install "textspitter[logging]"

Extract Text

from io import BytesIO

from TextSpitter import TextSpitter

# From a file path
text = TextSpitter(filename="report.pdf")

# From a stream (pdf_bytes holds the raw bytes of a PDF loaded elsewhere)
text = TextSpitter(file_obj=BytesIO(pdf_bytes), filename="report.pdf")

# From raw bytes (docx_bytes holds the raw bytes of a DOCX file)
text = TextSpitter(file_obj=docx_bytes, filename="contract.docx")
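To show what accepting paths, streams, and raw bytes through one entry point can involve, here is a minimal sketch of input normalization using only the standard library. It is an assumption about the general approach, not TextSpitter's code; `read_all_bytes` and the `Source` alias are hypothetical names.

```python
from io import BytesIO
from pathlib import Path
from tempfile import SpooledTemporaryFile
from typing import BinaryIO, Union

# Hypothetical alias covering the input kinds named in this README.
Source = Union[str, Path, bytes, BinaryIO]


def read_all_bytes(source: Source) -> bytes:
    """Return the full content of a path, a binary stream, or raw bytes."""
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    if isinstance(source, (str, Path)):
        return Path(source).read_bytes()
    # File-like object (BytesIO, SpooledTemporaryFile, ...): rewind, then read.
    source.seek(0)
    return source.read()


print(read_all_bytes(b"raw bytes"))
print(read_all_bytes(BytesIO(b"from a stream")))

with SpooledTemporaryFile() as spooled:
    spooled.write(b"from a spooled file")
    print(read_all_bytes(spooled))
```

Rewinding before reading means a stream that was just written to (as with an upload buffer) still yields its full content.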

Use the CLI

# Single file to stdout
textspitter report.pdf

# Multiple files combined into one output file
textspitter chapter1.pdf chapter2.pdf -o book.txt

📚 Documentation

Choose your path based on what you need:

🎯 Quick Start

Install and run your first extraction in under 2 minutes.

🔍 Technical Overview

Architecture, module design, and implementation details.

📖 Tutorial

Format-by-format walkthrough with real examples.

💼 Use Cases

FastAPI, S3, LangChain, batch processing patterns.

⚙️ API Reference

Complete API documentation with all classes and methods.

❓ FAQ

Why TextSpitter?

TextSpitter normalizes diverse input types (file paths, streams, bytes) into plain strings. It's ideal for LLM pipelines, search engines, and data-processing workflows that need clean text extraction without the complexity of framework-specific adapters.

What's the difference between TextSpitter and python-docx / PyMuPDF?

TextSpitter is a format dispatcher that handles PDF, DOCX, TXT, CSV, and source code files with a single API. It wraps libraries like PyMuPDF and python-docx, adding automatic format detection, fallback chains, and stream support. You get consistency across all formats.
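The dispatcher idea can be sketched as a lookup from file extension to reader name. The reader names come from the Supported Formats table; the mapping itself, the subset of code extensions, and the function `pick_reader` are illustrative assumptions rather than the library's internals.

```python
from pathlib import Path

# Assumed extension-to-reader mapping, using the method names from the table.
READER_BY_EXTENSION = {
    ".pdf": "pdf_file_read",
    ".docx": "docx_file_read",
    ".txt": "text_file_read",
    ".csv": "csv_file_read",
}

# A small subset of the source-code extensions mentioned in this README.
CODE_EXTENSIONS = {".py", ".js", ".ts", ".go", ".rs", ".java", ".php"}


def pick_reader(filename: str) -> str:
    """Return the name of the reader that would handle this filename."""
    extension = Path(filename).suffix.lower()
    if extension in READER_BY_EXTENSION:
        return READER_BY_EXTENSION[extension]
    if extension in CODE_EXTENSIONS:
        return "code_file_read"
    raise ValueError(f"Unsupported file type: {filename!r}")


print(pick_reader("report.PDF"))
print(pick_reader("main.rs"))
```

This also suggests why the stream examples pass `filename` alongside `file_obj`: a bare stream carries no extension to dispatch on.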

Does it support scanned PDFs?

No. TextSpitter extracts embedded text layers only, so a scanned PDF with no text layer returns an empty string. For those files, run OCR first (e.g., pytesseract) and pass the resulting text directly to your pipeline.
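One way to wire in an OCR fallback is to check for an empty result and only then call the OCR step. The sketch below keeps both steps as injected callables so it runs without extra dependencies; `text_or_ocr` is a hypothetical helper, and in practice `extract` might wrap `TextSpitter(filename=...)` while `ocr` might wrap a pytesseract-based function of your own.

```python
from typing import Callable


def text_or_ocr(
    path: str,
    extract: Callable[[str], str],
    ocr: Callable[[str], str],
) -> str:
    """Return extracted text, falling back to OCR when extraction is empty."""
    text = extract(path)
    if text.strip():
        return text
    # Empty or whitespace-only result: assume there is no embedded text layer.
    return ocr(path)


# Stand-in callables for demonstration only.
def fake_extract(path: str) -> str:
    return "" if path.startswith("scan") else "embedded text"


def fake_ocr(path: str) -> str:
    return "text recovered by OCR"


print(text_or_ocr("report.pdf", fake_extract, fake_ocr))
print(text_or_ocr("scan_001.pdf", fake_extract, fake_ocr))
```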

How do I enable logging?

Install with `pip install "textspitter[logging]"` to enable loguru. Without it, TextSpitter falls back to stdlib logging automatically. See the Recipes section for configuration examples.
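The fallback behaviour follows a common optional-dependency pattern, sketched below. This shows the general technique and is not taken from TextSpitter's source; `get_logger` is a hypothetical helper.

```python
import logging


def get_logger(name: str = "textspitter"):
    """Return loguru's logger if it is installed, else a stdlib logger."""
    try:
        from loguru import logger  # optional dependency

        return logger
    except ImportError:
        # loguru is missing: fall back to the standard library.
        logging.basicConfig(level=logging.INFO)
        return logging.getLogger(name)


log = get_logger()
log.info("logger ready")
```

Both logger types expose `info`, `warning`, and `error`, so calling code does not need to know which one it received.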

Can I use TextSpitter in async code?

TextSpitter's main API is synchronous. For async workflows, wrap extraction in `asyncio.to_thread()` or use a thread pool executor.
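A minimal sketch of the `asyncio.to_thread()` approach follows. The blocking function here is a stand-in so the example runs on its own; in real code it would call something like `TextSpitter(filename=...)`. The names `extract_text` and `extract_many` are hypothetical.

```python
import asyncio
import time


def extract_text(filename: str) -> str:
    """Stand-in for a blocking extraction call."""
    time.sleep(0.01)  # simulate blocking work
    return f"text from {filename}"


async def extract_many(filenames: list[str]) -> list[str]:
    """Run each blocking extraction in a worker thread and await them all."""
    tasks = [asyncio.to_thread(extract_text, name) for name in filenames]
    # gather() returns results in the same order as the input tasks.
    return await asyncio.gather(*tasks)


results = asyncio.run(extract_many(["a.pdf", "b.docx"]))
print(results)
```

Because each call runs in a worker thread, the event loop stays free to serve other coroutines while extraction is in progress.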

Extract text from any document format