TextSpitter

Transform documents into insights, effortlessly and efficiently.

Extract text from any document format

A lightweight Python library for extracting text from PDFs, DOCX, TXT, CSV, and 50+ source code file types. Supports file paths, streams, and raw bytes.

📦 PyPI Package · 🐍 Python 3.12+ · ✨ Type Hints · 📚 Documented

✨ Key Features

📄 Multi-Format Support

PDF (PyMuPDF + PyPDF fallback), DOCX, TXT, CSV, and 50+ programming language file types.

🔌 Stream-First API

Accepts file paths, BytesIO, SpooledTemporaryFile, or raw bytes. No temporary files required.

🛠️ Optional Logging

Built-in loguru support with automatic fallback to stdlib logging. Install with textspitter[logging].

🖥️ CLI Included

Run uv tool install textspitter to install the textspitter command-line tool, which is handy for quick extractions.

🚀 CI/CD Ready

Automated testing across Python 3.12–3.14 with GitHub Actions. Published to PyPI automatically.

🧪 Well Tested

~80 pytest tests covering all readers, input types, and edge cases. 89%+ code coverage.

📋 Supported Formats

| Format | Method | Notes |
|---|---|---|
| PDF | pdf_file_read() | PyMuPDF → PyPDF fallback for compatibility |
| DOCX | docx_file_read() | python-docx paragraph extraction |
| TXT | text_file_read() | UTF-8 → latin-1 → UTF-8-replace encoding cascade |
| CSV | csv_file_read() | Same encoding cascade as TXT |
| Source Code | code_file_read() | 50+ extensions (py, js, ts, go, rs, java, ruby, php, …) |
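
The encoding cascade listed for TXT and CSV can be illustrated with a minimal standard-library sketch. This is not TextSpitter's actual implementation; the function name `decode_with_cascade` is hypothetical and only shows the order of attempts described in the table.

```python
def decode_with_cascade(data: bytes) -> str:
    """Decode bytes by trying progressively more permissive strategies."""
    # 1. Strict UTF-8: succeeds for well-formed UTF-8 (including plain ASCII).
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # 2. latin-1 maps every byte value to a code point, so this step
    #    accepts any input that reaches it.
    try:
        return data.decode("latin-1")
    except UnicodeDecodeError:
        pass
    # 3. Last resort: UTF-8 with undecodable bytes replaced by U+FFFD.
    return data.decode("utf-8", errors="replace")


utf8_text = decode_with_cascade("café".encode("utf-8"))
latin1_text = decode_with_cascade("café".encode("latin-1"))
```

Because latin-1 can decode any byte sequence, the final replace step in this sketch acts purely as a safety net; how the library itself orders or guards these steps may differ.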

⚡ Quick Start

Install

    pip install textspitter

    # With optional loguru logging
    pip install "textspitter[logging]"

Extract Text

    from io import BytesIO

    from TextSpitter import TextSpitter

    # From a file
    text = TextSpitter(filename="report.pdf")

    # From a stream
    text = TextSpitter(file_obj=BytesIO(pdf_bytes), filename="report.pdf")

    # From raw bytes
    text = TextSpitter(file_obj=docx_bytes, filename="contract.docx")

Use the CLI

    # Single file to stdout
    textspitter report.pdf

    # Multiple files to combined output
    textspitter chapter1.pdf chapter2.pdf -o book.txt


โ“ FAQ

Why TextSpitter?

TextSpitter normalizes diverse input types (file paths, streams, bytes) into plain strings. It's ideal for LLM pipelines, search engines, and data-processing workflows that need clean text extraction without the complexity of framework-specific adapters.
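
To make the idea of input normalization concrete, here is a rough standard-library sketch of how a path, a binary stream, or raw bytes could be reduced to one common form. The helper `to_bytes` and the `Source` alias are hypothetical names for illustration; TextSpitter's internals may work differently.

```python
import os
from io import BytesIO
from tempfile import SpooledTemporaryFile
from typing import BinaryIO, Union

Source = Union[str, os.PathLike, bytes, bytearray, BinaryIO]


def to_bytes(source: Source) -> bytes:
    """Return the raw bytes behind a path, a binary stream, or a bytes object."""
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    if isinstance(source, (str, os.PathLike)):
        with open(source, "rb") as fh:
            return fh.read()
    # Anything else is treated as a binary file-like object.
    source.seek(0)  # rewind in case the caller already wrote to or read from it
    return source.read()


payload = b"hello, world"

spooled = SpooledTemporaryFile()
spooled.write(payload)

from_bytes = to_bytes(payload)
from_stream = to_bytes(BytesIO(payload))
from_spooled = to_bytes(spooled)
spooled.close()
```

Once every input is plain bytes, a single extraction path can serve all callers, which is the convenience the answer above describes.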

What's the difference between TextSpitter and python-docx / PyMuPDF?

TextSpitter is a format dispatcher that handles PDF, DOCX, TXT, CSV, and source code files with a single API. It wraps libraries like PyMuPDF and python-docx, adding automatic format detection, fallback chains, and stream support. You get consistency across all formats.
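
As a rough illustration of what "format dispatcher" means, the sketch below picks a reader name from a filename extension. The reader names come from the Supported Formats table, but the mapping, the `pick_reader` helper, and the choice to return `None` for unknown extensions are assumptions made for this example, not the library's actual lookup logic.

```python
from pathlib import PurePath
from typing import Optional

# Reader names are taken from the Supported Formats table; the mapping itself
# is an illustrative assumption, not the library's actual lookup table.
READERS = {
    ".pdf": "pdf_file_read",
    ".docx": "docx_file_read",
    ".txt": "text_file_read",
    ".csv": "csv_file_read",
}
# A small subset of the source code extensions, for illustration only.
CODE_EXTENSIONS = {".py", ".js", ".ts", ".go", ".rs", ".java"}


def pick_reader(filename: str) -> Optional[str]:
    """Return the reader name for a filename, or None if unrecognised."""
    suffix = PurePath(filename).suffix.lower()
    if suffix in CODE_EXTENSIONS:
        return "code_file_read"
    return READERS.get(suffix)


pdf_reader = pick_reader("report.PDF")
code_reader = pick_reader("main.rs")
unknown_reader = pick_reader("archive.zip")
```

This also suggests why the Quick Start examples pass `filename` alongside a stream or bytes: without a name, an extension-based dispatcher has nothing to key on.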

Does it support scanned PDFs?

No. TextSpitter extracts embedded text layers only, so a scanned PDF with no text layer returns an empty string. For such files, run OCR first (e.g., with pytesseract) and use the OCR output as your text.
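
One way to combine the two paths is to try normal extraction first and fall back to OCR only when the result is empty. The sketch below shows that pattern with injected callables; `text_or_ocr` is a hypothetical helper, and the lambdas are stand-ins for a real extraction call and a real OCR call so the example runs without extra dependencies.

```python
from typing import Callable


def text_or_ocr(
    path: str,
    extract: Callable[[str], str],
    ocr: Callable[[str], str],
) -> str:
    """Use embedded text when present; otherwise fall back to OCR."""
    text = extract(path)
    if text.strip():
        return text
    # Empty or whitespace-only result: treat the document as scanned.
    return ocr(path)


# Stand-in callables so the sketch runs without any OCR dependency.
scanned = text_or_ocr("scan.pdf", extract=lambda p: "", ocr=lambda p: "ocr text")
digital = text_or_ocr(
    "report.pdf", extract=lambda p: "embedded text", ocr=lambda p: "unused"
)
```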

How do I enable logging?

Install with pip install "textspitter[logging]" to enable loguru. Without it, TextSpitter falls back to stdlib logging automatically. See the Recipes section for configuration examples.
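
The fallback behaviour follows a common optional-dependency pattern, sketched below. This is an illustration of the pattern rather than TextSpitter's own code, and the logger name "textspitter" is an assumption.

```python
import logging

try:
    # Available when the optional logging extra is installed.
    from loguru import logger
except ImportError:
    # Fall back to the standard library; the logger name is an assumption.
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("textspitter")

# Both loguru's logger and a stdlib Logger expose .info(), .warning(), etc.
logger.info("logger ready")
```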

Can I use TextSpitter in async code?

TextSpitter's main API is synchronous. For async workflows, wrap extraction in asyncio.to_thread() or use a thread pool executor.
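
A minimal sketch of the `asyncio.to_thread()` approach follows. `blocking_extract` is a hypothetical stand-in for a synchronous extraction call, used here so the example runs on its own.

```python
import asyncio
import time


def blocking_extract(name: str) -> str:
    """Stand-in for a synchronous extraction call (hypothetical placeholder)."""
    time.sleep(0.01)  # simulate blocking file parsing
    return f"text from {name}"


async def extract_many(names: list[str]) -> list[str]:
    # asyncio.to_thread runs each blocking call in a worker thread,
    # so the event loop stays responsive while extraction is in progress.
    tasks = [asyncio.to_thread(blocking_extract, n) for n in names]
    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*tasks)


results = asyncio.run(extract_many(["a.pdf", "b.docx"]))
```

In real code, the body of `blocking_extract` would be replaced by the synchronous extraction call shown in Quick Start.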