End-to-end walkthrough covering every supported format
PDF files
From a file path
from TextSpitter import TextSpitter
text = TextSpitter(filename="annual_report.pdf")
print(f"Extracted {len(text)} characters")
Internally, pdf_file_read() tries PyMuPDF first, then falls back to pypdf if PyMuPDF fails for any reason (import error, corrupt file, platform restriction).
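The try-first, fall-back-second pattern described here can be sketched in isolation. This is a minimal illustration under stated assumptions, not the library's code: `extract_with_fallback`, `primary_reader`, and `secondary_reader` are hypothetical names, and the two readers are stand-ins for the PyMuPDF and pypdf backends.

```python
from typing import Callable, Sequence


def extract_with_fallback(
    data: bytes, readers: Sequence[Callable[[bytes], str]]
) -> str:
    """Return the output of the first reader that does not raise."""
    errors = []
    for reader in readers:
        try:
            return reader(data)
        except Exception as exc:  # any failure moves on to the next backend
            name = getattr(reader, "__name__", repr(reader))
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all readers failed: " + "; ".join(errors))


# Hypothetical stand-ins for the two PDF backends.
def primary_reader(data: bytes) -> str:
    raise ImportError("primary backend unavailable")


def secondary_reader(data: bytes) -> str:
    return data.decode("ascii")


result = extract_with_fallback(b"hello", [primary_reader, secondary_reader])
```

Catching `Exception` broadly mirrors the "fails for any reason" wording above; collecting the messages keeps the original cause visible if every backend fails.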
From bytes (downloaded at runtime)
import httpx
from TextSpitter import TextSpitter
response = httpx.get("https://example.com/report.pdf")
response.raise_for_status()  # stop here rather than parse an error page as a PDF
text = TextSpitter(file_obj=response.content, filename="report.pdf")
Scanned / image-only PDFs
TextSpitter extracts embedded text layers only. Scanned PDFs with no text layer return an empty string. Run such files through an OCR tool (e.g. pytesseract) and use the OCR output directly in place of the TextSpitter result.
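One way to wire this into a pipeline is to check whether extraction produced any text and only then invoke OCR. The sketch below assumes a hypothetical helper, `text_or_ocr`, and a caller-supplied `run_ocr` callable; neither is part of TextSpitter.

```python
from typing import Callable


def text_or_ocr(extracted: str, run_ocr: Callable[[], str]) -> str:
    """Keep the extracted text if it has content, otherwise call the OCR step."""
    if extracted.strip():
        return extracted
    # Empty or whitespace-only result: treat the file as image-only.
    return run_ocr()
```

In practice `extracted` would be the string returned by TextSpitter, and `run_ocr` would wrap whatever OCR tool you choose. Passing it as a callable means the expensive OCR step only runs when it is needed.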
Word documents (DOCX)
From a file path
from TextSpitter import TextSpitter
text = TextSpitter(filename="contract.docx")
docx_file_read() uses python-docx to iterate every Paragraph and joins them with newlines.
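To make "iterate every paragraph and join with newlines" concrete, here is a standard-library sketch that does roughly the same thing by reading the DOCX XML directly. It is a swapped-in technique for illustration, not what docx_file_read() does (that uses python-docx); `docx_paragraph_text` is a hypothetical name.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by the w:p (paragraph) and w:t (text) elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraph_text(data: bytes) -> str:
    """Join the text of each top-level body paragraph with newlines."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    body = root.find(f"{{{W_NS}}}body")
    if body is None:
        return ""
    paragraphs = []
    for para in body.findall(f"{{{W_NS}}}p"):
        # A paragraph's text is the concatenation of its w:t runs.
        runs = [t.text or "" for t in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Like a paragraph-only walk, this sketch skips text that lives inside tables, headers, and footers.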
From a FastAPI upload
from fastapi import UploadFile
from io import BytesIO
from TextSpitter import TextSpitter
async def extract(file: UploadFile) -> str:
    data = await file.read()
    return TextSpitter(file_obj=BytesIO(data), filename=file.filename)
Plain text and CSV
TXT
text = TextSpitter(filename="notes.txt")
text_file_read() tries UTF-8, then latin-1. If both fail, it falls back to UTF-8 with replacement characters and logs a warning.
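The cascade described here can be sketched with the standard library alone. This is an illustration of the technique under assumptions, not the body of text_file_read(): `decode_with_fallback` and its `encodings` parameter are hypothetical.

```python
import logging
from typing import Sequence

logger = logging.getLogger(__name__)


def decode_with_fallback(
    data: bytes, encodings: Sequence[str] = ("utf-8", "latin-1")
) -> str:
    """Try each encoding in order; as a last resort decode lossily as UTF-8."""
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate
    # Note: latin-1 maps every byte value, so with the default candidates this
    # branch is purely defensive. It matters when the candidate list is stricter.
    logger.warning("no candidate encoding matched; using UTF-8 with replacement")
    return data.decode("utf-8", errors="replace")
```

With `errors="replace"`, undecodable bytes become U+FFFD instead of raising, so the caller always gets a string back.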
CSV — raw string
import csv, io
from TextSpitter import TextSpitter
raw = TextSpitter(filename="data.csv")
rows = list(csv.reader(io.StringIO(raw)))
Non-UTF-8 encoded files
Files saved in latin-1 or Windows-1252 are handled automatically — no configuration required.
text = TextSpitter(filename="legacy_export.txt")  # cp1252? latin-1? works.
Source code files
50+ programming-language extensions are routed through code_file_read(), which uses a wider encoding cascade than the plain-text reader.
Single file
source = TextSpitter(filename="main.py")
Batch extraction from a directory
from pathlib import Path
from TextSpitter import TextSpitter
texts = {
    str(p): TextSpitter(filename=str(p))
    for p in Path("my_project").rglob("*.py")
}
Supported extensions
.py .js .ts .jsx .tsx .java .kt .go .rs .cpp .c .cs .rb .php .swift .dart .elm .ex .erl .jl .nim .zig .lua .r .sql .sh .bash .html .css .scss .json .yaml .toml .xml .md .rst and more.
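For batch runs across several languages, one approach is to walk the tree once and keep only files whose suffix is in a known set. The sketch below is an assumption-laden helper, not part of TextSpitter: `iter_code_files` is a hypothetical name and `CODE_EXTENSIONS` is a small subset copied from the list above.

```python
from pathlib import Path
from typing import Iterable, Iterator

# A subset of the extensions listed above; extend as needed.
CODE_EXTENSIONS = frozenset({".py", ".js", ".ts", ".go", ".rs", ".java"})


def iter_code_files(
    root: Path, extensions: Iterable[str] = CODE_EXTENSIONS
) -> Iterator[Path]:
    """Yield files under root whose suffix (case-insensitive) is in extensions."""
    wanted = {ext.lower() for ext in extensions}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in wanted:
            yield path
```

Each yielded path can then be passed to TextSpitter as `filename=str(path)`, in the same way as the batch example earlier in this section.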
Using WordLoader directly
from TextSpitter.main import WordLoader
loader = WordLoader(filename="document.pdf")
text = loader.file_load()
Using FileExtractor directly
from pathlib import Path
from TextSpitter.core import FileExtractor
pdf_bytes = Path("report.pdf").read_bytes()  # raw bytes of any PDF
fe = FileExtractor(file_obj=pdf_bytes, filename="report.pdf")
raw = fe.get_contents()  # always bytes
text = fe.pdf_file_read()  # extracted text
This is useful when you need to call multiple reader methods on the same file, or inspect file_name / file_ext before extracting.