Tutorial

End-to-end walkthrough covering every supported format

PDF files

From a file path
from TextSpitter import TextSpitter

text = TextSpitter(filename="annual_report.pdf")
print(f"Extracted {len(text)} characters")

Internally, pdf_file_read() tries PyMuPDF first, then falls back to pypdf if PyMuPDF fails for any reason (import error, corrupt file, platform restriction).
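The fallback order described above can be pictured as a short chain of readers. The following is only a sketch of that pattern, not TextSpitter's actual code: `first_successful`, `read_with_pymupdf`, and `read_with_pypdf` are hypothetical names, and the two reader functions assume PyMuPDF (imported as `fitz`) and pypdf are installed.

```python
import io
import logging
from typing import Callable, Iterable

logger = logging.getLogger(__name__)


def read_with_pymupdf(data: bytes) -> str:
    """Hypothetical reader: extract text with PyMuPDF."""
    import fitz  # imported lazily so an ImportError triggers the fallback

    doc = fitz.open(stream=data, filetype="pdf")
    try:
        return "\n".join(page.get_text() for page in doc)
    finally:
        doc.close()


def read_with_pypdf(data: bytes) -> str:
    """Hypothetical reader: extract text with pypdf."""
    import pypdf

    reader = pypdf.PdfReader(io.BytesIO(data))
    # extract_text() may yield nothing for a page, so guard with "or ''"
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def first_successful(data: bytes, readers: Iterable[Callable[[bytes], str]]) -> str:
    """Try each reader in order and return the first result that does not raise."""
    last_error = None
    for reader in readers:
        try:
            return reader(data)
        except Exception as exc:  # deliberately broad: any failure moves on
            name = getattr(reader, "__name__", repr(reader))
            logger.warning("%s failed: %s", name, exc)
            last_error = exc
    raise RuntimeError("all PDF readers failed") from last_error
```

Under these assumptions, `first_successful(pdf_bytes, [read_with_pymupdf, read_with_pypdf])` would mirror the order described above. Catching `Exception` broadly is a deliberate choice here, since the text says the fallback applies when the first reader fails "for any reason".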

From bytes (downloaded at runtime)
import httpx
from TextSpitter import TextSpitter

response = httpx.get("https://example.com/report.pdf")
text = TextSpitter(file_obj=response.content, filename="report.pdf")
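When the bytes come from the network, it may be worth confirming that the download succeeded and actually looks like a PDF before extracting. A minimal sketch, assuming httpx is installed; `looks_like_pdf` and `fetch_pdf_bytes` are hypothetical helpers and not part of TextSpitter.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: PDF files normally begin with the marker %PDF-."""
    return data.startswith(b"%PDF-")


def fetch_pdf_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Download a URL and return its body, failing early on obvious problems."""
    import httpx  # imported lazily so looks_like_pdf works without httpx

    response = httpx.get(url, timeout=timeout, follow_redirects=True)
    response.raise_for_status()  # raises on 4xx/5xx instead of passing an error page on
    if not looks_like_pdf(response.content):
        raise ValueError(f"{url} did not return PDF content")
    return response.content
```

The returned bytes can then be passed as `file_obj` exactly as in the example above.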
Scanned / image-only PDFs

TextSpitter extracts embedded text layers only, so scanned PDFs with no text layer return an empty string. For those files, run an OCR tool first (e.g. pytesseract) and use the resulting text directly.
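One way to act on this is to check whether extraction came back empty and only then fall back to OCR. The sketch below is an assumption about how such a step could look: `needs_ocr` and `ocr_pdf` are hypothetical helpers, and `ocr_pdf` assumes pdf2image (with Poppler) and pytesseract (with the Tesseract binary) are installed.

```python
def needs_ocr(text: str, min_chars: int = 1) -> bool:
    """Return True when extraction produced no usable text (whitespace ignored)."""
    return len(text.strip()) < min_chars


def ocr_pdf(path: str) -> str:
    """Hypothetical OCR fallback: rasterize each page, then run Tesseract on it."""
    from pdf2image import convert_from_path  # requires Poppler on the system
    import pytesseract  # requires the Tesseract binary on the system

    pages = convert_from_path(path)  # one PIL image per page
    return "\n".join(pytesseract.image_to_string(page) for page in pages)
```

A caller would extract with TextSpitter first and call `ocr_pdf` only when `needs_ocr` returns True, which avoids the slow OCR path for PDFs that already carry a text layer.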

Word documents (DOCX)

From a file path
from TextSpitter import TextSpitter

text = TextSpitter(filename="contract.docx")

docx_file_read() uses python-docx to iterate every Paragraph and joins them with newlines.
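For readers unfamiliar with python-docx, the behaviour described above corresponds roughly to the sketch below. It is not the library's source: `join_paragraphs` and `docx_to_text` are hypothetical names, and `docx_to_text` assumes python-docx is installed.

```python
from typing import Iterable


def join_paragraphs(paragraph_texts: Iterable[str]) -> str:
    """Join paragraph texts with newlines; empty paragraphs become blank lines."""
    return "\n".join(paragraph_texts)


def docx_to_text(source) -> str:
    """Hypothetical equivalent of the reader: source is a path or a file-like object."""
    from docx import Document  # python-docx, imported lazily

    document = Document(source)
    return join_paragraphs(paragraph.text for paragraph in document.paragraphs)
```

Note that in python-docx, `document.paragraphs` covers body paragraphs only. If the reader iterates paragraphs as described, text inside tables would presumably not appear in the output.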

From a FastAPI upload
from fastapi import UploadFile
from io import BytesIO
from TextSpitter import TextSpitter

async def extract(file: UploadFile) -> str:
    data = await file.read()
    return TextSpitter(file_obj=BytesIO(data), filename=file.filename)

Plain text and CSV

TXT
text = TextSpitter(filename="notes.txt")

text_file_read() tries UTF-8, then latin-1. If both fail, it falls back to UTF-8 with replacement characters and logs a warning.
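The cascade can be sketched in a few lines of standard-library Python. This illustrates the described behaviour and is not the library's implementation; `decode_with_fallback` and its `encodings` parameter are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def decode_with_fallback(data: bytes, encodings=("utf-8", "latin-1")) -> str:
    """Try each encoding in order; as a last resort, decode UTF-8 with replacements."""
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every possible byte, so with the default cascade this
    # branch acts purely as a safety net.
    logger.warning("could not decode with %s; replacing invalid bytes", encodings)
    return data.decode("utf-8", errors="replace")
```

Trying UTF-8 first matters: UTF-8 rejects most byte sequences that are not valid UTF-8, whereas latin-1 accepts anything, so reversing the order would silently mis-decode UTF-8 files.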

CSV — raw string
import csv
import io
from TextSpitter import TextSpitter

raw = TextSpitter(filename="data.csv")
rows = list(csv.reader(io.StringIO(raw)))
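If the CSV has a header row, the same raw string can be turned into dictionaries keyed by column name. A small sketch using only the standard library; `rows_as_dicts` is a hypothetical helper, and it assumes the first line of the file is the header.

```python
import csv
import io


def rows_as_dicts(raw: str) -> list[dict[str, str]]:
    """Parse CSV text whose first row is a header into one dict per data row."""
    # newline="" leaves line-ending handling to the csv module
    return list(csv.DictReader(io.StringIO(raw, newline="")))
```

For example, `rows_as_dicts(raw)` on the string extracted above would give rows that can be indexed by column name instead of position. All values remain strings, so numeric columns still need explicit conversion.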
Non-UTF-8 encoded files

Files saved in latin-1 or Windows-1252 are handled automatically; no configuration is required.

text = TextSpitter(filename="legacy_export.txt") # cp1252? latin-1? works.
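One caveat, assuming the fallback decodes with latin-1 as described earlier: bytes in the range 0x80 to 0x9F, which Windows-1252 uses for curly quotes, dashes, and the euro sign, come out of a latin-1 decode as invisible control characters. If that shows up in extracted text, a post-processing step along these lines can restore them. `repair_cp1252` is a hypothetical helper, not part of TextSpitter.

```python
def repair_cp1252(text: str) -> str:
    """Reinterpret text that was decoded as latin-1 but was really Windows-1252."""
    try:
        # Round-trip: recover the original bytes, then decode them as cp1252.
        return text.encode("latin-1").decode("cp1252")
    except UnicodeError:
        # Not latin-1 representable, or contains bytes cp1252 leaves undefined:
        # return the input unchanged rather than guess.
        return text
```

Text that was genuinely latin-1 and contains no characters in that range passes through unchanged, so the helper is safe to apply to either kind of file.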

Source code files

50+ programming-language extensions are routed through code_file_read(), which uses a wider encoding cascade than the plain-text reader.

Single file
source = TextSpitter(filename="main.py")
Batch extraction from a directory
from pathlib import Path
from TextSpitter import TextSpitter

texts = {
    str(p): TextSpitter(filename=str(p))
    for p in Path("my_project").rglob("*.py")
}
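In a large tree, one unreadable file would abort the whole comprehension. A variation that records failures and keeps going might look like the sketch below. `extract_tree` is a hypothetical helper; the `extract` argument stands in for whatever extraction call is used, for example `lambda p: TextSpitter(filename=p)`.

```python
from pathlib import Path
from typing import Callable


def extract_tree(root, pattern: str, extract: Callable[[str], str]):
    """Run extract() on every file under root matching pattern.

    Returns (texts, errors): two dicts keyed by path string.
    """
    texts: dict[str, str] = {}
    errors: dict[str, Exception] = {}
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue  # rglob can also match directories
        try:
            texts[str(path)] = extract(str(path))
        except Exception as exc:  # keep going; report the failure afterwards
            errors[str(path)] = exc
    return texts, errors
```

Returning the errors alongside the texts lets the caller decide whether a partial result is acceptable, instead of losing the whole batch to a single bad file.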
Supported extensions

.py .js .ts .jsx .tsx .java .kt .go .rs .cpp .c .cs .rb .php .swift .dart .elm .ex .erl .jl .nim .zig .lua .r .sql .sh .bash .html .css .scss .json .yaml .toml .xml .md .rst and more.
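Routing by extension can be pictured as a simple lookup. The sketch below is an assumption about the general shape of such a dispatch, not the library's routing table: `pick_reader` and `CODE_EXTENSIONS` are hypothetical, the set holds only a few of the extensions listed above, and case-insensitive matching plus the plain-text default are guesses.

```python
from pathlib import Path

# Illustrative subset only; the real list is longer (see above).
CODE_EXTENSIONS = {".py", ".js", ".ts", ".go", ".rs", ".java", ".sql", ".sh"}


def pick_reader(filename: str) -> str:
    """Return the name of the reader method a file would be routed to."""
    ext = Path(filename).suffix.lower()  # assumption: matching ignores case
    if ext == ".pdf":
        return "pdf_file_read"
    if ext == ".docx":
        return "docx_file_read"
    if ext in CODE_EXTENSIONS:
        return "code_file_read"
    return "text_file_read"  # assumption: everything else is treated as plain text
```

Under these assumptions, `pick_reader("src/main.rs")` would return `"code_file_read"`.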

Using WordLoader directly

from TextSpitter.main import WordLoader

loader = WordLoader(filename="document.pdf")
text = loader.file_load()

Using FileExtractor directly

from TextSpitter.core import FileExtractor

# pdf_bytes holds the raw bytes of a PDF obtained elsewhere (for example, a download)
fe = FileExtractor(file_obj=pdf_bytes, filename="report.pdf")
raw = fe.get_contents()    # always bytes
text = fe.pdf_file_read()  # extracted text

This is useful when you need to call multiple reader methods on the same file, or to inspect file_name / file_ext before extracting.