Tutorial - TextSpitter


End-to-end walkthrough covering every supported format

PDF files

From a file path

from TextSpitter import TextSpitter

text = TextSpitter(filename="annual_report.pdf")
print(f"Extracted {len(text)} characters")

Internally, pdf_file_read() tries PyMuPDF first, then falls back to pypdf if PyMuPDF fails for any reason (import error, corrupt file, platform restriction).
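
The fallback order described above can be pictured as a loop that tries each reader in turn. The following is a minimal sketch of that pattern, not TextSpitter's actual implementation; `read_with_fallback` and the two stand-in readers are hypothetical names introduced for illustration.

```python
from typing import Callable, Sequence


def read_with_fallback(data: bytes, readers: Sequence[Callable[[bytes], str]]) -> str:
    """Return the result of the first reader that does not raise."""
    errors = []
    for reader in readers:
        try:
            return reader(data)
        except Exception as exc:  # import error, corrupt file, platform restriction, ...
            errors.append(f"{reader.__name__}: {exc}")
    raise RuntimeError("all readers failed: " + "; ".join(errors))


# Hypothetical stand-ins for a PyMuPDF-based and a pypdf-based reader.
def primary_reader(data: bytes) -> str:
    raise ImportError("PyMuPDF is not installed")


def secondary_reader(data: bytes) -> str:
    return f"{len(data)} bytes read by the fallback"


result = read_with_fallback(b"%PDF-1.7", [primary_reader, secondary_reader])
print(result)
```

Catching the broad `Exception` mirrors the "for any reason" wording above; a stricter design might catch only the error types each backend is known to raise.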

From bytes (downloaded at runtime)

import httpx
from TextSpitter import TextSpitter

response = httpx.get("https://example.com/report.pdf")
response.raise_for_status()  # fail early on 4xx/5xx instead of parsing an error page
text = TextSpitter(file_obj=response.content, filename="report.pdf")

Scanned / image-only PDFs

TextSpitter extracts embedded text layers only. Scanned PDFs with no text layer return an empty string. Pre-process with an OCR tool (e.g. pytesseract) and pass the resulting text directly.
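
One way to act on the empty-string behaviour is to check the extracted text and call an OCR step only when nothing came back. This is a rough sketch under stated assumptions: `text_or_ocr` and `fake_ocr` are hypothetical helpers, and the OCR step is passed in as a callable so the example runs without any OCR dependency.

```python
from typing import Callable


def text_or_ocr(extracted: str, run_ocr: Callable[[], str]) -> str:
    """Keep the extracted text layer if it has content; otherwise call OCR."""
    if extracted.strip():
        return extracted
    return run_ocr()


# Hypothetical OCR step; in practice this might render pages to images
# and pass each one to pytesseract.image_to_string().
def fake_ocr() -> str:
    return "text recovered by OCR"


print(text_or_ocr("", fake_ocr))
print(text_or_ocr("Embedded text layer", fake_ocr))
```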

Word documents (DOCX)

From a file path

from TextSpitter import TextSpitter

text = TextSpitter(filename="contract.docx")

docx_file_read() uses python-docx to iterate every Paragraph and joins them with newlines.
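
The paragraph-joining step can be sketched as follows. This is an illustration of the idea rather than the library's code: `join_paragraphs` is a hypothetical helper, and stand-in objects with a `.text` attribute replace real python-docx paragraphs so the example runs without the dependency.

```python
from types import SimpleNamespace
from typing import Iterable


def join_paragraphs(paragraphs: Iterable) -> str:
    """Join the .text of each paragraph-like object with newlines."""
    return "\n".join(p.text for p in paragraphs)


# Stand-in objects; with python-docx installed the input would be
# docx.Document("contract.docx").paragraphs instead.
sample = [
    SimpleNamespace(text="Clause 1"),
    SimpleNamespace(text=""),
    SimpleNamespace(text="Clause 2"),
]
print(repr(join_paragraphs(sample)))
```

In python-docx, `Document.paragraphs` covers body paragraphs only and table cells are reached through `Document.tables`, so whether table text appears in TextSpitter's output depends on details this tutorial does not state.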

From a FastAPI upload

from fastapi import UploadFile
from io import BytesIO
from TextSpitter import TextSpitter

async def extract(file: UploadFile) -> str:
    data = await file.read()
    return TextSpitter(file_obj=BytesIO(data), filename=file.filename)

Plain text and CSV

TXT

text = TextSpitter(filename="notes.txt")

text_file_read() tries UTF-8 first, then latin-1. If both fail, it falls back to UTF-8 with replacement characters and logs a warning.
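
The decoding cascade described above might look roughly like this. It is a sketch of the technique, not the body of text_file_read(); `decode_with_fallback`, its `encodings` parameter, and the logger name are assumptions made for the example.

```python
import logging

logger = logging.getLogger("decode_sketch")


def decode_with_fallback(data: bytes, encodings=("utf-8", "latin-1")) -> str:
    """Try each encoding in order; fall back to UTF-8 with replacement characters."""
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    logger.warning("no encoding in %s matched; replacing undecodable bytes", encodings)
    return data.decode("utf-8", errors="replace")


print(decode_with_fallback("café".encode("utf-8")))    # decoded as UTF-8
print(decode_with_fallback("café".encode("latin-1")))  # UTF-8 fails, latin-1 succeeds
```

In Python, latin-1 maps every byte value to a code point, so the second step of this sketch accepts any input; the replacement branch is only reached when a stricter encoding list is passed, for example `encodings=("utf-8", "ascii")`.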

CSV — raw string

import csv
import io
from TextSpitter import TextSpitter

raw = TextSpitter(filename="data.csv")
rows = list(csv.reader(io.StringIO(raw)))

Non-UTF-8 encoded files

Files saved in latin-1 or Windows-1252 are handled automatically — no configuration required.

text = TextSpitter(filename="legacy_export.txt") # cp1252? latin-1? works.

Source code files

More than 50 programming-language extensions are routed through code_file_read(), which uses a wider encoding cascade than the plain-text reader.

Single file

source = TextSpitter(filename="main.py")

Batch extraction from a directory

from pathlib import Path
from TextSpitter import TextSpitter

texts = {
    str(p): TextSpitter(filename=str(p))
    for p in Path("my_project").rglob("*.py")
}

Supported extensions

.py .js .ts .jsx .tsx .java .kt .go .rs .cpp .c .cs .rb .php .swift .dart .elm .ex .erl .jl .nim .zig .lua .r .sql .sh .bash .html .css .scss .json .yaml .toml .xml .md .rst and more.
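
Routing by extension, as described in this section, can be sketched as a suffix lookup. The table below is hypothetical and lists only a handful of extensions; it reuses the reader names mentioned in this tutorial, but the lookup logic and the `pick_reader` helper are assumptions, not the library's dispatch code.

```python
from pathlib import Path

# Hypothetical routing table; only a few extensions are shown.
READER_BY_EXT = {
    ".pdf": "pdf_file_read",
    ".docx": "docx_file_read",
    ".txt": "text_file_read",
}
CODE_EXTS = {".py", ".js", ".ts", ".go", ".rs", ".sql", ".sh"}


def pick_reader(filename: str) -> str:
    """Return the name of the reader a file would be routed to."""
    ext = Path(filename).suffix.lower()  # case-insensitive match on the suffix
    if ext in CODE_EXTS:
        return "code_file_read"
    return READER_BY_EXT.get(ext, "unsupported")


print(pick_reader("main.py"))
print(pick_reader("annual_report.pdf"))
```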

Using `WordLoader` directly

from TextSpitter.main import WordLoader

loader = WordLoader(filename="document.pdf")
text = loader.file_load()

Using `FileExtractor` directly

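from pathlib import Path

from TextSpitter.core import FileExtractor

pdf_bytes = Path("report.pdf").read_bytes()  # raw PDF bytes, e.g. read from disk
fe = FileExtractor(file_obj=pdf_bytes, filename="report.pdf")
raw = fe.get_contents()    # always bytes
text = fe.pdf_file_read()  # extracted text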

This is useful when you need to call multiple reader methods on the same file, or to inspect file_name / file_ext before extracting.