Architecture, module map, and design decisions
Module map
TextSpitter/
├── __init__.py TextSpitter() — convenience function + __version__
├── cli.py main() — argparse CLI entry point
├── core.py FileExtractor — low-level reader
├── guide/ — this user guide (subpackage)
├── logger.py logger — optional loguru / stdlib shim
└── main.py WordLoader — format dispatcher
Three-layer design
Layer 1 — Public surface (__init__.py)
TextSpitter(file_obj, filename) is a plain function that creates a WordLoader, calls file_load(), and returns the extracted text as a str.
from TextSpitter import TextSpitter
text = TextSpitter(filename="notes.txt")
Layer 2 — Dispatcher (WordLoader)
WordLoader holds two class-level constants:
- FILE_EXT_MATRIX — maps file extensions to FileExtractor method names.
- TEXT_MIME_TYPES — MIME subtypes treated as plain-text.
file_load() checks the extension first, then is_programming_language_file(), then MIME type. Unknown formats log an error and return "".
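The lookup order described above can be sketched as follows. This is a minimal illustration, not the library's code: SketchLoader, SketchExtractor, the method names read_text and read_code, the CODE_EXTENSIONS set, and the example table entries are all hypothetical stand-ins, and guessing the MIME type from the filename is an assumption.

```python
import logging
import mimetypes
from pathlib import Path

log = logging.getLogger("textspitter")


class SketchExtractor:
    """Hypothetical stand-in for FileExtractor; method names are invented."""

    def __init__(self, data: bytes) -> None:
        self.data = data

    def read_text(self) -> str:
        return self.data.decode("utf-8")

    def read_code(self) -> str:
        return self.data.decode("utf-8")


class SketchLoader:
    """Hypothetical stand-in for WordLoader."""

    # Extension -> name of the extractor method that handles it.
    FILE_EXT_MATRIX = {"txt": "read_text", "csv": "read_text"}
    # MIME subtypes treated as plain text (illustrative values).
    TEXT_MIME_TYPES = {"plain", "csv"}
    # Illustrative set used by the programming-language check.
    CODE_EXTENSIONS = {"py", "rs", "go"}

    def __init__(self, data: bytes, filename: str) -> None:
        self.filename = filename
        self.ext = Path(filename).suffix.lower().lstrip(".")
        self.extractor = SketchExtractor(data)

    def is_programming_language_file(self) -> bool:
        return self.ext in self.CODE_EXTENSIONS

    def file_load(self) -> str:
        # 1. Known extension: look up the method name and call it.
        method_name = self.FILE_EXT_MATRIX.get(self.ext)
        if method_name is not None:
            return getattr(self.extractor, method_name)()
        # 2. Source-code file.
        if self.is_programming_language_file():
            return self.extractor.read_code()
        # 3. MIME type guessed from the filename (an assumption of this sketch).
        mime, _ = mimetypes.guess_type(self.filename)
        if mime is not None and mime.split("/")[-1] in self.TEXT_MIME_TYPES:
            return self.extractor.read_text()
        # Unknown format: log an error and return an empty string.
        log.error("Unsupported file type: %s", self.filename)
        return ""
```

Storing method names rather than bound methods keeps the table a plain class-level constant; getattr resolves the name against whichever extractor instance is in use.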
Layer 3 — Reader (FileExtractor)
FileExtractor.__init__ resolves any input into three normalised attributes:
| Attribute | Type | Description |
|---|---|---|
| file | Path \| IO \| bytes | The underlying file object |
| file_name | str | Filename including extension |
| file_ext | str | Lowercase extension without dot |
All reader methods call get_contents() first, which always returns bytes.
Input resolution
get_contents() handles four input shapes:
| self.file type | Behaviour |
|---|---|
| Path | Opens in binary mode and reads |
| BytesIO / SpooledTemporaryFile | Seeks to 0, then reads |
| bytes | Returned as-is |
| Other stream (IOBase) | Cast to BinaryIO, seek, read |
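A minimal sketch of this resolution, written as a standalone function. The name get_contents_sketch is hypothetical, and raising TypeError for an unsupported input is an assumption; the real method may handle that case differently.

```python
import io
from pathlib import Path
from tempfile import SpooledTemporaryFile
from typing import BinaryIO, cast


def get_contents_sketch(file) -> bytes:
    """Return raw bytes for each of the four input shapes listed above."""
    if isinstance(file, Path):
        # Path: open in binary mode and read everything.
        return file.read_bytes()
    if isinstance(file, (io.BytesIO, SpooledTemporaryFile)):
        # In-memory or spooled buffer: rewind before reading.
        file.seek(0)
        return file.read()
    if isinstance(file, bytes):
        # Already bytes: nothing to do.
        return file
    if isinstance(file, io.IOBase):
        # Any other stream: treat it as binary, rewind, read.
        stream = cast(BinaryIO, file)
        stream.seek(0)
        return stream.read()
    # Assumption of this sketch: reject anything else explicitly.
    raise TypeError(f"Unsupported input type: {type(file).__name__}")
```

Seeking to 0 matters when the caller has already read from the stream, for example an upload handler that inspected the first few bytes before passing the object on.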
PDF fallback chain
pymupdf.open(stream=…)
  │  success → return text
  └─ ImportError / any Exception
       │
       ▼
pypdf.PdfReader(BytesIO(…))
  │  success → return text
  └─ Exception
       │
       ▼
log error, return ""
Encoding strategy
_decode_bytes() uses a three-step cascade for TXT and CSV:
1. bytes.decode("utf-8") — strict
2. bytes.decode("latin-1") — strict (never raises for any byte value)
3. bytes.decode("utf-8", errors="replace") — logs a warning
code_file_read() tries a wider list: utf-8 → utf-8-sig → latin-1 → cp1252 → utf-8/replace.
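Both cascades follow the same pattern, which might look roughly like this. The helper decode_with_fallback and the two tuple constants are hypothetical names for illustration; only the encoding order comes from the text above.

```python
import logging

log = logging.getLogger("textspitter")

# Orders taken from the description above; the constant names are invented.
TEXT_ENCODINGS = ("utf-8", "latin-1")
CODE_ENCODINGS = ("utf-8", "utf-8-sig", "latin-1", "cp1252")


def decode_with_fallback(data: bytes, encodings=TEXT_ENCODINGS) -> str:
    """Try each encoding strictly, then fall back to lossy UTF-8."""
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: undecodable bytes become U+FFFD replacement characters.
    log.warning("Falling back to utf-8 with replacement characters")
    return data.decode("utf-8", errors="replace")
```

Because latin-1 maps every byte value to a code point, the lossy last step only runs when latin-1 is absent from the list; in the orders shown here it acts as a safety net.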
Optional logging
logger.py exports a single logger object. At import time it tries from loguru import logger; on ImportError it falls back to logging.getLogger("textspitter"). All call sites use this shim, so the rest of the codebase is backend-agnostic.
pip install "textspitter[logging]" # enable loguru
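The import-time fallback can be sketched in a few lines. This is an approximation of the pattern described above, not the contents of logger.py, and the log message is made up for the example.

```python
import logging

try:
    # Preferred backend, present only when the optional extra is installed.
    from loguru import logger
except ImportError:
    # Fallback: a stdlib logger with the same info/warning/error methods.
    logger = logging.getLogger("textspitter")

# Call sites look the same with either backend.
logger.warning("decoded with replacement characters")
```

The shim works because both backends expose the same basic level methods, so call sites that stick to those never need to know which one is active.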