- SOURCE_TYPE_PATTERN extended to include pdf_document
- src/services/pdf_ingest.py: pdfplumber with Tesseract OCR fallback, translation into DE+EN, one pool article per PDF
- Scheduler job pdf_ingest runs every minute and processes pdf_document sources with processed_at IS NULL
- scripts/migrate_pdf_source.py: idempotent DB migration (sources.pdf_path/pdf_sha256/processed_at, articles.headline_en/content_en)
- requirements.txt: pdfplumber, pytesseract, pdf2image, Pillow
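The idempotent migration mentioned in the commit message could look roughly like the sketch below. It is not the content of scripts/migrate_pdf_source.py: it assumes a SQLite database (the project depends on aiosqlite), uses the synchronous sqlite3 module for brevity, and guesses TEXT as the column type. The names NEW_COLUMNS, existing_columns, and migrate are hypothetical.

```python
import sqlite3

# Columns named in the commit message. The SQL types are assumptions.
NEW_COLUMNS = {
    "sources": {"pdf_path": "TEXT", "pdf_sha256": "TEXT", "processed_at": "TEXT"},
    "articles": {"headline_en": "TEXT", "content_en": "TEXT"},
}


def existing_columns(conn: sqlite3.Connection, table: str) -> set[str]:
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}


def migrate(conn: sqlite3.Connection) -> list[str]:
    """Add missing columns only, so running the migration twice is harmless."""
    added = []
    for table, columns in NEW_COLUMNS.items():
        present = existing_columns(conn, table)
        for name, sql_type in columns.items():
            if name not in present:
                conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
                added.append(f"{table}.{name}")
    conn.commit()
    return added
```

Checking the existing columns before each ALTER TABLE is what makes the script safe to rerun: a second call finds nothing to add and returns an empty list.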
24 lines
399 B
Plaintext
fastapi==0.115.6
uvicorn[standard]==0.34.0
python-jose[cryptography]
bcrypt
aiosqlite
feedparser
httpx
apscheduler==3.10.4
websockets
python-multipart
aiosmtplib
geonamescache>=2.0
telethon
# Report export (PDF via WeasyPrint + DOCX via python-docx)
Jinja2>=3.1
weasyprint>=68.0
python-docx>=1.2
pikepdf>=9.0
# PDF sources (ingestion)
pdfplumber>=0.11
pytesseract>=0.3
pdf2image>=1.17
Pillow>=10.0
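As a rough illustration of how the PDF ingestion dependencies might fit together, the sketch below tries the embedded text layer with pdfplumber first and falls back to Tesseract OCR when too little text comes back. It is not the code in src/services/pdf_ingest.py. The function names, the min_chars threshold, the 300 dpi setting, and the "deu+eng" language string are assumptions. The OCR path also needs the poppler and tesseract binaries installed on the system.

```python
from typing import Callable


def extract_with_pdfplumber(path: str) -> str:
    # Imported lazily so this module still loads where pdfplumber is absent.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        # extract_text() may return None for pages without a text layer.
        return "\n".join((page.extract_text() or "") for page in pdf.pages)


def extract_with_ocr(path: str, lang: str = "deu+eng") -> str:
    # pdf2image renders pages to PIL images; pytesseract runs OCR on each one.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(path, dpi=300)
    return "\n".join(pytesseract.image_to_string(img, lang=lang) for img in images)


def extract_text(
    path: str,
    min_chars: int = 50,
    text_layer: Callable[[str], str] = extract_with_pdfplumber,
    ocr: Callable[[str], str] = extract_with_ocr,
) -> str:
    """Return the text layer if it is long enough, otherwise the OCR result."""
    text = text_layer(path).strip()
    if len(text) >= min_chars:
        return text
    return ocr(path).strip()
```

Passing the two extractors as parameters keeps the fallback decision testable without real PDFs or the OCR toolchain, for example by substituting plain lambdas.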