Podcast Integration, Phase 1: Feed Tag + Broadcaster Pages
Podcasts are treated like regular RSS sources (source_type=podcast_feed).
No external paid service, no local transcription: Monitor uses only
transcripts that already exist.
Cascade for obtaining a transcript:
1. Podcasting 2.0 tag <podcast:transcript> in the feed (SRT/VTT/HTML/JSON)
2. Editorial manuscript on the episode page
   (adapters: Dlf, SZ, Spiegel, NDR)
3. YouTube captions (Phase 2, optional via yt-dlp)
If no stage matches, the episode is discarded (graceful, no error).
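The cascade above can be sketched as a first-match loop over stage callables. Names here (`fetch_transcript`, the stage signature) are illustrative, not the actual module API:

```python
from typing import Callable, Optional, Sequence

def fetch_transcript(
    episode: dict,
    stages: Sequence[Callable[[dict], Optional[str]]],
) -> Optional[str]:
    """Try each stage in order; the first non-empty transcript wins.

    A stage that raises is treated like a miss, so one broken scraper
    never aborts the run.  No match at all -> None (episode dropped)."""
    for stage in stages:
        try:
            text = stage(episode)
        except Exception:
            continue  # graceful: move on to the next stage
        if text:
            return text
    return None
```

The graceful-fallthrough property (stage errors are swallowed, an empty result just advances the cascade) is what lets a missing transcript end in a silent drop instead of an error.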
New:
- src/feeds/podcast_parser.py (dedicated parser; RSS hot path untouched)
- src/feeds/transcript_extractors/ (plugin pattern):
  __init__.py        dispatcher, cache lookup against podcast_transcripts
  _common.py         HTML extraction, domain matching, httpx helpers
  rss_native.py      stage 1: feed tag parser (SRT/VTT/JSON/HTML)
  website_dlf.py     stage 2: deutschlandfunk.de + sister domains
  website_sz.py      stage 2: sz.de / sueddeutsche.de
  website_spiegel.py stage 2: spiegel.de / manager-magazin.de
  website_ndr.py     stage 2: ndr.de
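Stage 1 reads the Podcasting 2.0 `<podcast:transcript>` elements straight from the feed XML. A minimal sketch with `xml.etree` (the real rss_native.py presumably also fetches and normalizes the SRT/VTT payloads); the namespace URI is the one defined by the Podcasting 2.0 spec, the function name is hypothetical:

```python
import xml.etree.ElementTree as ET

# Namespace of the Podcasting 2.0 tags (podcastindex.org spec)
PODCAST_NS = {"podcast": "https://podcastindex.org/namespace/1.0"}

def transcript_links(rss_xml: str) -> list:
    """Return (url, mime_type) pairs for every <podcast:transcript> per item."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        for t in item.findall("podcast:transcript", PODCAST_NS):
            links.append((t.get("url"), t.get("type")))
    return links
```

An episode without any such element simply yields no pairs, which is what hands control to stage 2.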
Changed:
- src/database.py: idempotent migration; new table podcast_transcripts serves as
  a URL cache against repeated scraping across incidents
- src/models.py: Pydantic pattern for source_type extended with podcast_feed
- src/source_rules.py: get_feeds_with_metadata() takes a source_type parameter,
  defaulting to rss_feed (RSS path unchanged)
- src/agents/orchestrator.py: new _podcast_pipeline() running in parallel with RSS,
  WebSearch, and Telegram; adhoc incidents only; dormant without podcast sources
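The migration's idempotence can come from a plain `CREATE TABLE IF NOT EXISTS`, safe to run on every startup. The column names below are an assumption for illustration, not the actual schema:

```python
import sqlite3

# Hypothetical schema; the real table may carry more columns.
DDL = """
CREATE TABLE IF NOT EXISTS podcast_transcripts (
    url        TEXT PRIMARY KEY,              -- transcript/episode URL = cache key
    transcript TEXT NOT NULL,
    fetched_at TEXT DEFAULT CURRENT_TIMESTAMP
)
"""

def migrate(conn: sqlite3.Connection) -> None:
    """Idempotent: running this twice is a no-op the second time."""
    conn.execute(DDL)
    conn.commit()
```

With the URL as primary key, a cache lookup before any scrape is a single indexed SELECT, which is what prevents re-scraping the same transcript across incidents.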
Verification:
- Migration ran successfully against the live DB (log: table podcast_transcripts created)
- Import/instantiation test of all modules passed
- can_handle tests per broadcaster adapter pass both positive and negative cases
- Live scrape against Dlf: 22710 characters; against SZ: 24918 characters
- Dormancy test: 0 podcast sources -> none of the new code runs during a refresh
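The positive/negative can_handle tests suggest each stage-2 adapter matches on its domains. A minimal sketch of that dispatch, with a hypothetical base class and domain list (the actual extractor interface may differ):

```python
from urllib.parse import urlparse

class WebsiteExtractor:
    """Hypothetical base class for stage-2 broadcaster adapters."""

    domains: tuple = ()

    def can_handle(self, url: str) -> bool:
        # Match the exact domain or any subdomain of it.
        host = urlparse(url).netloc.lower()
        return any(host == d or host.endswith("." + d) for d in self.domains)

class DlfExtractor(WebsiteExtractor):
    # Sister domains assumed from the adapter note above.
    domains = ("deutschlandfunk.de", "deutschlandfunkkultur.de")

def pick_extractor(url: str, extractors):
    """First adapter whose can_handle matches; None -> no stage-2 coverage."""
    return next((e for e in extractors if e.can_handle(url)), None)
```

Suffix matching on "." + domain avoids the classic false positive where "notdeutschlandfunk.de" would slip past a bare `endswith` check.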
Revertability: purely additive, RSS path untouched; rollback in three
steps (disable the sources, git revert, DROP TABLE podcast_transcripts).
Diff excerpt (src/agents/orchestrator.py):
@@ -781,6 +781,39 @@ class AgentOrchestrator:
             logger.info(f"Claude-Recherche: {len(results)} Ergebnisse")
             return results, usage
 
+        async def _podcast_pipeline():
+            """Podcast episode search (adhoc incidents only, existing transcripts only)."""
+            if incident_type != "adhoc":
+                logger.info("Recherche-Modus: Podcasts uebersprungen")
+                return [], None
+
+            from source_rules import get_feeds_with_metadata
+            podcast_feeds = await get_feeds_with_metadata(tenant_id=tenant_id, source_type="podcast_feed")
+            if not podcast_feeds:
+                return [], None
+
+            from feeds.podcast_parser import PodcastFeedParser
+            pd_parser = PodcastFeedParser()
+            pd_researcher = ResearcherAgent()
+
+            # Dynamic keywords (separate Haiku call, parallel to RSS:
+            # cheap and keeps the pipelines independent)
+            cursor_pd_hl = await db.execute(
+                """SELECT COALESCE(headline_de, headline) as hl
+                   FROM articles WHERE incident_id = ?
+                   AND COALESCE(headline_de, headline) IS NOT NULL
+                   ORDER BY collected_at DESC LIMIT 30""",
+                (incident_id,),
+            )
+            pd_headlines = [row["hl"] for row in await cursor_pd_hl.fetchall() if row["hl"]]
+            pd_keywords, pd_kw_usage = await pd_researcher.extract_dynamic_keywords(title, pd_headlines)
+            if pd_kw_usage:
+                usage_acc.add(pd_kw_usage)
+
+            articles = await pd_parser.search_feeds_selective(title, podcast_feeds, keywords=pd_keywords)
+            logger.info(f"Podcast-Pipeline: {len(articles)} Episoden gefunden")
+            return articles, None
+
         async def _telegram_pipeline():
             """Telegram channel search with AI-based channel selection."""
             from feeds.telegram_parser import TelegramParser
@@ -821,8 +854,8 @@ class AgentOrchestrator:
             logger.info(f"Telegram-Pipeline: {len(articles)} Nachrichten")
             return articles, None
 
-        # Start pipelines in parallel (RSS + WebSearch + optional Telegram)
-        pipelines = [_rss_pipeline(), _web_search_pipeline()]
+        # Start pipelines in parallel (RSS + WebSearch + podcasts + optional Telegram)
+        pipelines = [_rss_pipeline(), _web_search_pipeline(), _podcast_pipeline()]
         if include_telegram:
             pipelines.append(_telegram_pipeline())
 
@@ -830,7 +863,12 @@
 
         (rss_articles, rss_feed_usage) = pipeline_results[0]
         (search_results, search_usage) = pipeline_results[1]
-        telegram_articles = pipeline_results[2][0] if include_telegram else []
+        (podcast_articles, _podcast_usage) = pipeline_results[2]
+        telegram_articles = pipeline_results[3][0] if include_telegram else []
+
+        # Merge podcast articles into the RSS list (same downstream path)
+        if podcast_articles:
+            rss_articles = (rss_articles or []) + podcast_articles
 
         # URL verification only for WebSearch results (RSS URLs are already verified)
         if search_results: