feat(rss/telegram): language-aware keyword matching for non-Latin sources

Until now, Haiku generated keywords only in DE/EN/Romaji. Japanese RSS feeds
(e.g. MOD-GNews with "防衛省・自衛隊の宇宙政策") therefore never matched, because
"jieitai" ≠ "自衛隊". Arabic/Persian Telegram channels matched only by
chance (Latin-script proper nouns in hashtags/URLs).

Three related changes:
1. get_feeds_with_metadata now also returns primary_language for each feed.
2. FEED_SELECTION_PROMPT_TEMPLATE and KEYWORD_EXTRACTION_PROMPT request
keywords grouped by language ({de:[...], en:[...], ja:[...], ru:[...], ...}).
"en" holds Latin-script proper nouns (universal). Keywords in other languages
are matched only against feeds in the same language.
3. The RSS and Telegram parsers combine, per feed/channel, the "en" universal
terms with the keywords of the source language. The specificity threshold
(single-hit match) now also applies from 3 characters for non-ASCII keywords
(CJK, Arabic, Cyrillic).

Backward compatible: flat keyword lists are still accepted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
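
The specificity threshold from item 3 is not part of the hunks shown on this page, so the following is only a minimal sketch of how such a rule could look. The function names `is_specific` and `matches`, the ASCII minimum length of 6, and the "at least two hits otherwise" rule are assumptions made for illustration; only the 3-character minimum for non-ASCII keywords comes from the commit message.

```python
def is_specific(keyword: str, ascii_min_len: int = 6, non_ascii_min_len: int = 3) -> bool:
    """Return True if a single hit on this keyword should count as a match.

    ascii_min_len is an assumed value. Only the 3-character minimum for
    non-ASCII keywords (CJK, Arabic, Cyrillic) is stated in the commit message.
    """
    kw = keyword.strip()
    if kw.isascii():
        return len(kw) >= ascii_min_len
    return len(kw) >= non_ascii_min_len


def matches(text: str, keywords: list[str], min_hits: int = 2) -> bool:
    """Assumed rule: one specific keyword is enough, otherwise min_hits hits are required."""
    haystack = text.lower()
    # Plain substring search, which also works for scripts without word separators.
    hits = [kw for kw in keywords if kw and kw.lower() in haystack]
    if any(is_specific(kw) for kw in hits):
        return True
    return len(hits) >= min_hits
```

Under these assumptions, `matches("防衛省・自衛隊の宇宙政策", ["自衛隊"])` is true because the three-character CJK keyword counts as specific, while `matches("nato summit in riga", ["nato"])` is false because a short ASCII keyword needs a second hit.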
@@ -61,37 +61,49 @@ class TelegramParser:
             return None
 
     async def search_channels(self, search_term: str, tenant_id: int = None,
-                              keywords: list[str] = None, channel_ids: list[int] = None) -> list[dict]:
+                              keywords: dict | list = None, channel_ids: list[int] = None) -> list[dict]:
         """Reads messages from the configured Telegram channels.
 
+        Args:
+            keywords: Language dict {iso_lang: [keyword,...]} or flat list (backward compatible).
+                Matching uses, per channel, the "en" universal terms plus the keywords of the
+                channel language (primary_language from the sources table).
+
         Returns article dicts (compatible with the RSS parser format).
         """
+        from agents.researcher import keywords_for_language
+
         client = await self._get_client()
         if not client:
             logger.warning("Telegram-Client nicht verfuegbar, ueberspringe Telegram-Pipeline")
             return []
 
-        # Load Telegram channels from the DB
+        # Load Telegram channels from the DB (incl. primary_language)
         channels = await self._get_telegram_channels(tenant_id, channel_ids=channel_ids)
         if not channels:
             logger.info("Keine Telegram-Kanaele konfiguriert")
             return []
 
-        # Prepare search words
-        if keywords:
-            search_words = [w.lower().strip() for w in keywords if w.strip()]
-        else:
-            search_words = [
+        # Fallback search words when no keywords are given
+        fallback_words: list[str] | None = None
+        if not keywords:
+            fallback_words = [
                 w for w in search_term.lower().split()
                 if w not in STOP_WORDS and len(w) >= 3
             ]
-            if not search_words:
-                search_words = search_term.lower().split()[:2]
+            if not fallback_words:
+                fallback_words = search_term.lower().split()[:2]
 
         # Fetch channels in parallel
         tasks = []
         for ch in channels:
             channel_id = ch["url"] or ch["name"]
+            channel_lang = ch.get("primary_language")
+            if keywords:
+                search_words = keywords_for_language(keywords, channel_lang)
+                search_words = [w.lower() for w in search_words]
+            else:
+                search_words = fallback_words or []
             tasks.append(self._fetch_channel(client, channel_id, search_words))
 
         results = await asyncio.gather(*tasks, return_exceptions=True)
@@ -115,7 +127,7 @@ class TelegramParser:
             if channel_ids and len(channel_ids) > 0:
                 placeholders = ",".join("?" for _ in channel_ids)
                 cursor = await db.execute(
-                    f"""SELECT id, name, url, category, notes FROM sources
+                    f"""SELECT id, name, url, category, notes, primary_language FROM sources
                         WHERE source_type = 'telegram_channel'
                         AND status = 'active'
                         AND id IN ({placeholders})""",
@@ -123,7 +135,7 @@ class TelegramParser:
                 )
             else:
                 cursor = await db.execute(
-                    """SELECT id, name, url, category, notes FROM sources
+                    """SELECT id, name, url, category, notes, primary_language FROM sources
                         WHERE source_type = 'telegram_channel'
                         AND status = 'active'
                         AND (tenant_id IS NULL OR tenant_id = ?)""",
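
The first hunk imports keywords_for_language from agents.researcher, but the body of that helper is not part of this diff. The sketch below shows one way the per-channel selection described in the commit message could work. The name `select_keywords` is a hypothetical stand-in, and the normalisation and de-duplication steps are assumptions, not the project's actual implementation.

```python
from __future__ import annotations


def select_keywords(
    keywords: dict[str, list[str]] | list[str] | None,
    language: str | None,
) -> list[str]:
    """Hypothetical stand-in for keywords_for_language (real body not shown in the diff).

    Combines the universal "en" keywords with those of the source language.
    A flat list is accepted unchanged for backward compatibility.
    """
    if not keywords:
        return []
    if isinstance(keywords, list):
        # Legacy flat list: the same keywords apply to every source.
        candidates = list(keywords)
    else:
        # "en" holds Latin-script proper nouns that may appear in any source.
        candidates = list(keywords.get("en", []))
        if language and language != "en":
            candidates += keywords.get(language, [])

    # Assumed normalisation: trim, lowercase, drop empties and duplicates, keep order.
    seen: set[str] = set()
    result: list[str] = []
    for kw in candidates:
        norm = kw.strip().lower()
        if norm and norm not in seen:
            seen.add(norm)
            result.append(norm)
    return result
```

With `{"de": ["Weltraum"], "en": ["JAXA"], "ja": ["自衛隊"]}` as input, a channel whose primary_language is "ja" would be searched with the "en" and "ja" keywords, while a channel without a known language would fall back to the "en" keywords alone.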