fix: update Gemini URL scraper prompt to extract exact content instead of summaries
- Changed prompt to explicitly request original text content without rewriting
- Added clear instructions to preserve all claims and statements as written
- This ensures content can be properly indexed for fact-checking purposes
Co-authored-by: Johnson Liang <[email protected]>
src/util/geminiUrlScraper.js: 4 additions & 4 deletions
@@ -41,7 +41,7 @@ export default async function scrapeUrlsWithGemini(urls) {
  role: 'user',
  parts: [
  {
- text: `Please analyze the content at these URLs and extract information from each:
+ text: `Please analyze the content at these URLs and extract the exact original text content from each:
 
  ${urlList}
 
@@ -51,16 +51,16 @@ For each URL, extract and return a JSON array with objects having the following
  "url": "original URL from the list",
  "canonical": "canonical URL if different from original, or same as original",
  "title": "The main title of the page",
- "summary": "A comprehensive summary of the content that captures the key information for fact-checking purposes",
+ "summary": "The exact original text content from the page without any rewriting, summarization, or paraphrasing - preserve all claims, statements, and information as written in the source",
  "topImageUrl": "URL of the most representative image on the page, or null if none exists"
  }
  ]
 
  Requirements:
  - url: Return the exact original URL from the input list
  - canonical: Extract the canonical URL from meta tags or use the original URL if no canonical is found
- - title: Extract the main page title
- - summary: Should be detailed enough for search and fact-checking, capturing all important claimsand information
+ - title: Extract the main page title exactly as it appears
+ - summary: Extract the complete original text content from the page - DO NOT summarize, rewrite, or paraphrase. Keep all original claims, statements, facts, and information exactly as written in the source material. This text will be used for indexing and search purposes.
  - topImageUrl: Find the most representative image (not logos, ads, or decorative images), return null if no suitable image exists
  - Return valid JSON array only, no markdown code blocks or explanations
  - Process all URLs and return results for each, even if some fail`,
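The prompt above asks the model for a bare JSON array with a fixed set of fields per URL. As a rough illustration of what a caller might check before indexing the result, here is a minimal sketch. `parseScrapeResponse` and `REQUIRED_STRING_FIELDS` are hypothetical names and not part of this commit; how the real `scrapeUrlsWithGemini` handles the reply is not shown in the diff.

```javascript
// Hypothetical helper, not from the commit: validates a raw model reply
// against the JSON shape that the prompt requests.
const REQUIRED_STRING_FIELDS = ['url', 'canonical', 'title', 'summary'];

function parseScrapeResponse(rawText, urls) {
  // The prompt asks for "valid JSON array only", so parse the reply directly.
  const parsed = JSON.parse(rawText); // throws SyntaxError on invalid JSON
  if (!Array.isArray(parsed)) {
    throw new TypeError('Expected a JSON array of page objects');
  }

  // Keep only entries that carry every requested field with the right type.
  return parsed.filter((item) => {
    if (item === null || typeof item !== 'object') return false;
    const hasStrings = REQUIRED_STRING_FIELDS.every(
      (key) => typeof item[key] === 'string'
    );
    // topImageUrl may be a URL string or null, per the prompt.
    const imageOk =
      item.topImageUrl === null || typeof item.topImageUrl === 'string';
    // "url" should be one of the exact input URLs, per the requirements list.
    return hasStrings && imageOk && urls.includes(item.url);
  });
}

// Usage with a hand-written reply (assumed data, for illustration only).
const reply =
  '[{"url":"https://example.com/a","canonical":"https://example.com/a",' +
  '"title":"A","summary":"Full page text","topImageUrl":null}]';
console.log(parseScrapeResponse(reply, ['https://example.com/a']).length); // prints 1
```

Filtering out malformed entries instead of throwing is a design choice in this sketch; it mirrors the prompt's instruction to return results for each URL even if some fail, so one bad entry does not discard the rest.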