Commit ef1ca5f

claude[bot] and MrOrz committed
fix: update Gemini URL scraper prompt to extract exact content instead of summaries
- Changed prompt to explicitly request original text content without rewriting
- Added clear instructions to preserve all claims and statements as written
- This ensures content can be properly indexed for fact-checking purposes

Co-authored-by: Johnson Liang <[email protected]>
1 parent 2faeb47 commit ef1ca5f

1 file changed, +4 -4 lines changed

src/util/geminiUrlScraper.js

Lines changed: 4 additions & 4 deletions
@@ -41,7 +41,7 @@ export default async function scrapeUrlsWithGemini(urls) {
 role: 'user',
 parts: [
 {
-text: `Please analyze the content at these URLs and extract information from each:
+text: `Please analyze the content at these URLs and extract the exact original text content from each:
 
 ${urlList}
 
@@ -51,16 +51,16 @@ For each URL, extract and return a JSON array with objects having the following
 "url": "original URL from the list",
 "canonical": "canonical URL if different from original, or same as original",
 "title": "The main title of the page",
-"summary": "A comprehensive summary of the content that captures the key information for fact-checking purposes",
+"summary": "The exact original text content from the page without any rewriting, summarization, or paraphrasing - preserve all claims, statements, and information as written in the source",
 "topImageUrl": "URL of the most representative image on the page, or null if none exists"
 }
 ]
 
 Requirements:
 - url: Return the exact original URL from the input list
 - canonical: Extract the canonical URL from meta tags or use the original URL if no canonical is found
-- title: Extract the main page title
-- summary: Should be detailed enough for search and fact-checking, capturing all important claims and information
+- title: Extract the main page title exactly as it appears
+- summary: Extract the complete original text content from the page - DO NOT summarize, rewrite, or paraphrase. Keep all original claims, statements, facts, and information exactly as written in the source material. This text will be used for indexing and search purposes.
 - topImageUrl: Find the most representative image (not logos, ads, or decorative images), return null if no suitable image exists
 - Return valid JSON array only, no markdown code blocks or explanations
 - Process all URLs and return results for each, even if some fail`,
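
The prompt above asks the model for a bare JSON array with a fixed set of fields, so a caller may want to check the reply against that shape before indexing it. The following is only a sketch of one way to do that. `parseScrapeResponse` and `REQUIRED_STRING_FIELDS` are hypothetical names that do not appear in this commit, and nothing here is known about how `geminiUrlScraper.js` actually handles the reply.

```javascript
// Hypothetical helper, NOT part of this commit: validates a model reply
// against the schema described in the prompt's "Requirements" list.
const REQUIRED_STRING_FIELDS = ['url', 'canonical', 'title', 'summary'];

// Build the Markdown fence marker without writing it literally.
const FENCE = '`'.repeat(3);
const LEADING_FENCE = new RegExp('^' + FENCE + '(?:json)?\\s*', 'i');
const TRAILING_FENCE = new RegExp('\\s*' + FENCE + '$');

function parseScrapeResponse(rawText, requestedUrls) {
  // The prompt asks for "no markdown code blocks", but a reply may still be
  // wrapped in a fence, so strip one if it is present.
  const unfenced = rawText
    .trim()
    .replace(LEADING_FENCE, '')
    .replace(TRAILING_FENCE, '');

  const parsed = JSON.parse(unfenced); // throws SyntaxError on invalid JSON
  if (!Array.isArray(parsed)) {
    throw new TypeError('Expected a JSON array of page objects');
  }

  const requested = new Set(requestedUrls);
  return parsed.filter((item) => {
    if (item === null || typeof item !== 'object') return false;
    const stringsOk = REQUIRED_STRING_FIELDS.every(
      (field) => typeof item[field] === 'string'
    );
    // topImageUrl may be null when no suitable image exists.
    const imageOk =
      item.topImageUrl === null || typeof item.topImageUrl === 'string';
    // "url" must echo one of the input URLs exactly.
    return stringsOk && imageOk && requested.has(item.url);
  });
}
```

Entries that fail the check are dropped rather than repaired, which keeps the sketch short. A real caller might prefer to log them, since the prompt asks for a result for every URL even when some fail.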
