Releases: yusufkaraaslan/Skill_Seekers

v3.7.0

30 May 20:46

[3.7.0] - 2026-05-30

Theme: AI-driven project knowledge base (skill-seekers scan) — bootstrap a complete skill set for a project in one command, with safety/observability/coverage hardening throughout.

Added

  • skill-seekers scan <dir> command (#327) — point at any project; an AI agent inspects manifests, README, Dockerfile/CI, sampled source files (first 2 KB each), and the git remote, then emits one Skill Seekers config per detected framework plus a <project>-codebase.json for the project's own code. Each config stamped with metadata.detected_version so re-scans report added / version-bumped / removed dependencies. Internationalized canonical-name resolver (CJK + EU language suffixes) so detections like "Godot 引擎" resolve godot. Out-dir cache means re-scans reuse prior emissions and respect manual edits. Doctor-style report with pluralized counts and resolved / AI-generated / unresolved / archived breakdown.
  • Coverage: scan recognizes ~50 manifest types (Pipfile, environment.yml, deno.json, flake.nix, Chart.yaml, stack.yaml, deps.edn, dune-project, BUILD.bazel, …) and walks src/lib/app/cmd/crates/packages/apps/services/backend/frontend plus root-level files (catches Django, flat-layout Python, Go, Rust workspaces, JS monorepos).
  • Cost + safety flags: --max-ai-generations N (default 10) caps unbounded AI generation for monorepos; --dry-run previews what would be emitted without writing or invoking AI; --probe-urls HEAD-probes AI-generated URLs with retry-on-404; --no-fetch / --no-generate / --no-publish-prompt for offline / CI use.
  • Community submission (opt-in): freshly AI-generated configs can be submitted to the community registry via a native-async flow. Pre-checks GITHUB_TOKEN, idempotency-guards against duplicate issues, retries transient failures with backoff.
  • Archival: configs that disappear from detections are moved (not deleted) to out_dir/.archived/<UTC-timestamp>/ so the user never loses hand-edited work and out_dir stays clean.
  • Docs: new docs/getting-started/05-scan-a-project.md; entries in README, FAQ, CLI Reference, Feature Matrix, Config Format, Environment Variables, and the Quick Start cross-link.
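
The re-scan report described above (added / version-bumped / removed) can be sketched as a comparison of two slug-to-version maps. This is only an illustration: the function name diff_detected_versions and the dict shape are assumptions, not the project's actual API.

```python
def diff_detected_versions(previous: dict, current: dict) -> dict:
    """Classify dependencies between two scans (hypothetical helper).

    Both arguments map a config filename slug to the value stored in
    metadata.detected_version for that config.
    """
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    # A dependency is "version-bumped" when it appears in both scans
    # but the stamped version differs.
    bumped = sorted(
        slug
        for slug in set(previous) & set(current)
        if previous[slug] != current[slug]
    )
    return {"added": added, "version_bumped": bumped, "removed": removed}


previous = {"django": "4.2", "react": "18.2", "godot": "4.1"}
current = {"django": "5.0", "react": "18.2", "fastapi": "0.110"}
report = diff_detected_versions(previous, current)
print(report)
```

Keying both maps by a stable slug is what keeps the comparison free of phantom add/remove entries when a display name changes.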

Changed

  • CLI dispatch unified (#327) — scan and doctor now consume the parsed-args namespace directly via Command(args).execute() instead of building a second argparse.ArgumentParser. Eliminates the _reconstruct_argv hack for these commands; remaining ~14 commands flagged for migration.
  • Config schema: detected_version lives under metadata.detected_version (alongside metadata.version for the config-schema version) rather than at top level. Backwards-compatible reader; old top-level placements migrate on next stamp.
  • SourceDetector.CODE_PROJECT_MARKERS is now public (was _CODE_PROJECT_MARKERS); cross-module callers no longer reach into a private attribute.
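
A backwards-compatible reader for the relocated detected_version field might look like the sketch below. The helper names read_detected_version and migrate_detected_version are hypothetical; only the two field locations come from the notes above.

```python
from typing import Optional


def read_detected_version(config: dict) -> Optional[str]:
    """Prefer metadata.detected_version, fall back to the old top-level key."""
    metadata = config.get("metadata") or {}
    if "detected_version" in metadata:
        return metadata["detected_version"]
    return config.get("detected_version")


def migrate_detected_version(config: dict) -> dict:
    """Return a copy with any top-level detected_version moved under metadata."""
    migrated = dict(config)
    metadata = dict(migrated.get("metadata") or {})
    legacy = migrated.pop("detected_version", None)
    # Never overwrite a value that already lives in the new location.
    if "detected_version" not in metadata and legacy is not None:
        metadata["detected_version"] = legacy
    migrated["metadata"] = metadata
    return migrated


old_style = {"name": "django", "detected_version": "4.2"}
print(read_detected_version(old_style))
print(migrate_detected_version(old_style))
```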

Fixed

  • Correctness (#327) — diff layer keyed by stable filename slug instead of internal config name (eliminates phantom add/remove churn); resolve_config_path lookups now append .json so local-disk + user-dir paths actually find files; out-dir cache prevents redundant API/AI calls on re-scan; lowercase filename slugs prevent duplicate-file accumulation across runs.
  • Safety (#327) — atomic JSON writes via os.replace so SIGINT mid-write can't corrupt a config and silently flip it to "removed" on the next scan; _safe_size guards stat() so a broken symlink in src/ no longer crashes the scan; AgentClient.call exceptions caught and logged; AI-generated config names rejected if they fail the registry regex; URL probe catches AI hallucinations of base_url before writing.
  • Observability (#327) — logging.basicConfig in scan so logger.warning/error reaches the user (was silently dropped); non-zero exit code when no configs and no codebase config were emitted, so CI pipelines detect total-failure scans.
  • Publish flow (#327) — native async (asyncio.run at single entry, asyncio.to_thread for input()); pre-check GITHUB_TOKEN with actionable hint instead of asking N "yes/no" questions and failing N times; idempotency check (search existing open issues) prevents duplicate submissions; retry with backoff on transient failures; nested-event-loop detection with clear message instead of opaque traceback.
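
The atomic-write pattern mentioned under Safety can be sketched with the standard library alone. The helper name atomic_write_json is an assumption for illustration; the technique is to write to a temporary file in the same directory and then swap it into place with os.replace.

```python
import json
import os
import tempfile


def atomic_write_json(path: str, data: dict) -> None:
    """Write JSON so readers see either the old file or the new one, never half."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for os.replace to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Covers KeyboardInterrupt (SIGINT) too: drop the partial temp file.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the process is interrupted before os.replace runs, the previous config file is left untouched, so a later scan does not misread a truncated file.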

v3.6.0

03 May 10:54

[3.6.0] - 2026-05-03

Theme: Quality-of-life release — packaging targets, GitHub issue workflow, codebase analysis fixes, and source detection hardening.

Added

  • IBM Bob packaging target — new --target bob adaptor and agent install support for IBM's Bob agent platform (#366)
  • GitHub issue filtering: --github-issue-state, --github-issue-labels, and --github-issue-since filters in the GitHub scraper for narrowing which issues are pulled (#367)
  • Per-issue files — GitHub scraper now writes one Markdown file per issue instead of a single bundle, improving navigation and downstream chunking (#367)
  • Pinecone frontmatter — Pinecone vector exports now include consistent YAML frontmatter for metadata round-tripping (#367)

Fixed

  • Unified scraper now generates codebase_analysis/ index — local sources were producing C3.x outputs with broken SKILL.md links; the unified skill builder now wires up the index and resolves links correctly (#362, #376)
  • Guides fallback fires correctly: unified_skill_builder was emitting a truthy placeholder for empty guides, which suppressed the fallback content; placeholder removed (#364, #375)
  • HTML URLs no longer treated as local files: source_detector now checks for http(s):// before falling through to the local-path branch, fixing false-positive routing (#373)
  • PDF extracted images appear in markdown: pdf_scraper now inserts ![](…) references for images extracted from PDFs so they render in the generated SKILL.md (#369)
  • C3.x output for local sources: the unified command was skipping the C3.x analysis pipeline for local codebase sources; it now emits the full pattern/test/guide/config/router output (#363, #372)
  • Language filter passed to C3.x clone analysis — repos cloned for analysis now respect --languages instead of analyzing every file (fixes #361, #370)
  • Unity vs Unreal detection — Unity projects with C# imports were being misidentified as Unreal; detection now keys on C# import patterns (fixes #365, #368)
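
The ordering fix for HTML URLs above comes down to testing for a URL scheme before looking at the file extension. A minimal sketch, assuming a hypothetical classify_source function and return labels borrowed from the config types used elsewhere in these notes:

```python
import re
from pathlib import Path

# Match an http:// or https:// scheme at the very start of the string.
_URL_PREFIX = re.compile(r"^https?://", re.IGNORECASE)


def classify_source(source: str) -> str:
    """Route a source string; the URL check must run before extension checks."""
    if _URL_PREFIX.match(source):
        return "documentation"
    suffix = Path(source).suffix.lower()
    if suffix in {".html", ".htm"}:
        return "html"
    return "local"


print(classify_source("https://example.com/docs/index.html"))
print(classify_source("./site/index.html"))
```

Without the first branch, a URL ending in .html would fall through to the local-file route because its suffix matches.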

v3.5.1

12 Apr 19:00

[3.5.1] - 2026-04-12

Added

  • Centralized defaults.json config — single source of truth for all default values (rate_limit, max_pages, workers, async_mode, enhancement, analysis, RAG settings). New defaults.py loader module. All 15+ files that previously hardcoded defaults now read from this file (#356)
  • Low-signal code snippet filtering: _is_low_signal_code_snippet() filters junk patterns like bare True, options, single identifiers from quick references (#360)
  • Pattern description normalization: _normalize_pattern_description() cleans boilerplate prefixes and truncates to first meaningful sentence (#360)
  • Example language priority ranking: _example_language_priority() ranks Python > Bash > JSON > etc. for SKILL.md examples (#360)
  • checkpoint_exists() method on DocToSkillConverter — was called but never defined (#360)
  • Unified config source normalization: DocToSkillConverter.__init__ merges fields from sources[0] into flat config for compatibility (#360)
  • display_name support in SKILL.md generation — produces cleaner titles and slugs (#360)
  • New tests: test_doc_scraper_entrypoint.py (regression for _run_scraping), quick-reference quality tests, docs-only compatibility tests, nested reference coverage tests (#360)

Changed

  • max_pages default is now unlimited (-1) — the scraper fetches all pages unless the user explicitly sets --max-pages. Previously defaulted to 500 (#356)
  • --no-rate-limit flag now works — was defined in CLI arguments but never consumed by ExecutionContext (#356)
  • constants.py reads from defaults.json — no longer contains hardcoded magic numbers (#356)
  • ExecutionContext.ScrapingSettings: rate_limit and max_pages now use real defaults instead of None, preventing None-poisoning downstream (#356)
  • SKILL.md frontmatter cleanup — empty doc_version: and version: fields are now omitted; placeholder sections removed (#360)
  • Enhancement routing through platform adaptors instead of importing nonexistent enhance_skill_md helper (#360)
  • quality_metrics.py uses rglob for nested reference directories in unified skills (#360)
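
With max_pages now defaulting to -1 for "unlimited", any loop condition that compares a count against max_pages needs an explicit guard, because a count is never less than -1. A minimal sketch (the names UNLIMITED and should_continue are illustrative, not taken from the codebase):

```python
UNLIMITED = -1  # sentinel meaning "no page cap"


def should_continue(discovered_count: int, max_pages: int) -> bool:
    """Return True while the crawler may keep discovering pages."""
    if max_pages == UNLIMITED:
        return True
    return discovered_count < max_pages


# The naive comparison fails immediately in unlimited mode:
print(0 < UNLIMITED)                  # False, so the loop body would never run
print(should_continue(0, UNLIMITED))  # True
print(should_continue(500, 500))      # False
```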

Fixed

  • TypeError: '>' not supported between instances of 'NoneType' and 'int': rate_limit defaulted to None in ExecutionContext, which flowed through config.get("rate_limit", DEFAULT) (dict.get returns None when the key exists with value None, ignoring the fallback). Fixed in doc_scraper.py (sync + async paths), estimate_pages.py, and sync_config.py (#356, #359)
  • discover_urls() loop never executed with unlimited max_pages: len(discovered) < -1 is always False. Added unlimited mode guard (#356)
  • converter.scrape() called nonexistent method in _run_scraping() — changed to converter.scrape_all() (#360)
  • None-safety for BeautifulSoup attributes: link["href"], sitemap.text, meta_desc["content"] guarded against None XML text nodes (#360)
  • Python 3.10 compatibility — backslash in f-string in quality_metrics.py not supported before 3.12 (#360)
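
The dict.get behaviour behind the TypeError fix above is easy to reproduce. In this sketch, DEFAULT_RATE_LIMIT and get_or_default are assumed names chosen for illustration:

```python
DEFAULT_RATE_LIMIT = 0.5  # assumed default value, for illustration only

config = {"rate_limit": None}

# The key exists, so dict.get returns its value (None) and ignores the fallback.
poisoned = config.get("rate_limit", DEFAULT_RATE_LIMIT)
print(poisoned)


def get_or_default(mapping: dict, key: str, default):
    """Treat a stored None the same as a missing key."""
    value = mapping.get(key)
    return default if value is None else value


safe = get_or_default(config, "rate_limit", DEFAULT_RATE_LIMIT)
print(safe)
```

Checking `is None` rather than truthiness keeps a legitimate value of 0 (no delay) from being replaced by the default.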

v3.5.0

11 Apr 13:00

[3.5.0] - 2026-04-09

Theme: Grand Unification — one command, one interface, direct converters. Agent-agnostic architecture, marketplace pipeline, smart SPA discovery, all content extraction enabled by default. 80+ files changed across the codebase.

Added

  • Grand Unification — unified create command as single entry point for all 18 source types with auto-detection, direct converter invocation, and centralized enhancement (#346)
  • Agent-agnostic AgentClient abstraction — all 5 enhancers now support Claude, Kimi, Codex, Copilot, OpenCode, and custom agents via a unified interface. Auto-detects agent from API keys instead of hardcoding (#336)
  • Kimi CLI integration with stdin piping and output parsing (#336)
  • MarketplacePublisher — publish skills to Claude Code plugin marketplace repos (#336)
  • MarketplaceManager — register and manage marketplace repositories (#336)
  • ConfigPublisher — push configs to registered config source repos (#336)
  • push_config MCP tool for automated config publishing (#336)
  • Smart SPA discovery engine — three-layer discovery: sitemap.xml, llms.txt, SPA nav rendering (#336)
  • "browser": true config support for JavaScript SPA sites with browser renderer timeout defaults (60s, domcontentloaded) (#336)
  • Dynamic routing via _build_argv() — replaced manual arg forwarding with dynamic forwarder, added 7 missing CLI flags (#336)
  • Kotlin language support for codebase analysis — Full C3.x pipeline support: AST parsing (classes, objects, functions, data/sealed classes, extension functions, coroutines), dependency extraction, design pattern recognition (object declaration→Singleton, companion object→Factory, sealed class→Strategy), test example extraction (JUnit, Kotest, MockK, Spek), language detection patterns, config detection (build.gradle.kts), and extension maps across all analyzers (#287)
  • Headless browser rendering (--browser flag) — uses Playwright to render JavaScript SPA sites (React, Vue, etc.) that return empty HTML shells. Auto-installs Chromium on first use. Optional dep: pip install "skill-seekers[browser]" (#321)
  • skill-seekers doctor command — 8 diagnostic checks (Python version, package install, git, core/optional deps, API keys, MCP server, output dir) with pass/warn/fail status and --verbose flag (#316)
  • Prompt injection check workflow — bundled prompt-injection-check workflow scans scraped content for injection patterns (role assumption, instruction overrides, delimiter injection, hidden instructions). Added as first stage in default and security-focus workflows. Flags suspicious content without removing it (#324)
  • Codex CLI plugin manifest (.codex-plugin/plugin.json) for OpenAI Codex integration (#350)
  • 6 behavioral UML diagrams — 3 sequence (create pipeline, GitHub+C3.x flow, MCP invocation), 2 activity (source detection, enhancement pipeline), 1 component (runtime dependencies with interface contracts)
  • 134 new tests: test_agent_client.py, test_config_publisher.py, _build_argv tests. Total: 3194 passed, 39 expected skips (#336)

Changed

  • All content extraction features enabled by default — pattern detection, test examples, how-to guides, config extraction, and router generation no longer require explicit opt-in
  • Renamed claude-enhanced merge mode to ai-enhanced — backward compatibility alias kept (#336)
  • Removed 118+ hardcoded Claude references across 60+ files (#336)
  • Refactored 5 enhancers to use AgentClient abstraction (#336)
  • Removed 50-file GitHub API analysis limit (#336)
  • Removed 100-file config extraction limit (#336)
  • Fixed unified scraper default max_pages from 100 to 500 (#336)
  • Centralized enhancement timeouts to 45min default with unlimited support (#336)
  • Excluded slow MCP/e2e tests from CI coverage step to prevent timeout

Fixed

  • glob('*.md') replaced with rglob('*.md') in all adaptors — fixes packaging when skills are in nested directories (#349)
  • scraped_data list-vs-dict bug in conflict detection (#336)
  • base_url passthrough to doc scraper subprocess (#336)
  • URL filtering now uses base directory correctly (#336)
  • C3.x analysis data loss (#336)
  • --enhance-level flag not passed correctly (#336)
  • guide_enhancer method rename: _call_claude_api renamed to _call_ai (#336)
  • 11 pre-existing test failures fixed (#336)
  • Per-file language detection in GitHub scraper (#336)
  • GitHub language detection crashes with TypeError when API response contains non-integer metadata keys (e.g., "url") — now filters to integer values only (#322)
  • C3.x codebase analysis crashes with TypeError: _run_c3_analysis() and _analyze_c3x() passed removed enhance_with_ai/ai_mode kwargs to analyze_codebase() instead of enhance_level (#323)
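
The glob-to-rglob change listed above matters because Path.glob("*.md") only matches files directly inside the directory, while Path.rglob("*.md") descends into subdirectories. The directory layout below is made up for the demonstration:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    skill_dir = Path(tmp)
    (skill_dir / "SKILL.md").write_text("# Skill\n", encoding="utf-8")
    nested = skill_dir / "references" / "github"
    nested.mkdir(parents=True)
    (nested / "issues.md").write_text("# Issues\n", encoding="utf-8")

    # Non-recursive: only files at the top level of skill_dir.
    top_level = sorted(
        p.relative_to(skill_dir).as_posix() for p in skill_dir.glob("*.md")
    )
    # Recursive: also picks up files in nested reference directories.
    recursive = sorted(
        p.relative_to(skill_dir).as_posix() for p in skill_dir.rglob("*.md")
    )

print(top_level)   # ['SKILL.md']
print(recursive)   # ['SKILL.md', 'references/github/issues.md']
```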

Security

  • Removed command injection via cloned repo script execution (#336)
  • Replaced git add -A with targeted staging in marketplace publisher (#336)
  • Clear auth tokens from cached .git/config after clone (#336)
  • Use defusedxml for sitemap XML parsing (XXE protection) (#336)
  • Path traversal validation for config names (#336)
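
Path traversal validation for config names can be approached in two layers: reject names with unexpected characters, then confirm the resolved path stays inside the config directory. The pattern, function names, and .json suffix handling below are assumptions for illustration and may differ from what the project does.

```python
import re
from pathlib import Path

# Assumed allow-list: lowercase letters, digits, underscores, hyphens.
_SAFE_NAME = re.compile(r"[a-z0-9][a-z0-9_-]*")


def is_safe_config_name(name: str) -> bool:
    """Reject names containing path separators, dots, or other odd characters."""
    return bool(_SAFE_NAME.fullmatch(name))


def safe_config_path(config_dir: Path, name: str) -> Path:
    """Build <config_dir>/<name>.json, refusing anything that escapes config_dir."""
    if not is_safe_config_name(name):
        raise ValueError(f"invalid config name: {name!r}")
    root = config_dir.resolve()
    candidate = (root / f"{name}.json").resolve()
    # Second layer: the resolved file must still sit under the config directory.
    if root not in candidate.parents:
        raise ValueError(f"config path escapes {root}")
    return candidate


print(is_safe_config_name("react"))             # True
print(is_safe_config_name("../../etc/passwd"))  # False
```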

v3.4.0 — 12 LLM Platforms, SPA Detection, UML Architecture

25 Mar 19:21

What's New in v3.4.0

Theme: 8 new LLM platform adaptors (12 total), 7 new CLI agent paths (18 total), OpenCode skill tools, SPA site detection, 8 bug fixes, and full UML architecture documentation.

Platform Expansion: 5 → 12 LLM Targets

New Platform | Flag | Base
OpenCode | --target opencode | Directory-based, dual YAML
Kimi | --target kimi | OpenAI-compatible
DeepSeek | --target deepseek | OpenAI-compatible
Qwen | --target qwen | OpenAI-compatible
OpenRouter | --target openrouter | OpenAI-compatible
Together AI | --target together | OpenAI-compatible
Fireworks AI | --target fireworks | OpenAI-compatible

All new platforms inherit from a shared OpenAI-compatible base class for consistent behavior.

Agent Expansion: 11 → 18 Install Paths

New agents: roo, cline, aider, bolt, kilo, continue, kimi-code

OpenCode Skill Tools

  • Skill splitter — auto-split large docs into focused sub-skills with router
  • Bi-directional converter — import/export between OpenCode and any platform format

Distribution

  • Smithery manifest (smithery.yaml)
  • GitHub Actions template for automated skill updates
  • Claude Code Plugin with slash commands

Bug Fixes

  • sanitize_url() crash on Python 3.14 strict urlparse (#284)
  • Blind /index.html.md append breaking non-Docusaurus sites (#277)
  • Unified scraper temp config format (#317)
  • Unicode arrows breaking Windows cp1252 terminals
  • CLI flags in plugin slash commands
  • MiniMax adaptor improvements (#319)
  • Misleading "Scraped N pages" count — now shows (N saved, M skipped) (#320)
  • SPA site detection — warns when site requires JavaScript rendering (#320, #321)

Documentation

  • Full UML architecture — 14 class diagrams synced from source code via StarUML
  • StarUML HTML API reference export
  • Ecosystem section linking all Skill Seekers repos
  • Architecture references in README and CONTRIBUTING
  • Consolidated Docs/ into docs/

Test Results

2929 passed, 39 skipped, 0 failures

Install / Upgrade

pip install --upgrade skill-seekers

Full changelog: https://github.com/yusufkaraaslan/Skill_Seekers/blob/main/CHANGELOG.md

v3.3.0

15 Mar 22:27

[3.3.0] - 2026-03-16

Theme: 10 new source types (17 total), EPUB unified integration, sync-config command, performance optimizations, 12 README translations, and 19 bug fixes. 117 files changed, +41,588 lines since v3.2.0.

Supported Source Types (17)

# | Type | CLI Command | Config Type | Auto-Detection
1 | Documentation (web) | scrape / create <url> | documentation | HTTP/HTTPS URLs
2 | GitHub repository | github / create owner/repo | github | owner/repo or github.com URLs
3 | PDF document | pdf / create file.pdf | pdf | .pdf extension
4 | Word document | word / create file.docx | word | .docx extension
5 | EPUB e-book | epub / create file.epub | epub | .epub extension
6 | Video | video / create <url/file> | video | YouTube/Vimeo URLs, video extensions
7 | Local codebase | analyze / create ./path | local | Directory paths
8 | Jupyter Notebook | jupyter / create file.ipynb | jupyter | .ipynb extension
9 | Local HTML | html / create file.html | html | .html/.htm extensions
10 | OpenAPI/Swagger | openapi / create spec.yaml | openapi | .yaml/.yml with OpenAPI content
11 | AsciiDoc | asciidoc / create file.adoc | asciidoc | .adoc/.asciidoc extensions
12 | PowerPoint | pptx / create file.pptx | pptx | .pptx extension
13 | RSS/Atom feed | rss / create feed.rss | rss | .rss/.atom extensions
14 | Man pages | manpage / create cmd.1 | manpage | .1-.8/.man extensions
15 | Confluence wiki | confluence | confluence | API or export directory
16 | Notion pages | notion | notion | API or export directory
17 | Slack/Discord chat | chat | chat | Export directory or API

Added

10 New Skill Source Types (17 total)

Skill Seekers now supports 17 source types — up from 7. Every new type is fully integrated into the CLI (skill-seekers <type>), create command auto-detection, unified multi-source configs, config validation, the MCP server, and the skill builder.

  • Jupyter Notebook: skill-seekers jupyter --notebook file.ipynb or skill-seekers create file.ipynb

    • Extracts markdown cells, code cells with outputs, kernel metadata, imports, and language detection
    • Handles single files and directories of notebooks; filters .ipynb_checkpoints
    • Optional dependency: pip install "skill-seekers[jupyter]" (nbformat)
    • Entry point: skill-seekers-jupyter
  • Local HTML: skill-seekers html --html-path file.html or skill-seekers create file.html

    • Parses HTML using BeautifulSoup with smart main content detection (<article>, <main>, .content, largest div)
    • Extracts headings, code blocks, tables (to markdown), images, links; converts inline HTML to markdown
    • Handles single files and directories; supports .html, .htm, .xhtml extensions
    • No extra dependencies (BeautifulSoup is a core dep)
  • OpenAPI/Swagger: skill-seekers openapi --spec spec.yaml or skill-seekers create spec.yaml

    • Parses OpenAPI 3.0/3.1 and Swagger 2.0 specs from YAML or JSON (local files or URLs via --spec-url)
    • Extracts endpoints, parameters, request/response schemas, security schemes, tags
    • Resolves $ref references with circular reference protection; handles allOf/oneOf/anyOf
    • Groups endpoints by tags; generates comprehensive API reference markdown
    • Source detection sniffs YAML file content for openapi: or swagger: keys (avoids false positives on non-API YAML files)
    • Optional dependency: pip install "skill-seekers[openapi]" (pyyaml — already a core dep, guard added for safety)
  • AsciiDoc: skill-seekers asciidoc --asciidoc-path file.adoc or skill-seekers create file.adoc

    • Regex-based parser (no external library required) with optional asciidoc library support
    • Extracts headings (= through =====), [source,lang] code blocks, |=== tables, admonitions (NOTE/TIP/WARNING/IMPORTANT/CAUTION), and include:: directives
    • Converts AsciiDoc formatting to markdown; handles single files and directories
    • Optional dependency: pip install "skill-seekers[asciidoc]" (asciidoc library for advanced rendering)
  • PowerPoint (.pptx): skill-seekers pptx --pptx file.pptx or skill-seekers create file.pptx

    • Extracts slide text, speaker notes, tables, images (with alt text), and grouped shapes
    • Detects code blocks by monospace font analysis (30+ font families)
    • Groups slides into sections by layout type; handles single files and directories
    • Optional dependency: pip install "skill-seekers[pptx]" (python-pptx)
  • RSS/Atom Feeds: skill-seekers rss --feed-url <url> / --feed-path file.rss or skill-seekers create feed.rss

    • Parses RSS 2.0, RSS 1.0, and Atom feeds via feedparser
    • Optionally follows article links (--follow-links, default on) to scrape full page content using BeautifulSoup
    • Extracts article titles, summaries, authors, dates, categories; configurable --max-articles (default 50)
    • Source detection matches .rss and .atom extensions (.xml excluded to avoid false positives)
    • Optional dependency: pip install "skill-seekers[rss]" (feedparser)
  • Man Pages: skill-seekers manpage --man-names git,curl / --man-path dir/ or skill-seekers create git.1

    • Extracts man pages by running the man command via subprocess or reading .1-.8/.man files directly
    • Handles gzip/bzip2/xz compressed man files; strips troff/groff formatting (backspace overstriking, macros, font escapes)
    • Parses structured sections (NAME, SYNOPSIS, DESCRIPTION, OPTIONS, EXAMPLES, SEE ALSO)
    • Source detection uses basename heuristic to avoid false positives on log rotation files (e.g., access.log.1)
    • No external dependencies (stdlib only)
  • Confluence: skill-seekers confluence --base-url <url> --space-key <key> or --export-path dir/

    • API mode: fetches pages from Confluence REST API with pagination (atlassian-python-api)
    • Export mode: parses Confluence HTML/XML export directories
    • Extracts page content, code/panel/info/warning macros, page hierarchy, tables
    • Optional dependency: pip install "skill-seekers[confluence]" (atlassian-python-api)
  • Notion: skill-seekers notion --database-id <id> / --page-id <id> or --export-path dir/

    • API mode: fetches pages via Notion API with support for 20+ block types (paragraph, heading, code, callout, toggle, table, etc.)
    • Export mode: parses Notion Markdown/CSV export directories
    • Extracts rich text with annotations (bold, italic, code, links), 16+ property types for database entries
    • Optional dependency: pip install "skill-seekers[notion]" (notion-client)
  • Slack/Discord Chat: skill-seekers chat --export-path dir/ or --token <token> --channel <channel>

    • Slack: parses workspace JSON exports or fetches via Slack Web API (slack_sdk)
    • Discord: parses DiscordChatExporter JSON or fetches via Discord HTTP API
    • Extracts messages, code snippets (fenced blocks), shared URLs, threads, reactions, attachments
    • Generates per-channel summaries and topic categorization
    • Optional dependency: pip install "skill-seekers[chat]" (slack-sdk)
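
The content sniffing used for OpenAPI/Swagger detection (look for an openapi: or swagger: key near the top of a YAML file) can be approximated as follows. The function name looks_like_openapi, the 20-line window parameter, and the choice to match only top-level keys at the start of a line are assumptions for this sketch.

```python
import re
from itertools import islice

# A top-level "openapi:" or "swagger:" key at the start of a line.
_OPENAPI_KEY = re.compile(r"^(?:openapi|swagger)\s*:")


def looks_like_openapi(text: str, max_lines: int = 20) -> bool:
    """Return True if one of the first max_lines lines declares an OpenAPI key."""
    for line in islice(text.splitlines(), max_lines):
        if _OPENAPI_KEY.match(line):
            return True
    return False


spec = "openapi: 3.0.3\ninfo:\n  title: Pets\n"
compose = "version: '3'\nservices:\n  web:\n    image: nginx\n"
print(looks_like_openapi(spec))     # True
print(looks_like_openapi(compose))  # False
```

Reading only a small prefix keeps detection cheap and avoids classifying docker-compose or Kubernetes manifests as API specs.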

EPUB Unified Pipeline Integration

  • EPUB (.epub) input support via skill-seekers create book.epub or skill-seekers epub --epub book.epub
    • Extracts chapters, metadata (Dublin Core), code blocks, images, and tables from EPUB 2 and EPUB 3 files
    • DRM detection with clear error messages (Adobe ADEPT, Apple FairPlay, Readium LCP)
    • Font obfuscation correctly identified as non-DRM
    • EPUB 3 TOC bug workaround (ignore_ncx option)
    • --help-epub flag for EPUB-specific help
    • Optional dependency: pip install "skill-seekers[epub]" (ebooklib)
    • 107 tests across 14 test classes
  • EPUB added to unified scraper: _scrape_epub() method, scraped_data["epub"], config validation (_validate_epub_source), and dry-run display. Previously EPUB worked standalone but was missing from multi-source configs.

Unified Skill Builder — Generic Merge System

  • _generic_merge() — Priority-based section merge for any combination of source types not covered by existing pairwise synthesis (docs+github, docs+pdf, etc.). Produces YAML frontmatter + source-attributed sections.
  • _append_extra_sources() — Appends additional source type content (e.g., Jupyter + PPTX) to pairwise-synthesized SKILL.md.
  • _generate_generic_references() — Generates references/<type>/index.md for any source type, with ID resolution fallback chain.
  • _SOURCE_LABELS dict — Human-readable labels for all 17 source types used in merge attribution.

Config Validator Expansion

  • 17 source types in VALID_SOURCE_TYPES — All new types plus word and video now have per-type validation methods.
  • _validate_word_source() — Validates path field for Word documents (was previously missing).
  • _validate_video_source() — Validates url, path, or playlist field for video sources (was previously missing).
  • 11 new _validate_*_source() methods — One for each new type with appropriate required-field checks.

Source Detection Improvements

  • 7 new file extension detections in SourceDetector.detect(): .ipynb, .html/.htm, .pptx, .adoc/.asciidoc, .rss/.atom, .1-.8/.man, .yaml/.yml (with content sniffing)
  • _looks_like_openapi() — Content sniffing for YAML files: only classifies as OpenAPI if the file contains openapi: or swagger: key in first 20 lines (prevents false positives on docker-compose, Ansible, Kubernetes manifests, etc.)
  • Man page basename heuristic: .1-.8 extensions are only detected as man pages if the basename has no dots (e.g., git.1 matches but access.log.1 does not)
  • .xml excluded from RSS detection — Too generic; only...
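
The man page basename heuristic above can be sketched with a single regular expression. The function name is hypothetical, and accepting every .man file unconditionally is an assumption of this sketch, not something the notes confirm.

```python
import os
import re

# A section suffix from .1 to .8 at the end of the file name.
_MAN_SECTION = re.compile(r"\.[1-8]$")


def looks_like_man_page(path: str) -> bool:
    """git.1 qualifies; access.log.1 does not, because its stem contains a dot."""
    filename = os.path.basename(path)
    if filename.endswith(".man"):
        return True  # assumption: .man files are accepted as-is
    match = _MAN_SECTION.search(filename)
    if match is None:
        return False
    stem = filename[: match.start()]
    return bool(stem) and "." not in stem


print(looks_like_man_page("git.1"))                 # True
print(looks_like_man_page("/var/log/access.log.1")) # False
```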

v3.2.0 — Video Extraction, Word Support, Pinecone Adaptor

02 Mar 09:44

Theme: Video source support, Word document support, Pinecone adaptor, and quality improvements. 94 files changed, +23,500 lines since v3.1.3. 2,540 tests passing.

🎬 Video Extraction Pipeline

Complete video extraction system that converts YouTube videos and local video files into AI-consumable skills.

  • skill-seekers video --url <youtube-url> — New CLI command for video scraping
  • skill-seekers create <youtube-url> — Auto-detects YouTube URLs
  • Transcript extraction — 3-tier fallback: YouTube API → yt-dlp → faster-whisper
  • Visual OCR — Multi-engine ensemble (EasyOCR + pytesseract) for code frames
  • Panel detection — Splits IDE screenshots into independent sub-sections
  • Code timeline — Tracks code evolution across frames with edit history
  • Two-pass AI enhancement — Cleans OCR noise using transcript context
  • GPU auto-detection: skill-seekers video --setup detects CUDA/ROCm/CPU and installs correct PyTorch
  • 197 tests covering models, metadata, transcript, visual, OCR, and CLI
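
The 3-tier transcript fallback listed above (YouTube API, then yt-dlp, then faster-whisper) follows a generic "first tier that succeeds wins" pattern. The sketch below uses stub functions in place of the real extractors; all names here are hypothetical and none of the actual libraries are called.

```python
from typing import Callable, Optional, Sequence, Tuple

Extractor = Callable[[str], Optional[str]]


def first_successful(video: str, tiers: Sequence[Tuple[str, Extractor]]) -> Tuple[str, str]:
    """Try each tier in order; return (tier_name, transcript) from the first hit."""
    errors = []
    for name, extract in tiers:
        try:
            text = extract(video)
        except Exception as exc:  # a failing tier must not abort the chain
            errors.append(f"{name}: {exc}")
            continue
        if text:
            return name, text
        errors.append(f"{name}: empty transcript")
    raise RuntimeError("all transcript tiers failed: " + "; ".join(errors))


# Stubs standing in for the real extractors.
def via_youtube_api(video: str) -> Optional[str]:
    raise LookupError("no captions published")


def via_yt_dlp(video: str) -> Optional[str]:
    return None  # nothing downloadable


def via_faster_whisper(video: str) -> Optional[str]:
    return "hello world"


tiers = [
    ("youtube-api", via_youtube_api),
    ("yt-dlp", via_yt_dlp),
    ("faster-whisper", via_faster_whisper),
]
result = first_successful("demo-video", tiers)
print(result)  # ('faster-whisper', 'hello world')
```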

📄 Word Document (.docx) Support

  • skill-seekers word --docx <file> — Full pipeline: mammoth → HTML → sections → SKILL.md
  • skill-seekers create document.docx — Auto-detects .docx files
  • Smart code detection — Identifies monospace paragraphs as code blocks
  • Install: pip install skill-seekers[docx]

🌲 Pinecone Vector Database Adaptor

  • skill-seekers package output/ --format pinecone --upload — Direct Pinecone upload
  • Full CRUD operations with namespace support
  • OpenAI and Sentence Transformers embedding support
  • Batch upsert with configurable batch sizes
  • 764 tests for comprehensive coverage

🐛 Bug Fixes

  • 6 OCR quality fixes — Skip webcam frames, clean IDE decorations, fix duplicate lines, filter UI junk
  • 15 video pipeline fixes — Timeout handling, MCP integration, filename collisions, dependency management
  • Issue #300 — Selector fallback & dry-run link discovery (ReactFlow found 20+ pages, was 1)
  • Issue #301: setup.sh macOS fix
  • RAG chunking crash — Fixed AttributeError: output_dir
  • Chunk overlap auto-scaling — Scales to max(50, chunk_tokens // 10)
  • Reference file limits removed — No more caps on GitHub issues, releases, or code blocks
  • See CHANGELOG.md for full details
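
The chunk overlap auto-scaling rule quoted above, max(50, chunk_tokens // 10), works out as follows. The wrapper function name is an assumption; the formula is taken from the release note.

```python
def auto_chunk_overlap(chunk_tokens: int) -> int:
    """Overlap is 10% of the chunk size (integer division), never below 50 tokens."""
    return max(50, chunk_tokens // 10)


for size in (200, 512, 2000):
    print(size, auto_chunk_overlap(size))
# 200 50
# 512 51
# 2000 200
```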

📦 Install / Upgrade

pip install --upgrade skill-seekers

# With video support
pip install skill-seekers[video]
skill-seekers video --setup  # Auto-detect GPU, install deps

# With Word support
pip install skill-seekers[docx]

# With Pinecone
pip install skill-seekers[pinecone]

# Everything
pip install skill-seekers[all]

Full Changelog: https://github.com/yusufkaraaslan/Skill_Seekers/blob/main/CHANGELOG.md

v3.1.3

24 Feb 19:57

Choose a tag to compare

[3.1.3] - 2026-02-24

🐛 Hotfix — Explicit Chunk Flags & Argument Pipeline Cleanup

Fixed

  • Issue #299 (skill-seekers package --target claude unrecognised argument crash): _reconstruct_argv() in main.py emits default flag values back into argv when routing subcommands. package_skill.py had a 105-line inline argparser that used different flag names from those in arguments/package.py, so forwarded flags were rejected. Fixed by replacing the inline block with a call to add_package_arguments(parser), the single source of truth.
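
The "single source of truth" fix can be illustrated with argparse: one function registers the flags, and every parser that needs them calls it. The body of add_package_arguments below is a simplified stand-in with assumed defaults, not the project's real argument list.

```python
import argparse


def add_package_arguments(parser: argparse.ArgumentParser) -> None:
    """Register package flags in exactly one place (simplified stand-in)."""
    parser.add_argument("skill_dir")
    parser.add_argument("--target", default="claude")          # default assumed
    parser.add_argument("--chunk-tokens", type=int, default=512)  # default assumed


def build_main_parser() -> argparse.ArgumentParser:
    """Top-level CLI that routes to a 'package' subcommand."""
    parser = argparse.ArgumentParser(prog="skill-seekers")
    subparsers = parser.add_subparsers(dest="command")
    add_package_arguments(subparsers.add_parser("package"))
    return parser


def build_standalone_parser() -> argparse.ArgumentParser:
    """Parser used when the packaging module is invoked directly."""
    parser = argparse.ArgumentParser(prog="package_skill")
    add_package_arguments(parser)
    return parser


argv = ["output/react", "--target", "claude", "--chunk-tokens", "256"]
routed = build_main_parser().parse_args(["package", *argv])
direct = build_standalone_parser().parse_args(argv)
print(routed.target, direct.target)
```

Because both parsers share one definition, a flag forwarded by the dispatcher cannot be rejected by the subcommand for having a different name.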

Changed

  • package_skill.py argparser refactored — Replaced ~105 lines of inline argparse duplication with a single add_package_arguments(parser) call. Flag names are now guaranteed consistent with _reconstruct_argv() output, preventing future argument-name drift.
  • Explicit chunk flag names — All --chunk-* flags now include unit suffixes to eliminate ambiguity between RAG tokens and streaming characters:
    • --chunk-size (RAG tokens) → --chunk-tokens
    • --chunk-overlap (RAG tokens) → --chunk-overlap-tokens
    • --chunk (enable RAG chunking) → --chunk-for-rag
    • --streaming-chunk-size (chars) → --streaming-chunk-chars
    • --streaming-overlap (chars) → --streaming-overlap-chars
    • --chunk-size in PDF extractor (pages) → --pdf-pages-per-chunk
  • setup_logging() centralized — Added setup_logging(verbose, quiet) to utils.py and removed 4 duplicate module-level logging.basicConfig() calls from doc_scraper.py, github_scraper.py, codebase_scraper.py, and unified_scraper.py
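
A centralized setup_logging(verbose, quiet) helper could look like the sketch below. Only the function name and its two parameters come from the notes; the level mapping, the precedence of quiet over verbose, the format string, and the returned level are assumptions.

```python
import logging


def setup_logging(verbose: bool = False, quiet: bool = False) -> int:
    """Configure the root logger once and return the chosen level."""
    if quiet:
        level = logging.WARNING
    elif verbose:
        level = logging.DEBUG
    else:
        level = logging.INFO
    # force=True replaces any handlers installed by earlier basicConfig calls.
    logging.basicConfig(level=level, format="%(levelname)s: %(message)s", force=True)
    return level


setup_logging(verbose=True)
logging.getLogger("doc_scraper").debug("visible because verbose=True")
```

Calling this once from the CLI entry point avoids each scraper module configuring the root logger on import.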

v3.1.2 — Gemini Fix & Enhance Dispatcher

24 Feb 04:09
90e5e8f

What's Changed

🐛 Critical Bug Fixes

Gemini enhancement 404 errors — The gemini-2.0-flash-exp model was retired by Google, causing all Gemini enhancement requests to fail with 404. Replaced with gemini-2.5-flash (stable GA).

skill-seekers enhance auto-detection — The documented behaviour of automatically using API mode when an API key is present was never implemented. This release fixes it:

  • ANTHROPIC_API_KEY set → Claude API mode
  • GOOGLE_API_KEY set → Gemini API mode
  • OPENAI_API_KEY set → OpenAI API mode
  • No key → LOCAL mode (Claude Code Max, free)

Use --mode LOCAL to force local mode even when API keys are present.
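
The key-based auto-detection above maps naturally onto an ordered lookup over environment variables. In this sketch the function name, the returned mode strings, and the precedence when several keys are set at once are assumptions; the key names and the LOCAL fallback come from the list above.

```python
import os
from typing import Mapping, Optional

# Checked in order; the first key that is set decides the mode (order assumed).
_KEY_TO_MODE = (
    ("ANTHROPIC_API_KEY", "claude"),
    ("GOOGLE_API_KEY", "gemini"),
    ("OPENAI_API_KEY", "openai"),
)


def detect_enhance_mode(env: Mapping[str, str], forced_mode: Optional[str] = None) -> str:
    """Pick an enhancement mode from available API keys, unless one is forced."""
    if forced_mode:
        return forced_mode
    for key, mode in _KEY_TO_MODE:
        if env.get(key):
            return mode
    return "LOCAL"


print(detect_enhance_mode({"GOOGLE_API_KEY": "placeholder"}))             # gemini
print(detect_enhance_mode({"OPENAI_API_KEY": "placeholder"}, "LOCAL"))   # LOCAL
print(detect_enhance_mode(os.environ) in {"claude", "gemini", "openai", "LOCAL"})
```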

create command argument forwarding — Universal flags (--dry-run, --verbose, --quiet, --name, --description) were crashing when used with GitHub, PDF, and codebase sources. All fixed. Also adds --dry-run support to skill-seekers github and skill-seekers pdf.

Upgrade

pip install --upgrade skill-seekers
docker pull yusufk/skill-seekers:latest

Full Changelog

See CHANGELOG.md for complete details.

v3.1.1

23 Feb 09:12
022b8a4

What's Changed

Full Changelog: v3.1.0...v3.1.1