Releases: elastic/crawler
v0.4.2
What's Changed
- [0.4] Switched Docker runtime image to jlink (#371) by @artem-shelkovnikov in #372
Full Changelog: v0.4.1...v0.4.2
v0.4.1
What's Changed
- [0.4] [CVE-2024-32002] Pin git version to 2.45.0 in Docker.wolfi (#362) by @mattnowzari in #364
- [0.4] [CVE-2025-6021] Bump nokogiri to 1.18.9 (#365) by @mattnowzari in #367
- Bump product version to 0.4.1 by @mattnowzari in #369
Full Changelog: v0.4.0...v0.4.1
v0.4.0
What's Changed
- Make ES hosts config an array by @navarone-feekery in #334
- Add comments to codebase around complex methods by @navarone-feekery in #338
- Add IRB console executable by @navarone-feekery in #341
- Added link to running Open Crawler in Windows by @mattnowzari in #344
- Clarified compatibility matrix by @mattnowzari in #348
- Introduce Docker multi stage build by @pioorg in #340
- Add a unit test for malformed html by @lorenabalan in #343
- Add `exclude_tags` option to the Crawler configuration by @lorenabalan in #346
Full Changelog: 0.3...0.4
v0.3.0
What's Changed
- Add CHANGELOG.md and upgrade to beta by @navarone-feekery in #121
- Bump version to 0.3.0 by @navarone-feekery in #122
- Update `.backportrc.json` by @navarone-feekery in #123
- Add CRAWLER_DIRECTIVES.md and purge crawls documentation by @navarone-feekery in #115
- Add feature comparison table by @navarone-feekery in #117
- Add docs for running official docker image by @navarone-feekery in #132
- Fix crawl result logs by @navarone-feekery in #134
- Add RELEASING.md by @navarone-feekery in #133
- Fix usage of in-built `File` lib by @navarone-feekery in #139
- Update ent-search-eng team to be a search-eng team by @tutelaris in #142
- Revert "Update ent-search-eng team to be a search-eng team" by @seanstory in #143
- Revert "Revert "Update ent-search-eng team to be a search-eng team"" by @seanstory in #145
- Rename the ingestion-team by @tutelaris in #146
- Use crawl for the first step vs schedule by @dadoonet in #147
- Pin rexml version to 3.3.8 by @navarone-feekery in #150
- Update README.md by @navarone-feekery in #163
- Update elasticsearch.yml.example by @navarone-feekery in #164
- Bump webrick, move to test group by @seanstory in #166
- Bump nokogiri, tika, remove explicit bouncycastle by @seanstory in #165
- Add a quickstart guide by @navarone-feekery in #170
- Add timestamps to the system logger by @navarone-feekery in #173
- Bumping rexml by @seanstory in #175
- Increases test coverage for url validator code by @bsantanna in #171
- Make elasticsearch the default value for output_sink by @devesh-2002 in #176
- Fixes #179 - Omits the pipeline key when pipeline_enabled: false by @ugosan in #180
- Adding check to ES sink to check if index is present before crawling by @mattnowzari in #186
- Update README.md by @navarone-feekery in #191
- Fixing EXTRACTION_RULES link by @JoseLuisGJ in #193
- [SNYK] Bump nokogiri lib by @jedrazb in #187
- Fix scheduling documentation by @navarone-feekery in #196
- Update docker files to remove /root/.m2 directory after installation to not distribute build dependencies by @artem-shelkovnikov in #200
- Adding ES verification step + explicit best-effort index creation during ES Sink initialization by @mattnowzari in #192
- Add ingest pipeline for 9.x by @navarone-feekery in #203
- Allow for full HTML extraction by @navarone-feekery in #204
- Fix CI pipeline by @navarone-feekery in #211
- Update RELEASING.md by @navarone-feekery in #210
- Update default docker-compose version by @navarone-feekery in #213
- Update FEATURE_COMPARISON.md by @navarone-feekery in #215
- Add `:latest` tag option to build jobs by @navarone-feekery in #214
- Add environment variable for M4 users by @meghanmurphy1 in #220
- Redirects with no location field should be logged and dropped by @mattnowzari in #219
- Check for build flavor as well as ES version during preflight check by @navarone-feekery in #225
- Add link to docker image by @navarone-feekery in #228
- Update API key permissions in README by @navarone-feekery in #227
- Clean up jars by @navarone-feekery in #229
- Re-implement slf4j-nop by @navarone-feekery in #230
- Adding function to nest flat YAML + elasticsearch fields in Crawler config prioritized over Elasticsearch config by @mattnowzari in #232
- Bump jruby to 9.4.12.0 by @navarone-feekery in #236
- Bump rack to 2.2.13 by @navarone-feekery in #234
- Bump nokogiri 1.18.6 by @navarone-feekery in #237
- Update protobuf-java to 3.25.5 by @tutelaris in #239
- Write Event and system logs to log files by @mattnowzari in #238
- Run renovate only on weekends by @artem-shelkovnikov in #241
- chore: install latest curl by @jedrazb in #242
- Fix curl dependency issue by @navarone-feekery in #245
- Update Dockerfile and Makefile by @navarone-feekery in #246
- Remove tika-parsers by @navarone-feekery in #249
- Fix Docker permissions issues + Adding optional log volume mount by @mattnowzari in #248
- Fallback to xpath to avoid CSS syntax errors by @mattnowzari in #250
- Updated README to include details on setting up logging by @mattnowzari in #243
- Add a vscode devcontainer for development by @strawgate in #257
- Make Elasticsearch client compression configurable (default true) by @strawgate in #252
- Add OS compatibility note by @seanstory in #264
- New CLI command to test single URLs with a given config by @mattnowzari in #262
- Add customizable retry and timeout settings for Elasticsearch by @strawgate in #258
- Updated CLI.md with details about new urltest command by @mattnowzari in #265
- Improve Host / Port and SSL Configuration by @strawgate in #259
- Proposing a simpler Quickstart by @strawgate in #256
- Add Meta tag and data attribute extraction by @mattnowzari in #270
- Ensure run_urltest_crawl!() method sets the @crawl_stage instance variable by @mattnowzari in #275
- Fix handling of partially flattened config keys by @strawgate in #273
- [Synk] [CVE-2025-32415] Bump Nokogiri to 1.18.8 by @mattnowzari in #279
- Improvements to LOGGING.md by @mattnowzari in #284
- Updating release-pipeline.yml to support building Docker images from main branch by @mattnowzari in #288
- [DOCS]: Restructure crawler readme by @charlotte-hoblik in #287
- fix: es client scheme/host/port configuration by @jedrazb in #294
- Updating RELEASING.md doc to be in line with current build procedures by @mattnowzari in #290
- [GA] Pretty-print post-completion summary counters by @mattnowzari in #296
- Document ssl ca certificate loading and add tests by @strawgate in #277
- Fix urltest erroneously marking successful crawls as failures due to lack of purge stage by @mattnowzari in #300
- Allow loading secrets from environment by @strawgate in #268
- Add slack notifications to pipeline yaml by @seanstory in #303
- Document and improve hidden configuration options by @mattnowzari in #302
- Expose settings for configuring behavior when the output sink is blocked by @strawgate in #292
- Improve the description of output_dir configuration in config.yml.example by @mattnowzari in #307
- Remove line break comments by @mattnowzari in #306
- [SNYK] [CVE-2025-46727] bump rack version to 2.2.14 by @meghanmurphy1 in #309
- Add crawl to Elasticsearch quickstart + minor docs cleanup by @leemthompo in #310
- File sink generates human-readable filenames by @mattnowzari in #311
- Replaced links to ent-search docs with links to Crawler example config by @mattnowzari in #312
- Support inter...
v0.2.2
What's Changed
- Bump version to 0.2.2 by @navarone-feekery in #190
- [0.2] Update FEATURE_COMPARISON.md (#215) by @github-actions in #216
- [0.2] Add `:latest` tag option to build jobs (#214) by @navarone-feekery in #217
- [0.2] Check for build flavor as well as ES version during preflight check (#225) by @github-actions in #226
- [0.2] Fix Docker permissions issues + Adding optional log volume mount (#248) by @mattnowzari in #260
Full Changelog: v0.2.1...v0.2.2
v0.2.1
What's Changed
- Bump product version to `0.2.1` by @navarone-feekery in #141
- [0.2] Fix usage of in-built `File` lib (#139) by @navarone-feekery in #140
- [0.2] Use crawl for the first step vs schedule (#147) by @dadoonet in #148
- [0.2] Adding check to ES sink to check if index is present before crawling (#186) by @mattnowzari in #189
- [0.2] Fixing EXTRACTION_RULES link (#193) by @JoseLuisGJ in #194
- [0.2] Fix scheduling documentation (#196) by @navarone-feekery in #199
- [0.2] Update docker files to remove /root/.m2 directory after installation to not distribute build dependencies (#200) by @artem-shelkovnikov in #202
- [0.2] Fixes #179 - Omits the pipeline key when pipeline_enabled: false (#180) by @ugosan in #206
- [0.2] Make elasticsearch the default value for output_sink (#176) by @devesh-2002 in #205
- [0.2] Adding ES verification step + explicit best-effort index creation during ES Sink initialization (#192) by @mattnowzari in #207
- [0.2] Allow for full HTML extraction (#204) by @navarone-feekery in #208
- [0.2] Add ingest pipeline for 9.x (#203) by @navarone-feekery in #209
- [0.2] Fix CI pipeline (#211) by @navarone-feekery in #212
Full Changelog: v0.2.0...v0.2.1
v0.2.0
What's Changed
- Bump version to 0.2.0 and add .backportrc.json by @navarone-feekery in #43
- Improve setup docs and add CLI docs by @navarone-feekery in #44
- Lock bulk queue while processing indexing request by @navarone-feekery in #45
- Update domains format in crawler config.yml by @navarone-feekery in #55
- Add extraction rules config classes by @navarone-feekery in #57
- Change field body_content to body by @navarone-feekery in #59
- Add content extraction by rules by @navarone-feekery in #58
- Add URL content extraction by @navarone-feekery in #61
- Add crawl rules by @navarone-feekery in #62
- Update docs by @navarone-feekery in #67
- Refactor ES classes by @navarone-feekery in #64
- Update crawler.yml.example by @navarone-feekery in #68
- Fix redirect and error crawl result handling by @navarone-feekery in #63
- Bump rexml to 3.3.4 by @navarone-feekery in #72
- Rename fatal error to internal error by @navarone-feekery in #70
- Configure Renovate by @elastic-renovate-prod in #75
- Pin dependencies by @elastic-renovate-prod in #76
- Update juliangruber/read-file-action digest to 386973d by @elastic-renovate-prod in #77
- Update jruby Docker tag to v9.4.8.0 by @elastic-renovate-prod in #86
- main was missing some "make install" diffs by @seanstory in #89
- Pin gems and set some platforms as jruby by @navarone-feekery in #91
- Update dependency bson to '~> 4.15.0' by @elastic-renovate-prod in #87
- Update renovate to only consider chainguard by @seanstory in #97
- align with connector pipeline settings by @seanstory in #98
- Create pull-requests.json by @seanstory in #99
- did I get the slug name wrong? by @seanstory in #100
- update webmock by @seanstory in #90
- Add binary content extraction by @navarone-feekery in #74
- Add purge crawl feature by @navarone-feekery in #65
- Update catalog-info.yaml for docker publishing by @navarone-feekery in #104
- Add Dockerfile.wolfi by @navarone-feekery in #106
- Update docker.elastic.co/wolfi/jdk Docker tag to openjdk-21.35-r1-dev by @elastic-renovate-prod in #110
- Add scheduling CLI command by @navarone-feekery in #112
- Add documentation for binary content extraction and ingest pipelines by @navarone-feekery in #113
- Add extraction rules examples by @navarone-feekery in #108
- Clean up config docs by @navarone-feekery in #116
- Add schedule command to CLI docs by @navarone-feekery in #118
- Misc fixes to the Wolfi-based Dockerfile by @acrewdson in #114
- Add docker publishing scripts and pipeline by @navarone-feekery in #103
- [0.2] Add CRAWLER_DIRECTIVES.md and purge crawls documentation (#115) by @github-actions in #126
- [0.2] Add CHANGELOG.md and upgrade to beta (#121) by @navarone-feekery in #125
- [0.2] Add feature comparison table (#117) by @github-actions in #127
- [0.2] Add docs for running official docker image (#132) by @github-actions in #135
- [0.2] Fix crawl result logs (#134) by @github-actions in #136
- [0.2] Add RELEASING.md (#133) by @github-actions in #137
New Contributors
- @elastic-renovate-prod made their first contribution in #75
- @acrewdson made their first contribution in #114
Full Changelog: v0.1.0...v0.2.0
v0.1.0
What's Changed
- Update README.md by @navarone-feekery in #1
- Create catalog-info file by @elastic-backstage-prod in #2
- Clean up some code and change examples by @navarone-feekery in #3
- Add URL information to output doc by @navarone-feekery in #4
- Add ES bulk indexing by @navarone-feekery in #5
- Add community health files by @navarone-feekery in #7
- Simplify rubocop and reformat codebase by @navarone-feekery in #6
- Add pipeline support and fix doc schema by @navarone-feekery in #8
- Add lint CI pipeline by @navarone-feekery in #9
- Add agent headers to ES requests by @navarone-feekery in #10
- Fix all specs and add unit test pipeline by @navarone-feekery in #11
- Add LICENSE file and headers by @navarone-feekery in #12
- Update catalog-info and clean up gemfile by @navarone-feekery in #13
- Add NOTICE.txt and drop more gems by @navarone-feekery in #14
- Add Crawler CLI version command by @vidok in #17
- Bump jruby and java versions by @navarone-feekery in #16
- Update rubocop's target ruby version by @navarone-feekery in #18
- [CLI] Add crawl command by @vidok in #19
- Add GitHub meta files by @navarone-feekery in #21
- Add noticefile generator and update NOTICE.txt by @navarone-feekery in #20
- Add SSL and CA fingerprint support by @navarone-feekery in #22
- Add Dockerfile and update documentation by @navarone-feekery in #23
- [CLI] Add validate command by @vidok in #24
- Enforce SSLv3 java opts for CLI by @navarone-feekery in #25
- Add API key documentation by @navarone-feekery in #26
- Add faux gem license headers by @navarone-feekery in #27
- Update PULL_REQUEST_TEMPLATE.md by @navarone-feekery in #29
- Fix invalid var names in validate flow by @navarone-feekery in #33
- Fix internal links in CONFIG.md by @jedrazb in #30
- Update example ES configs by @navarone-feekery in #34
- Tweaks in README.md by @jedrazb in #31
- Update CLI instructions and remove old CLI files by @navarone-feekery in #36
- Update docker instructions in README.md by @navarone-feekery in #35
- Enable debug logs and toggle event logs by @navarone-feekery in #38
- Add retry to bulk indexer by @navarone-feekery in #39
- Update scripts and README by @navarone-feekery in #32
New Contributors
- @navarone-feekery made their first contribution in #1
- @elastic-backstage-prod made their first contribution in #2
- @vidok made their first contribution in #17
- @jedrazb made their first contribution in #30
Full Changelog: https://github.com/elastic/crawler/commits/v0.1.0