Wrong length check in summary when using xpath=True results in wrong summaries #146

Open
@yeus

When using xpath=True in summary to extract the main content, the result is wrong for several webpages; without it, the result is correct.

The reason is that the length check in the summary function is done on the HTML including the xpath attributes, which should not be the case. As a result, running with and without xpath gives different results, and the extra attribute text implicitly changes the effective length threshold for selecting the summary.

article_length = len(cleaned_article or "")
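
To illustrate the effect, here is a minimal sketch that uses only the standard library, not the library's actual code. The attribute name `x`, the helper `annotate_with_paths`, and the threshold value `RETRY_LENGTH = 250` are assumptions made for this example. It shows how the same content can fall on different sides of a length threshold once a path attribute is attached to every element.

```python
import xml.etree.ElementTree as ET

RETRY_LENGTH = 250  # assumed threshold, for illustration only


def annotate_with_paths(root):
    """Attach a simple positional path to every element as attribute "x" (hypothetical helper)."""
    def walk(elem, path):
        elem.set("x", path)
        counts = {}
        for child in elem:
            # Number siblings per tag, similar to an XPath positional index.
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")
    walk(root, "/" + root.tag)


def serialized_length(root):
    """Length of the serialized markup, attributes included."""
    return len(ET.tostring(root, encoding="unicode"))


html = "<div>" + "".join(f"<p>Paragraph {i}.</p>" for i in range(8)) + "</div>"

plain = ET.fromstring(html)
annotated = ET.fromstring(html)
annotate_with_paths(annotated)

plain_len = serialized_length(plain)
annotated_len = serialized_length(annotated)

# Same text content, but only the annotated version passes the length check.
print(plain_len, plain_len >= RETRY_LENGTH)
print(annotated_len, annotated_len >= RETRY_LENGTH)
```

The text content is identical in both trees; only the attribute overhead differs, and that alone is enough to flip the outcome of the check.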

One idea might be to add the xpath attributes to the HTML at the end, after all calculations have been done, rather than at the beginning.
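
A possible variant of this idea, sketched with the standard library only: keep the attributes where they are, but measure the length on a copy with them stripped. This is not the library's code; the helper name `length_without_paths` and the attribute name `x` are assumptions for this example. One reason to consider this variant is that paths added only at the end would presumably describe the cleaned tree rather than the original document.

```python
import copy
import xml.etree.ElementTree as ET


def length_without_paths(root, attr="x"):
    """Serialized length of the tree, ignoring the path attribute (hypothetical helper)."""
    clone = copy.deepcopy(root)  # leave the caller's tree untouched
    for elem in clone.iter():
        elem.attrib.pop(attr, None)
    return len(ET.tostring(clone, encoding="unicode"))


annotated = ET.fromstring('<div x="/div"><p x="/div/p[1]">Hello</p></div>')

# The check sees the same length as it would without xpath=True ...
article_length = length_without_paths(annotated)
print(article_length)

# ... while the returned tree still carries its path attributes.
print(annotated.get("x"))
```

With a measurement like this, the length threshold would mean the same thing regardless of whether xpath annotation is enabled.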

best,
Thomas
