Readability of MSN articles #181

Open
@rpdelaney

I'm struggling to get this working with MSN news articles. Here's the approach I'm using:

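import requests
from bs4 import BeautifulSoup as bs
from readability import Document


def fetch_url(url: str, timeout: int = 10) -> str:
    """Get the content from a page at URL, if it is a URL."""
    # is_url() is a helper that is not shown here.
    if not is_url(url):
        return url

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    soup = bs(response.content, "html.parser")

    # Returns the page's text with all markup stripped, not the raw HTML.
    return soup.get_text()


def summarize(content: str) -> str:
    """Take content and use readability to return a document summary."""
    doc = Document(content)

    title: str = doc.short_title()
    # doc.summary() returns HTML; reduce it to plain text.
    summary: str = bs(doc.summary(), "lxml").text

    return f"{title}\n{summary}"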

This works well on every other news site I've tried, but not on MSN.

Example: with this URL, the title comes back as just "MSN" and the summary is empty.
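
One way to narrow this down, as a rough sketch: measure how much readable text the server-delivered HTML contains before any extraction step. If the count is tiny (little more than the page title), the article body is probably not in the static markup at all, which would explain an empty summary. That is only a hypothesis about MSN; nothing here confirms it. `VisibleTextCounter` and `visible_text_length` are hypothetical names made up for this check, built on the standard library's `html.parser`, and are not part of readability.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Count text characters outside <script>, <style>, and <noscript>."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0      # > 0 while inside a skipped element
        self.visible_chars = 0    # running total of counted characters

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore surrounding whitespace so indentation does not inflate the count.
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def visible_text_length(html: str) -> int:
    """Return the number of text characters found in the given HTML string."""
    parser = VisibleTextCounter()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.visible_chars
```

To use it, pass the raw response body (for example `response.text` from `requests`) to `visible_text_length` and compare the result for an MSN article against a site where the summary works.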

Any suggestions?
