Donc summary() won't work on this web site #112

Open
@MChrys

summary() does not seem to work on websites where the text is split across many tags.
I encountered this problem specifically on this website:
https://start.lesechos.fr/actu-entreprises/services/a-19-ans-il-est-le-plus-jeune-patissier-prime-au-guide-michelin-13983.php

import requests
from readability import Document

url = "https://start.lesechos.fr/actu-entreprises/services/a-19-ans-il-est-le-plus-jeune-patissier-prime-au-guide-michelin-13983.php"
page = requests.get(url).text
doc = Document(page)
doc.summary()
<html><body><div><div id="outer-main">\n\n\n\n<p class="ads tag1">\n\n</p>\n\n\n\n\n\n\n<a
href="" target="_blank" class="btn-piston "/>\n\n\n\n<article>\n<div id="content">\n<div
id="news">\n<div class="grid">\n<div class="contain">\n<div class="row">\n\n<div class="col
full">\n\n<span class="cat">Délices sucrés</span>\n<h1 class="page-title nobg">\nA 19 ans, il est le
plus jeune pâtissier primé au Guide Michelin</h1>\n<p class="meta">\n<span class="author">\nPar
Camille Wong</span>\n|\n<time datetime="2019-01-22T13:12">\n22/01/2019 à 14:30,</time>\nmis à
jour le 22/01/2019</p>\n\n\n<div class="picture first">\n<figure>\n\n<figcaption>\n<p
class="legend">Jessy Rhinn-Auvray (à gauche), 19 ans, et son mentor Nicolas Stamm, 46 ans, lors de la
cérémonie du Guide Michelin, le 21 janvier.\n <strong>@DR</strong
</p>\n</figcaption>\n</figure>\n</div>\n</div>\n\n</div>\n</div>\n</div>\n</div>\n</div>\n<
article>\n\n</div>\n\n\n</div></body></html>

Almost none of the article's paragraphs appear in the output above: only the title, the byline and the picture caption are kept.
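One way to quantify the loss is to compare the visible text kept in the summary with the visible text of the original page. The following is only a minimal sketch using the standard library; `visible_text`, `retained_ratio` and the sample strings are illustrative names I made up, not part of readability:

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text nodes, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only nodes and anything inside script/style.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the visible text of an HTML string, joined with spaces."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def retained_ratio(page_html: str, summary_html: str) -> float:
    """Fraction of the page's visible characters that survive in the summary."""
    page_len = len(visible_text(page_html))
    return len(visible_text(summary_html)) / page_len if page_len else 0.0


# Made-up sample data, standing in for `page` and `doc.summary()`.
sample_page = "<html><body><h1>Title</h1><p>Body text.</p><script>var x = 1;</script></body></html>"
sample_summary = "<html><body><h1>Title</h1></body></html>"
print(retained_ratio(sample_page, sample_summary))
```

With the real page, passing `page` and `doc.summary()` to `retained_ratio` should give a rough idea of how much of the article body was dropped.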

Maybe you could add an option to the Document object, something like:

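# Proposed option (pseudocode): concatenate the text of every high-scoring candidate
if aggregation_mean:
    aggregation = ""
    # renamed from max/min to avoid shadowing the built-ins
    best_score = self.select_best_candidate(candidates).score
    # select_worst_candidate is a proposed helper, mirroring select_best_candidate
    worst_score = self.select_worst_candidate(candidates).score
    for c in candidates:
        if c.score >= best_score - worst_score:
            aggregation += c.text

    return aggregation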
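To make the idea concrete, here is a self-contained sketch of the same aggregation, independent of readability's internals. `Candidate` and `aggregate_candidates` are hypothetical names, the scores are made up, and the threshold `best - worst` simply mirrors the proposal above; unlike the proposal it joins the texts with newlines so paragraphs stay separated:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Candidate:
    """Hypothetical stand-in for a scored block of page text."""
    score: float
    text: str


def aggregate_candidates(candidates: Iterable[Candidate]) -> str:
    """Join the text of every candidate scoring at least best - worst."""
    items = list(candidates)
    if not items:
        return ""
    best_score = max(c.score for c in items)
    worst_score = min(c.score for c in items)
    threshold = best_score - worst_score
    # Keep the input order so the paragraphs read in sequence.
    return "\n".join(c.text for c in items if c.score >= threshold)


# Made-up scores: two article paragraphs and one low-scoring block.
paragraphs = [
    Candidate(score=12.0, text="First paragraph."),
    Candidate(score=9.5, text="Second paragraph."),
    Candidate(score=3.0, text="Cookie banner."),
]
print(aggregate_candidates(paragraphs))
```

Here the threshold is 12.0 - 3.0 = 9.0, so both article paragraphs are kept and the low-scoring block is dropped. A different threshold (for example a fraction of the best score) may behave better when all scores are close together.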

I just tried Reader mode in Safari and it works perfectly on this page; it seems to be based on Arc90's Readability as well.
