Channel: Active questions tagged feedparser - Stack Overflow

How to combine Python's re and feedparser modules into one function

I have a feed parser function that successfully parses data from an RSS feed.

def get_posts_details(rss=None):
    if rss is not None:
        # import the library only when url for feed is passed
        import feedparser
        # parsing feed
        blog_feed = feedparser.parse(rss)
        # getting lists of blog entries via .entries
        posts = blog_feed.entries
        # dictionary for holding posts details
        posts_details = {
            "Blog title": blog_feed.feed.title,
            "Blog link": blog_feed.feed.link,
        }
        post_list = []
        # iterating over individual posts
        for post in posts:
            temp = dict()
            # if any post doesn't have information then throw error.
            try:
                temp["title"] = post.title
                temp["link"] = post.link
                temp["author"] = post.author
                temp["time_published"] = post.published
                temp["tags"] = [tag.term for tag in post.tags]
                temp["authors"] = [author.name for author in post.authors]
                temp["summary"] = post.summary
            except:
                pass
            post_list.append(temp)
        # storing lists of posts in the dictionary
        posts_details["posts"] = post_list
        return posts_details  # returning the details which is dictionary
    else:
        return None

if __name__ == "__main__":
    import json
    feed_url = "https://feeds.feedburner.com/SiliconCanals"
    data = get_posts_details(rss=feed_url)  # return blogs data as a dictionary
    if data:
        # printing as a json string with indentation level = 2
        print(json.dumps(data, indent=2))
    else:
        print("None")

All the other parsed data looks fine. However, I have an issue with how the title and summary are represented: the output contains HTML tags and escaped characters. They look like this:

"title": "Partech announces closing of its second Growth fund at \u20ac650M: Know more about the VC\u2019s investment plan with this fund"
"summary": "<a href=\"https://siliconcanals.com/crowdfunding/partech-second-growth-fund-of-650m/\" rel=\"nofollow\" title=\"Partech announces closing of its second Growth fund at \u20ac650M: Know more about the VC\u2019s investment plan with this fund\"><img alt=\"Partech\" class=\"webfeedsFeaturedVisual wp-post-image\" height=\"393\" src=\"https://siliconcanals.com/wp-content/uploads/2021/11/Omri-Benayoun-and-Bruno-Cremel-750x393.jpg\" style=\"display: block; margin: auto; margin-bottom: 5px;\" width=\"750\" /></a>Partech is a global investment platform for tech and digital companies, led by ex-entrepreneurs and operators of the industry spread across offices in San Francisco, Paris, Berlin, and Dakar."

I have another function, stripHTML(), that uses Python's re module to clean up such text.

import re

def stripHTML(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

output = stripHTML('<a href=\"https://siliconcanals.com/crowdfunding/partech-second-growth-fund-of-650m/\" rel=\"nofollow\" title=\"Partech announces closing of its second Growth fund at \u20ac650M: Know more about the VC\u2019s investment plan with this fund\"><img alt=\"Partech\" class=\"webfeedsFeaturedVisual wp-post-image\" height=\"393\" src=\"https://siliconcanals.com/wp-content/uploads/2021/11/Omri-Benayoun-and-Bruno-Cremel-750x393.jpg\" style=\"display: block; margin: auto; margin-bottom: 5px;\" width=\"750\" /></a>Partech is a global investment platform for tech and digital companies, led by ex-entrepreneurs and operators of the industry spread across offices in San Francisco, Paris, Berlin, and Dakar.')
print(output)

The output is:

Partech is a global investment platform for tech and digital companies, led by ex-entrepreneurs and operators of the industry spread across offices in San Francisco, Paris, Berlin, and Dakar.

How do I combine both functions into one, so that get_posts_details() outputs plain text and the title and summary look like this?

"title": "Partech announces closing of its second Growth fund at €650M: Know more about the VC’s investment plan with this fund"
"summary": "Partech is a global investment platform for tech and digital companies, led by ex-entrepreneurs and operators of the industry spread across offices in San Francisco, Paris, Berlin, and Dakar."
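One possible way to combine the two, as a minimal sketch rather than a definitive answer: keep the regex in a small helper and call it on the title and summary while building each post. The names strip_html and TAG_RE are illustrative and not from the original code. Two further assumptions are made here: feedparser entries are read with .get() instead of the bare try/except, so one missing field does not drop the fields after it, and html.unescape is applied after stripping so entities such as &amp; also become plain text. The \u20ac and \u2019 sequences in the printed JSON come from json.dumps escaping non-ASCII characters by default; passing ensure_ascii=False prints them as € and ’.

```python
import html
import json
import re

# Non-greedy match for anything that looks like an HTML tag.
# This is a simple heuristic, not a full HTML parser.
TAG_RE = re.compile(r"<.*?>", re.DOTALL)


def strip_html(text):
    """Remove tag-like spans, then decode HTML entities (illustrative helper)."""
    return html.unescape(TAG_RE.sub("", text)).strip()


def get_posts_details(rss=None):
    if rss is None:
        return None

    # Third-party library, imported only when a feed URL is passed,
    # as in the original function.
    import feedparser

    blog_feed = feedparser.parse(rss)
    posts_details = {
        "Blog title": blog_feed.feed.get("title"),
        "Blog link": blog_feed.feed.get("link"),
    }

    post_list = []
    for post in blog_feed.entries:
        post_list.append({
            # Title and summary are cleaned before being stored.
            "title": strip_html(post.get("title", "")),
            "link": post.get("link"),
            "author": post.get("author"),
            "time_published": post.get("published"),
            "tags": [tag.get("term") for tag in post.get("tags", [])],
            "authors": [a.get("name") for a in post.get("authors", [])],
            "summary": strip_html(post.get("summary", "")),
        })

    posts_details["posts"] = post_list
    return posts_details
```

To print the result with the non-ASCII characters intact, the call from the question would become print(json.dumps(data, indent=2, ensure_ascii=False)), where data is the dictionary returned by get_posts_details(rss=feed_url).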
