I have a feed parser function that successfully parses data from an RSS feed.
    def get_posts_details(rss=None):
        if rss is not None:
            # import the library only when url for feed is passed
            import feedparser

            # parsing feed
            blog_feed = feedparser.parse(rss)

            # getting lists of blog entries via .entries
            posts = blog_feed.entries

            # dictionary for holding posts details
            posts_details = {
                "Blog title": blog_feed.feed.title,
                "Blog link": blog_feed.feed.link,
            }

            post_list = []

            # iterating over individual posts
            for post in posts:
                temp = dict()

                # if any post doesn't have information then throw error.
                try:
                    temp["title"] = post.title
                    temp["link"] = post.link
                    temp["author"] = post.author
                    temp["time_published"] = post.published
                    temp["tags"] = [tag.term for tag in post.tags]
                    temp["authors"] = [author.name for author in post.authors]
                    temp["summary"] = post.summary
                except:
                    pass

                post_list.append(temp)

            # storing lists of posts in the dictionary
            posts_details["posts"] = post_list

            return posts_details  # returning the details which is dictionary
        else:
            return None


    if __name__ == "__main__":
        import json

        feed_url = "https://feeds.feedburner.com/SiliconCanals"

        data = get_posts_details(rss=feed_url)  # return blogs data as a dictionary

        if data:
            # printing as a json string with indentation level = 2
            print(json.dumps(data, indent=2))
        else:
            print("None")
All other parsed data looks OK. However, I have an issue with how the title and summary are represented: the summary contains HTML tags, and non-ASCII characters are printed as escape sequences such as \u20ac. They look like this:
    "title": "Partech announces closing of its second Growth fund at \u20ac650M: Know more about the VC\u2019s investment plan with this fund"
    "summary": "<a href=\"https://siliconcanals.com/crowdfunding/partech-second-growth-fund-of-650m/\" rel=\"nofollow\" title=\"Partech announces closing of its second Growth fund at \u20ac650M: Know more about the VC\u2019s investment plan with this fund\"><img alt=\"Partech\" class=\"webfeedsFeaturedVisual wp-post-image\" height=\"393\" src=\"https://siliconcanals.com/wp-content/uploads/2021/11/Omri-Benayoun-and-Bruno-Cremel-750x393.jpg\" style=\"display: block; margin: auto; margin-bottom: 5px;\" width=\"750\" /></a>Partech is a global investment platform for tech and digital companies, led by ex-entrepreneurs and operators of the industry spread across offices in San Francisco, Paris, Berlin, and Dakar."
I have a separate function, stripHTML, that uses Python's re module to clean up such text.
    import re


    def stripHTML(data):
        p = re.compile(r'<.*?>')
        return p.sub('', data)


    output = stripHTML('<a href=\"https://siliconcanals.com/crowdfunding/partech-second-growth-fund-of-650m/\" rel=\"nofollow\" title=\"Partech announces closing of its second Growth fund at \u20ac650M: Know more about the VC\u2019s investment plan with this fund\"><img alt=\"Partech\" class=\"webfeedsFeaturedVisual wp-post-image\" height=\"393\" src=\"https://siliconcanals.com/wp-content/uploads/2021/11/Omri-Benayoun-and-Bruno-Cremel-750x393.jpg\" style=\"display: block; margin: auto; margin-bottom: 5px;\" width=\"750\" /></a>Partech is a global investment platform for tech and digital companies, led by ex-entrepreneurs and operators of the industry spread across offices in San Francisco, Paris, Berlin, and Dakar.')
    print(output)
The output is:
Partech is a global investment platform for tech and digital companies, led by ex-entrepreneurs and operators of the industry spread across offices in San Francisco, Paris, Berlin, and Dakar.
How do I combine both functions into one, so that get_posts_details() returns plain text and the title and summary come out like this:
    "title": "Partech announces closing of its second Growth fund at €650M: Know more about the VC’s investment plan with this fund"
    "summary": "Partech is a global investment platform for tech and digital companies, led by ex-entrepreneurs and operators of the industry spread across offices in San Francisco, Paris, Berlin, and Dakar."
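To make the goal concrete, here is a minimal sketch of the kind of combination I have in mind. It is only an illustration, not a tested solution against the live feed: strip_html, clean_post and the sample dictionary are hypothetical names and data made up for this example, and the post is modelled as a plain dict rather than a feedparser entry. It also assumes that the \u20ac escapes come from json.dumps escaping non-ASCII characters by default, which ensure_ascii=False turns off.

```python
import html
import json
import re

# Same pattern as stripHTML above; DOTALL lets a tag span several lines.
TAG_RE = re.compile(r"<.*?>", re.DOTALL)


def strip_html(text):
    """Remove HTML tags, decode entities such as &amp;, and trim whitespace."""
    return html.unescape(TAG_RE.sub("", text)).strip()


def clean_post(post):
    """Return a copy of a post dict with 'title' and 'summary' stripped of HTML."""
    cleaned = dict(post)
    for key in ("title", "summary"):
        if key in cleaned:
            cleaned[key] = strip_html(cleaned[key])
    return cleaned


# Hypothetical sample standing in for one parsed feed entry.
sample = {
    "title": "Fund closes at \u20ac650M",
    "summary": '<a href="https://example.com/"><img alt="x" /></a>'
               "Partech is a global investment platform.",
}

# ensure_ascii=False keeps characters such as the euro sign unescaped.
print(json.dumps(clean_post(sample), indent=2, ensure_ascii=False))
```

Inside get_posts_details() the same idea would presumably amount to wrapping the two assignments, for example temp["title"] = strip_html(post.title) and temp["summary"] = strip_html(post.summary), and passing ensure_ascii=False to the existing json.dumps call.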