Small Python Projects: Build a News Dataset

One of the easiest projects you can do in Python is building a dataset by scraping a particular website. In this project, we will use the PyGoogleNews library to extract Google News items. We will tune this web scraper to focus on a particular keyword, language and search-engine location. Additionally, you will learn how to translate the titles with the TextBlob library and run sentiment analysis on them.

Follow Along with the Video

Let’s Dig into the Code:

# Import the libraries
from pygooglenews import GoogleNews
import pandas as pd

# Create the Google News client ('ja' is the language code for Japanese, 'JP' the country)
gn = GoogleNews(lang='ja', country='JP')

# Build a list of dictionaries holding each article's title, link and publication date
def get_titles(keyword):
  news = []
  search = gn.search(keyword)
  articles = search['entries']
  for entry in articles:
    article = {'title': entry.title, 'link': entry.link, 'published': entry.published}
    news.append(article)
  return news

data = get_titles("ポケットモン")
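
The loop in get_titles can be tried out without touching the network. The sketch below assumes each entry behaves like a mapping with title, link and published keys (feedparser-style entries allow key access as well as attribute access). The helper name entries_to_records and the sample entries are made up for illustration.

```python
def entries_to_records(entries):
    """Turn feed entries into plain dicts with title, link and published."""
    fields = ("title", "link", "published")
    records = []
    for entry in entries:
        # .get() keeps the row even when a field is missing from the feed
        records.append({field: entry.get(field) for field in fields})
    return records


# Hypothetical entries standing in for search['entries']
sample_entries = [
    {"title": "Example headline", "link": "https://example.com/a",
     "published": "Mon, 01 Jan 2024 08:00:00 GMT", "source": "ignored"},
    {"title": "No date here", "link": "https://example.com/b"},
]

records = entries_to_records(sample_entries)
print(records)
```

Filling missing fields with None, rather than raising, means one incomplete article does not discard the rest of the search results.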

# Load the results into a DataFrame so that we can start translating the titles
df = pd.DataFrame(data)

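# TextBlob is our natural language processing library
from textblob import TextBlob

# translate() takes a source language (from_lang) and a target language (to), for example:
#   TextBlob(text).translate(from_lang='ja', to='en')
# Note: translate() was deprecated and later removed in recent TextBlob releases,
# so this step needs an older TextBlob version that still provides it.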

# Let's create functions that return the translation and the sentiment
def translation(text):
  blob = TextBlob(text)
  return str(blob.translate(from_lang='ja', to='en'))

def sentiment(text):
  blob = TextBlob(text)
  return blob.sentiment.polarity

df['translation'] = df['title'].apply(translation)
df['sentiment'] = df['translation'].apply(sentiment)
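
Because translation goes through an online service, a single failed title would stop the whole apply call. One possible guard, sketched here with a hypothetical wrapper named with_fallback and a stand-in translator instead of the real TextBlob call, returns the original text whenever the wrapped function raises.

```python
def with_fallback(func):
    """Wrap func so that any exception returns the input unchanged."""
    def wrapper(text):
        try:
            return func(text)
        except Exception:
            # Keep the untranslated title rather than aborting the whole column
            return text
    return wrapper


# Stand-in for the real translation function (assumption: it may raise on some inputs)
def fake_translation(text):
    if not text.strip():
        raise ValueError("nothing to translate")
    return text.upper()


safe_translation = with_fallback(fake_translation)
print(safe_translation("pokemon news"))
print(repr(safe_translation("   ")))
```

With the real function this would read df['title'].apply(with_fallback(translation)). Catching every exception is a blunt choice made for brevity; narrowing it to the errors you actually see is safer.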

# Let's turn the polarity score into a sentiment class
import numpy as np

df['Sentiment Class'] = np.where(df['sentiment'] < 0, "negative",
                                 np.where(df['sentiment'] > 0, "positive",
                                          "neutral"))

# Export the results (writing .xlsx files requires an Excel engine such as openpyxl)
df.to_excel('output_file.xlsx')
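
The nested np.where call encodes a three-way rule: a score below zero is negative, above zero is positive, and exactly zero is neutral. The same rule written as a plain function, under the hypothetical name classify_polarity, makes the boundaries easy to check by hand.

```python
def classify_polarity(score):
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if score < 0:
        return "negative"
    if score > 0:
        return "positive"
    return "neutral"


for score in (-0.4, 0.0, 0.75):
    print(score, classify_polarity(score))
```

It could be applied with df['sentiment'].apply(classify_polarity). The np.where version is vectorized and will usually be faster on large frames, while the function form is easier to extend, for example with a wider neutral band around zero.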

Gaelim Holland
