Scraping Data

Tue, Apr 14, 2020

Readings

Mining Social Media
Scraping for Journalists
Octoparse
Web Robots
ParseHub
ScrapeHub
Outwit Hub
Import.io

Launch Google Sheets

You can start a new Google Spreadsheet at the following URL:

First, let’s explore the website of the Amador County Health Department.

Type the following formula into cell A1:

=importhtml("https://www.amadorgov.org/services/covid-19/-fsiteid-1", "table", 1)
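To see what the formula is doing, here is a rough Python sketch of the same idea: read a page, collect its tables, and return one of them by position. It uses only the standard library. The names TableParser and import_html_table are made up for this sketch, the sample page and its numbers are placeholders rather than real county data, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []      # one entry per <table>
        self._cell = None     # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.tables[-1][-1].append("".join(self._cell).strip())
            self._cell = None


def import_html_table(html, index):
    """Return the index-th table (counting from 1, like the formula)."""
    parser = TableParser()
    parser.feed(html)
    return parser.tables[index - 1]


# Made-up sample page; the values are placeholders, not real data.
SAMPLE_HTML = """
<table>
  <tr><th>Category</th><th>Count</th></tr>
  <tr><td>Example A</td><td>10</td></tr>
  <tr><td>Example B</td><td>2</td></tr>
</table>
"""

rows = import_html_table(SAMPLE_HTML, 1)
print(rows)
```

The last argument of the formula plays the same role as index here: 1 means the first table on the page.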

Next, let’s scrape a website using XPath.

Visit:

When prompted, we will type in the following formula:

=importxml("https://www.cisionjobs.co.uk/jobs/journalist/","//*[@id='listing']")
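The XPath in that formula reads as "any element, anywhere in the page, whose id attribute is listing". A small Python sketch of the same query is below. It assumes a made-up, well-formed snippet in place of the real jobs page, and it uses the standard library's ElementTree, which understands only a subset of XPath and will not parse messy real-world HTML.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet standing in for a jobs page.
SAMPLE_PAGE = """
<html>
  <body>
    <div id="header">Site header</div>
    <div id="listing">
      <h3>Example job title</h3>
      <p>Example employer</p>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE_PAGE)

# ElementTree needs a leading "." so the search is relative to the root.
matches = root.findall(".//*[@id='listing']")

# Gather the non-empty text inside each matching element.
listing_text = [
    [piece.strip() for piece in node.itertext() if piece.strip()]
    for node in matches
]
print(listing_text)
```

Only the element with id="listing" is matched, so the header text is left out of the result.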

We’ll start by using Google Colab at the following URL:

You’ll need to use your Google Account to log in.

Click the GitHub tab, and search for J220-Intro-Coding/scraping_example.

During the in-class exercise, I will go over how to parse the website to locate the appropriate content. We will arrive at the following together.

# soup is assumed to be the BeautifulSoup object built from the page earlier in the notebook
soup.find_all('table')[3].find_all('td')[1].text.strip()
soup.find_all('table')[3].find_all('td')[3].text.strip()
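Each of those lines is a chain of small steps: find every table, pick one by position, find every cell in it, pick one cell, then take its text and trim the whitespace. The sketch below walks through the same chain on a made-up page with two tables, so the positions differ from the ones used in class. It assumes the beautifulsoup4 package is installed, as it is in Colab.

```python
from bs4 import BeautifulSoup

# Made-up page with two tables; the page used in class has more.
SAMPLE_HTML = """
<table><tr><td>Navigation</td></tr></table>
<table>
  <tr><td>Label one</td><td>  Value one  </td></tr>
  <tr><td>Label two</td><td>  Value two  </td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

tables = soup.find_all("table")   # every <table>, in page order
cells = tables[1].find_all("td")  # every <td> inside the second table

# Positions count from 0, so cells[1] is the second cell.
first_value = cells[1].text.strip()
second_value = cells[3].text.strip()
print(first_value, second_value)
```

Unlike the spreadsheet formula, Python counts positions from 0, so [3] in the class example means the fourth table on the page.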

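from google.colab import files
# mycsv is assumed to be the pandas DataFrame built in the notebook
mycsv.to_csv('myfile.csv')
files.download('myfile.csv')  # download the saved file from Colab to your computer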
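If you want to see what ends up inside a CSV file, here is a minimal sketch that writes a few rows with Python's built-in csv module and reads them back. This is a swapped-in alternative to the to_csv call above, not the notebook's own method, and the rows and column names are placeholders.

```python
import csv
import os
import tempfile

# Placeholder rows standing in for values pulled out of a page.
rows = [
    {"label": "Label one", "value": "Value one"},
    {"label": "Label two", "value": "Value two"},
]

# Write to a temporary folder rather than the working directory.
out_path = os.path.join(tempfile.mkdtemp(), "myfile.csv")
with open(out_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["label", "value"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm what was saved.
with open(out_path, newline="", encoding="utf-8") as f:
    saved = list(csv.reader(f))
print(saved)
```

The first row of the file holds the column names, and each later row holds one scraped record.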