Text Analysis on Comedy Specials: Part 1
Comedy is subjective.
Comedy is about delivering a story that gets a laugh. What you find funny, someone else might find offensive or even tragic. So what do we actually know about comedy?
Is there an actual formula? And if so, how would one express it with emoticons?
Formula
😂 = 😢 + Δt
Where comedy, noted by 😂, is the sum of tragedy (an event causing great suffering to one person or a group of people), noted by 😢, and the passing of time relative to that person or group, noted by Δt.
ok fine!
In what looks like it will be a multipart tutorial, in Part 1 we will extract the transcript text for several stand-up specials that were televised over the years.
Data Extraction
To my knowledge, there is no public API for stand-up specials or anything else comedy-related, so we need to get our data somewhere else.
While searching, I stumbled upon scrapsfromtheloft.com, which has quite a few transcripts of stand-up specials.
We will get our data from https://scrapsfromtheloft.com/tag/stand-up-transcripts/
Understanding the structure of the page
Before we start coding, we need to decide what data to extract and see how it's displayed on the page.
Under https://scrapsfromtheloft.com/tag/stand-up-transcripts/ is a list of posts with the tag stand-up-transcripts.
Data to extract
- name of comedy special
- link to the transcript text
- name of the comedian
- summary text of the special
In Chrome we can inspect the page with the keyboard shortcut Ctrl + Shift + I, or by right-clicking and selecting "Inspect".
We can see that each post is under <div class='fusion-post-content post-content'>
- The title and transcript link are under the <h2> and <a> tags respectively
- We can extract the name of the comedian from the list of tags under <span class='meta-tags'>
- The summary text is under <div class='fusion-post-content-container'>
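To quickly check that these selectors line up, a sketch like the one below can be run in a notebook. The class names and the nesting of the <a> tag inside the <h2> are assumptions based on the inspection above, so they may need adjusting if the page changes.

import requests
from bs4 import BeautifulSoup

URL = 'https://scrapsfromtheloft.com/tag/stand-up-transcripts/'
soup = BeautifulSoup(requests.get(URL, timeout=10).text, 'html.parser')

# Look at the first post on the page and print the pieces we care about.
post = soup.select_one('div.fusion-post-content.post-content')
print(post.h2.get_text(strip=True))                # name of the comedy special
print(post.h2.a['href'])                           # link to the transcript
print([a.get_text(strip=True)
       for a in post.select('span.meta-tags a')])  # tags, including the comedian
print(post.select_one('div.fusion-post-content-container').get_text(strip=True))  # summary text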
Web Scraping
Tools & libraries
- Python
- requests
- BeautifulSoup
- Pandas
- Jupyter Notebook
Now that we know what data to extract, we will need to find a way to download the page.
I found a great blog post on best practices for using requests, so I will modify some of that code to fit my needs.
Steps
- create a function called requests_retry_session() that builds a requests session with retry logic.
- create a function called get_response() that calls requests_retry_session() and handles errors and exceptions.
- create a function called bs_parser_post() that parses the page and extracts the required data using BeautifulSoup.
- create a dataframe with the extracted data (a sketch of these steps follows below).
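A minimal sketch of those four steps might look like the code below. The session/retry setup follows the pattern from the blog post mentioned above; the function bodies and the CSS class names are one possible way to fill in the steps rather than the definitive implementation.

import pandas as pd
import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def requests_retry_session(retries=3, backoff_factor=0.3,
                           status_forcelist=(500, 502, 504), session=None):
    # Build a requests Session that retries transient failures with a
    # backoff instead of giving up on the first error.
    session = session or requests.Session()
    retry = Retry(total=retries, read=retries, connect=retries,
                  backoff_factor=backoff_factor,
                  status_forcelist=status_forcelist)
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session


def get_response(url):
    # Fetch a page through the retrying session and return its HTML,
    # or None if the request still fails after all retries.
    try:
        response = requests_retry_session().get(url, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as err:
        print(f'Request for {url} failed: {err}')
        return None


def bs_parser_post(html):
    # Pull the title, transcript link, tags and summary out of every post
    # on a listing page. The class names are the ones found while
    # inspecting the page and may change if the site is redesigned.
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for post in soup.select('div.fusion-post-content.post-content'):
        title = post.find('h2')
        link = title.find('a') if title else None
        tags = post.select('span.meta-tags a')
        summary = post.select_one('div.fusion-post-content-container')
        rows.append({
            'title': title.get_text(strip=True) if title else None,
            'link': link['href'] if link else None,
            'tag': [a.get_text(strip=True) for a in tags],
            'summary': summary.get_text(strip=True) if summary else None,
        })
    return rows


html = get_response('https://scrapsfromtheloft.com/tag/stand-up-transcripts/')
df = pd.DataFrame(bs_parser_post(html))
df.head()

If the tag archive spans more than one listing page, the same parser can simply be run over each page URL and the results concatenated.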
Results
We now have a dataframe with the data we scraped, but it needs some cleaning.
We can clean up the tag column so that it displays the comedian's name only.
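Assuming the tag column came out of the parser as a list of tags, with the comedian's name sitting next to generic labels such as 'stand-up transcripts', a rough clean-up could look like this; the set of labels to filter out is an assumption and may need adjusting.

# Keep only the tag that names the comedian.
GENERIC_TAGS = {'stand-up transcripts', 'transcripts'}

df['tag'] = df['tag'].apply(
    lambda tags: next((t for t in tags if t.lower() not in GENERIC_TAGS), None)
)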
Change the tag column name and reorder the columns
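Using the column names from the parsing sketch above, that is just a rename and a re-index:

# Rename 'tag' to something more descriptive and put the columns in a
# natural reading order.
df = df.rename(columns={'tag': 'comedian'})
df = df[['comedian', 'title', 'link', 'summary']]
df.head()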
That's enough for today.
In the next tutorial, we will extract the transcript text from each link in our dataframe.