A research and exploration capstone project for CIS 5190 - Applied Machine Learning by Rayyan Shaik, Esther Armao, and Helen Nguyen.

This project aims to predict the virality of a pop song given only its lyrics.

We split this project into 2 main contributions, which are outlined below:

Contribution 1: Dataset of Pop Songs, Lyrics, Streaming Data

All data (the selection of pop songs, their lyrics, and their historical streaming data) was web-scraped using Python. A link to the dataset generation repository can be found here: Music-Project-Scraper

  1. Step 1: Scraping a list of pop songs (names & artists) by year
  • Done by scraping Apple Music
  2. Step 2: Scraping lyrics and lyrics metadata per song
  • Done by scraping Genius and using Genius’ API
  3. Step 3: Scraping weekly streaming data (global & US) per song
  • Done by scraping kworb.net for Spotify streaming data on each of the songs from Step 1
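
The three steps above each produce data keyed by song, so they have to be joined into one record per song. The following is a minimal sketch of one way that join might look; it is not taken from the Music-Project-Scraper repository. The names `SongRecord`, `song_key`, and `merge_sources`, the field names, and the sample values are all hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str]


def song_key(title: str, artist: str) -> Key:
    """Normalize (title, artist) so the same song matches across sites."""
    return (title.strip().lower(), artist.strip().lower())


@dataclass
class SongRecord:
    # Hypothetical schema: one row per song after all three steps.
    title: str
    artist: str
    year: int
    lyrics: Optional[str] = None
    weekly_streams: List[int] = field(default_factory=list)


def merge_sources(
    songs: List[Tuple[str, str, int]],
    lyrics: Dict[Key, str],
    streams: Dict[Key, List[int]],
) -> List[SongRecord]:
    """Attach lyrics and streaming data to each song from the Step 1 list."""
    records = []
    for title, artist, year in songs:
        key = song_key(title, artist)
        records.append(
            SongRecord(
                title=title,
                artist=artist,
                year=year,
                lyrics=lyrics.get(key),  # None if no lyrics were found
                weekly_streams=list(streams.get(key, [])),
            )
        )
    return records


# Made-up placeholder data, not real songs or stream counts.
songs = [("Song A", "Artist X", 2020), ("Song B", "Artist Y", 2021)]
lyrics = {song_key("song a ", "ARTIST X"): "la la la"}
streams = {song_key("Song A", "Artist X"): [100, 250, 175]}
records = merge_sources(songs, lyrics, streams)
```

Normalizing the key matters because the same song is often capitalized or spaced differently on different sites; songs with no match keep `lyrics=None` and an empty stream list instead of being dropped silently.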

Further feature extraction was performed on all of the song lyrics:

  • Sentiment Analysis using NLTK’s VADER Sentiment Analyzer
  • Bag of Words using CountVectorizer from scikit-learn
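
To illustrate what the bag-of-words step produces, here is a small standard-library sketch. It is a stand-in, not the project's code: the project uses scikit-learn's CountVectorizer, and this sketch only approximates its default behavior (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary). The function names and sample lyrics are made up for the example.

```python
import re
from collections import Counter

# Tokens of two or more word characters, similar to CountVectorizer's default.
TOKEN_RE = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def bag_of_words(documents):
    """Return (vocabulary, matrix): matrix[i][j] counts vocabulary[j] in document i."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    index = {tok: i for i, tok in enumerate(vocabulary)}
    matrix = []
    for doc in tokenized:
        row = [0] * len(vocabulary)
        for tok, count in Counter(doc).items():
            row[index[tok]] = count
        matrix.append(row)
    return vocabulary, matrix


# Made-up placeholder lyrics.
sample_lyrics = ["Oh baby baby", "Baby I love you"]
vocabulary, matrix = bag_of_words(sample_lyrics)
print(vocabulary)  # ['baby', 'love', 'oh', 'you']
print(matrix)      # [[2, 0, 1, 0], [1, 1, 0, 1]]
```

Note that the single-character token "I" is dropped, which mirrors CountVectorizer's default token pattern. Each row is the count vector for one song and can be fed to a model as a fixed-length feature.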

Contribution 2: Model to evaluate an input song’s “virality”

Model Results

  • Accuracy: ~70%
  • F1 Scores: 48%, 77%, 0% for classes 0, 1 and 2 respectively
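
An overall accuracy near 70% can coexist with an F1 score of 0% for one class when that class is rare and the model never predicts it correctly. The sketch below computes accuracy and per-class F1, using F1 = 2·TP / (2·TP + FP + FN), on toy labels. The labels are invented for illustration and are not the project's predictions; the function names are also assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred, labels):
    """Return {label: F1}, with F1 = 2*TP / (2*TP + FP + FN) per class."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true positives scores 0.0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Toy labels only: class 1 dominates and class 2 is never predicted.
y_true = [0, 0, 1, 1, 1, 1, 1, 2, 1, 0]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

print(accuracy(y_true, y_pred))                 # 0.7
print(per_class_f1(y_true, y_pred, [0, 1, 2]))  # {0: 0.5, 1: 0.8, 2: 0.0}
```

In this toy case the majority class carries the accuracy, which is why per-class F1 is the more informative metric when classes are imbalanced.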

Our Neural Network Architecture

nn-architecture.png

This was a group effort, created as a capstone project for CIS 5190 - Applied Machine Learning at UPenn.