GOODREADS BOOK DATA

Growing up, I always loved books. My dream career was to be someone who got to read books for a living. However, I quickly realized that this was an unrealistic, near-impossible dream, so I pivoted to wanting to become a writer.

Although that aspiration faded over time, my love for reading never did. That is why I am fascinated by the dataset I have selected for this data analysis project.

The selected dataset has information on approximately 100,000 books from Goodreads – an online book database and social cataloguing website where users can get information on books, share reviews, and engage with other book lovers.

The original dataset has information on approximately 5,000,000 books, split into 23 Excel files, with each file containing 100,000 rows of data. However, for the purpose of this analysis, I have decided to work with the first file, containing data on only the first 100,000 books.

The dataset includes book information such as the name of the book, number of pages, the day and year it was published, author, publisher, number of reviews on Goodreads, and the rating the book currently has on Goodreads. The book rating is determined by Goodreads users who have read the book and rate it, based on their preferences, on a scale from 1 to 5. The rating system has been adjusted in the last few years to account for fairness and relevancy: any ratings that feature comments that are irrelevant or off-topic to the book are deleted and do not contribute to the overall rating displayed on Goodreads.

My goal with this dataset was to analyze the relationships between the variables in the book data. I did not have many hypotheses as to which variables might be related, whether through correlation or causation. However, I was curious as to how some variables might affect others.

The variables of interest within this dataset were Number of Pages, Counts of Reviews, Publish Year, Publisher, Author, and Rating. More specifically, my goal was to see which of these variables affect the rating of a book.
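
As a rough illustration of what checking whether a numeric variable is related to rating can look like, the sketch below computes a Pearson correlation coefficient in plain Python. The analysis in this report was carried out in R and Excel, so this is only an illustrative stand-in; the function name `pearson_r` and the sample values are hypothetical and are not taken from the Goodreads file.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (numerator of the coefficient).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / sqrt(var_x * var_y)


# Made-up illustrative values, NOT rows from the Goodreads dataset.
pages = [120, 250, 310, 480, 900]
ratings = [3.6, 3.9, 4.1, 3.8, 4.2]

r = pearson_r(pages, ratings)
print(f"Pearson r between pages and rating: {r:.2f}")
```

A value near +1 or -1 suggests a strong linear association, and a value near 0 suggests little linear association. In neither case does the coefficient show that one variable causes the other.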

Using R and Excel, I have analyzed the various relationships within the dataset and present my findings in this report.