Web-Mining Full Project Report

2020-06-26


Looking back, my first NLP project was a great learning experience: it walked me through the whole natural language processing workflow and piqued my interest in data science.

Abstract

Over the past year, PUBG has become one of the most popular games in the world, and at the same time it has received thousands of negative comments. I therefore started this project to help its developers and operators extract valuable information from that criticism. The goal is to draw suggestions from the critical reviews. I then manually label players to separate clear-minded supporters from critics, and train classification models to discover more critics with a firm stand. Finally, I propose some future improvements: things I could have done better, which also serve as a reference for my next project.

Background

PUBG, one of the most popular shooting games on Steam, is published by PUBG Corporation, a subsidiary of the South Korean video game company Bluehole. The game entered early access on Steam in March 2017 and was fully released for Windows in December 2017. It is one of the best-selling games of all time, with over fifty million copies sold across all platforms by June 2018. In addition, the Windows version holds a peak concurrent player count of over three million on Steam, an all-time high on the platform. PUBG has received thousands of reviews from players, who report that many bugs and problems still exist in the game. I therefore decided to research how to extract useful information from those reviews.

Purpose

In this project, I try to extract useful information from reviews crawled from the Steam review community and to gain a deeper understanding of the game by analyzing them. I hope the analysis will help game developers identify the problems that most players are concerned about.

Data Preparation

The data comes from Steam. I use Selenium to scrape several pages of reviews and CSS selectors to extract the attributes: the review date, the review text, whether the user recommends the game, how many hours the user has spent on the game, and how many games the user holds in the account.
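Once the raw strings have been scraped, they still need to be turned into typed fields. The following is a minimal sketch of that step, not the code used in the project. The text formats ("1,234.5 hrs on record", "12 products in account", "Recommended" / "Not Recommended") and all function and field names are assumptions made for illustration.

```python
import re

def parse_hours(text):
    """Return hours played from text like '1,234.5 hrs on record', else None."""
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*hrs", text)
    return float(match.group(1).replace(",", "")) if match else None

def parse_products(text):
    """Return the product count from text like '12 products in account', else None."""
    match = re.search(r"(\d[\d,]*)\s*products?", text)
    return int(match.group(1).replace(",", "")) if match else None

def parse_recommended(text):
    """Map the label text to a boolean: True only for 'Recommended'."""
    return text.strip().lower() == "recommended"

# Hypothetical raw strings, as they might come back from the scraper.
raw = {
    "hours": "1,234.5 hrs on record",
    "products": "12 products in account",
    "label": "Not Recommended",
    "review": "Too many cheaters.",
}

record = {
    "spent_hours": parse_hours(raw["hours"]),
    "hold_products": parse_products(raw["products"]),
    "recommended": parse_recommended(raw["label"]),
    "review": raw["review"],
}
```

Returning `None` for unparseable text keeps bad rows visible so they can be filtered out later instead of silently becoming zeros.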

Data Description

The dataset contains 2,070 reviews. It has three independent variables (spent hours, held products, and user reviews) and one dependent variable, “recommended or not”. Spent hours is how many hours a player has spent on the game. Held products is how many games the user holds in the account. User reviews are what players wrote on the website. Recommended or not is whether the player recommends the game. A sample of the data is presented below:
[Figure: sample of the data]

On the platform, users can easily write feedback about the game and label it as recommended or not recommended for other players. Steam also lists information about each user, such as hours played, the number of Steam products in the account, and other players’ reactions to the review. By collecting all of this review data, game companies and developers get an overall picture of how the game is doing on the platform. The user interface is shown below:
[Figure: Steam review user interface]

Data Cleaning

The reviews contain symbols and stop words that make it hard to find meaningful information, so I remove this noise, convert all characters to lower case, and split the reviews into words.
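The cleaning steps can be sketched as follows. This is an illustration only: the stop-word set is a tiny hand-picked assumption, whereas a real run would normally use a full list from an NLP library.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the list used in the project).
STOP_WORDS = {"the", "a", "an", "is", "it", "to", "and", "of", "this", "for", "but"}

def clean_review(text):
    """Lower-case the text, drop symbols and digits, split into words, remove stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and whitespace
    return [word for word in text.split() if word not in STOP_WORDS]

tokens = clean_review("This game is GREAT!!! 10/10")
```

Lower-casing before filtering matters here, because the stop-word set only contains lower-case entries.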

Exploratory Data Analysis

Next, I did some simple exploratory data analysis to look for correlations among the variables. First, I made a bar chart of the relation between spent hours and recommendations.
[Figure: bar chart of spent hours vs. recommendation]
As the bar chart above shows, people who have spent more than 1,900 hours on PUBG tend to recommend the game. In other words, the game is very popular with veteran players, so their reviews are more meaningful to the operator.
I then made a scatter plot to explore the relation between spent hours and held products.

[Figure: scatter plot of spent hours vs. held products]

Modeling

In this part, I build models with both unsupervised and supervised learning algorithms.

Unsupervised learning

VADER

VADER checks whether each word in a piece of text is present in its sentiment lexicon. The positive, neutral, and negative scores are derived from the ratings of those words and represent the proportion of the text that falls into each category. The final metric, the compound score, is the sum of all the lexicon ratings, normalized to the range -1 to 1 using some heuristics. In my project, I apply VADER to every review to gauge the user’s sentiment, and group the negative-sentiment reviews for the company to inspect.
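The normalization of the compound score can be sketched in a few lines. The formula below, x / sqrt(x² + α) with α = 15, follows the reference VADER implementation as I understand it. The ±0.05 cut-offs are the commonly recommended convention and are an assumption here, since the thresholds used in the project are not stated. The function names are hypothetical.

```python
import math

def normalize(score_sum, alpha=15.0):
    """Squash a sum of lexicon ratings into the open interval (-1, 1)."""
    return score_sum / math.sqrt(score_sum * score_sum + alpha)

def label(compound, threshold=0.05):
    """Turn a compound score into a sentiment category (thresholds are assumed)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical sums of word ratings for four reviews.
for total in (-6.0, 0.0, 0.4, 6.0):
    compound = normalize(total)
    print(f"sum={total:+.1f} compound={compound:+.4f} -> {label(compound)}")
```

Because the compound score is a single number, a review that mixes praise and complaints can land near zero and be labeled neutral.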

[Figure: VADER classification results]

From the visualized result shown above, the reviews fall into three categories: positive, neutral, and negative sentiment. Among positive-sentiment reviews, 77.45% are correctly classified. Among neutral-sentiment reviews, 75.33% are recommended and 24.66% are not recommended. Among negative-sentiment reviews, only 26.38% are correctly classified. The model therefore performs poorly on negative sentiment, so I looked for the reasons. Digging into the reviews, I found that they are often full of sarcasm, and the model cannot recognize the emotion behind such sentences. For example: “This game used to be good, but now it is just game for cheaters.” I concluded that the labels in the review system do not truly reflect players’ attitudes, and that game developers could be misled by the label data.
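Percentages like the ones above come from cross-tabulating the VADER category against the Steam label. A minimal sketch of that tabulation follows. The rows are made-up toy data, not the project data, and the function name is hypothetical.

```python
from collections import Counter, defaultdict

def recommended_share(rows):
    """For each sentiment category, return the share of reviews labeled as recommended.

    rows is an iterable of (sentiment, recommended) pairs.
    """
    counts = defaultdict(Counter)
    for sentiment, recommended in rows:
        counts[sentiment][bool(recommended)] += 1
    return {s: c[True] / (c[True] + c[False]) for s, c in counts.items()}

# Toy data: each category has 3 recommended and 1 not-recommended review.
rows = (
    [("positive", True)] * 3 + [("positive", False)]
    + [("neutral", True)] * 3 + [("neutral", False)]
    + [("negative", True)] * 3 + [("negative", False)]
)
shares = recommended_share(rows)
```

A high recommended share inside the negative-sentiment group is exactly the mismatch between text and label discussed above.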

LDA

Because of the massive volume of reviews, it is difficult for operators to analyze them one by one. So I use LDA (Latent Dirichlet Allocation) to generate topics for the documents. The model reduces the dimensionality of the word space, so it works more efficiently than a plain bag-of-words model and is less prone to overfitting. I visualized the top three topics below:

[Figure: top three LDA topics]
In these pictures, I found more negative words than positive ones. The words indicate that the game’s servers are bad, that there are mountains of cheaters, and that the game has not improved over time. This may seem confusing: why are there more negative words when, in the EDA stage, I found that more people recommended the game? With VADER, I had found that 75.33% of the reviews with neutral sentiment are labeled as recommended. Reading some of them, I found that these comments contain a lot of negative words, which explains the many negative words here. From the LDA and VADER results, I concluded that I cannot simply classify people by the labels on their comments. I therefore planned to select clear-minded comments to train classification models that find more players with a firm stand.
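To make the topic-generation step more concrete, here is a from-scratch sketch of LDA with collapsed Gibbs sampling. It is not the library implementation used in the project and is far too slow for real data. The hyperparameters `alpha` and `beta`, the toy documents, and all names are assumptions for illustration.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0, topn=3):
    """Fit LDA by collapsed Gibbs sampling; return top words per topic and doc-topic counts."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]           # tokens of doc d assigned to topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]  # times word w is assigned to topic k
    topic_total = [0] * n_topics
    assign = []
    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        row = []
        for w in doc:
            k = rng.randrange(n_topics)
            row.append(k)
            doc_topic[d][k] += 1
            topic_word[k][index[w]] += 1
            topic_total[k] += 1
        assign.append(row)
    # Resample each token's topic conditioned on all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, wi = assign[d][i], index[w]
                doc_topic[d][k] -= 1
                topic_word[k][wi] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wi] + beta) / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wi] += 1
                topic_total[k] += 1
    topics = [
        sorted(vocab, key=lambda w: -topic_word[k][index[w]])[:topn]
        for k in range(n_topics)
    ]
    return topics, doc_topic

# Toy cleaned reviews (made up).
docs = [
    ["server", "lag", "server", "crash"],
    ["cheaters", "hackers", "cheaters", "ban"],
    ["lag", "crash", "server"],
    ["hackers", "ban", "cheaters"],
]
topics, doc_topic = lda_gibbs(docs, n_topics=2)
```

Each document ends up as a short vector of topic counts instead of a long vector of word counts, which is the dimensionality reduction described above.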

Supervised learning

Classification models are applied to detect positive and negative users. Because a great many reviews in other sources are not labeled, it is necessary to classify a large number of reviews automatically and get directions for later improvement, especially from the negative ones.

Multinomial Naïve Bayes

Multinomial Naïve Bayes is suited to text classification with discrete features such as word counts; in practice it also works with fractional features such as TF-IDF weights.
As the result below shows, this model performs well: precision is 0.84 and recall is 0.82.
[Figure: Multinomial Naïve Bayes results]
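The idea behind Multinomial Naïve Bayes can be sketched from scratch on word counts. This is an illustration under assumptions (Laplace smoothing with `alpha=1.0`, toy reviews, hypothetical function names), not the library model that produced the scores above.

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from tokenized docs."""
    vocab = {w for doc in docs for w in doc}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)
    model = {"vocab": vocab, "log_prior": {}, "log_lik": {}}
    for y, n in class_counts.items():
        model["log_prior"][y] = math.log(n / len(docs))
        total = sum(word_counts[y].values()) + alpha * len(vocab)
        model["log_lik"][y] = {
            w: math.log((word_counts[y][w] + alpha) / total) for w in vocab
        }
    return model

def predict_mnb(model, doc):
    """Return the class with the highest posterior log score; unseen words are ignored."""
    scores = {}
    for y, prior in model["log_prior"].items():
        scores[y] = prior + sum(
            model["log_lik"][y][w] for w in doc if w in model["vocab"]
        )
    return max(scores, key=scores.get)

# Toy training set: 1 = recommended, 0 = not recommended.
docs = [["great", "fun"], ["fun", "friends"], ["cheaters", "lag"], ["lag", "bugs", "cheaters"]]
labels = [1, 1, 0, 0]
model = train_mnb(docs, labels)
```

Calling `predict_mnb(model, ["fun", "game"])` scores the review under each class; the word "game" never appeared in training, so it is skipped.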

Support Vector Classification

A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification. A good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class. As the picture below shows, the model reaches a precision of 0.85 and a recall of 0.83.

[Figure: Support Vector Classification results]
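The "largest distance to the nearest training points" is the geometric margin. The sketch below only evaluates the margin of a given hyperplane on toy 2-D points; it does not train an SVM. The weights, bias, and points are made-up assumptions.

```python
import math

def decision(w, b, x):
    """Value of the linear decision function w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def signed_distance(w, b, x):
    """Signed Euclidean distance from point x to the hyperplane w.x + b = 0."""
    return decision(w, b, x) / math.sqrt(sum(wi * wi for wi in w))

def margin(w, b, points, labels):
    """Smallest label-signed distance; positive only if every point is on its correct side."""
    return min(y * signed_distance(w, b, x) for x, y in zip(points, labels))

# Toy 2-D data with labels in {-1, +1} and a hand-picked hyperplane x1 + x2 = 3.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 3.0), (2.0, 3.0)]
labels = [-1, -1, 1, 1]
w, b = (1.0, 1.0), -3.0
m = margin(w, b, points, labels)
```

Training an SVM amounts to searching for the `w` and `b` that make this margin as large as possible, usually with some tolerance for misclassified points.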

Logistic Regression

Logistic Regression models the log-odds of a recommendation as a linear function of the document-term matrix and classifies reviews into two categories based on the TF-IDF weights of the bag-of-words.
As the picture below shows, it reaches 84% precision and 80% recall. In summary, MNB, SVM, and Logistic Regression all perform well.

[Figure: Logistic Regression results]
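The scoring step of logistic regression, and the precision and recall figures quoted for the three models, can be sketched as follows. The weights, bias, and TF-IDF features are made up for illustration, and the 0.5 decision threshold is an assumption; none of these come from the trained model.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of 'recommended' for a review given as {term: tfidf_weight}."""
    z = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    return sigmoid(z)

def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical learned weights and four toy reviews.
weights, bias = {"fun": 2.0, "cheaters": -3.0}, 0.5
reviews = [{"fun": 0.8}, {"cheaters": 0.9}, {"fun": 0.3, "cheaters": 0.6}, {"lag": 1.0}]
y_true = [1, 0, 1, 0]
y_pred = [1 if predict_proba(weights, bias, r) >= 0.5 else 0 for r in reviews]
precision, recall = precision_recall(y_true, y_pred)
```

Terms without a learned weight, such as "lag" here, contribute nothing, so that review is scored by the bias alone.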

CNN

A CNN model is introduced because it has advantages in classifying imbalanced data. Doc2Vec and Word2Vec are trained on a large number of unlabeled reviews to generate a fixed-weight word matrix that maps words in the embedding phase. Doc2Vec differs in that it predicts a word by also concatenating the paragraph vector D, which is shared within the paragraph. The word vectors trained by Doc2Vec perform better when checking the similarity of words.
The figure shows the top 5 words most similar to ‘play’:

[Figure: top 5 words most similar to ‘play’]
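Word similarity in an embedding space is typically measured with cosine similarity. The sketch below ranks neighbours of a word using made-up 2-D vectors; the real vectors come from the trained Word2Vec or Doc2Vec model and have many more dimensions. The function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=5):
    """Return the topn (word, similarity) pairs closest to `word`, best first."""
    target = vectors[word]
    scored = [(other, cosine(target, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: -pair[1])[:topn]

# Toy embeddings, invented for illustration.
vectors = {
    "play": [1.0, 0.0],
    "playing": [0.9, 0.1],
    "game": [0.6, 0.4],
    "cheater": [0.0, 1.0],
}
neighbours = most_similar("play", vectors, topn=2)
```

Cosine similarity ignores vector length, so frequent and rare words are compared by direction only.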
The CNN model uses three filter sizes (bigram, trigram, and quadrigram, i.e. windows of 2, 3, and 4 words), with 64 filters of each size in the convolutional layer.

[Figure: CNN model results]
In summary, the CNN model that uses a pre-trained embedding matrix performs better, as the picture above shows.
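The convolutional layer with 2-, 3-, and 4-word filters can be sketched without a deep learning framework. Each filter slides over the embedded review and keeps its strongest response (max-over-time pooling). The 64 filters per size come from the text; the random filter weights, the embedding dimension of 4, the ReLU activation, and all names are assumptions for illustration, and no training is performed.

```python
import random

def conv_max(embedded, kernel):
    """Slide one filter over a sequence of word vectors and return its max ReLU activation."""
    width = len(kernel)
    best = 0.0  # ReLU floor; also the result when the review is shorter than the filter
    for start in range(len(embedded) - width + 1):
        total = 0.0
        for offset in range(width):
            total += sum(w * x for w, x in zip(kernel[offset], embedded[start + offset]))
        best = max(best, total)
    return best

def extract_features(embedded, kernels_by_width):
    """Concatenate the pooled outputs of every filter, ordered by filter width."""
    return [
        conv_max(embedded, kernel)
        for width in sorted(kernels_by_width)
        for kernel in kernels_by_width[width]
    ]

rng = random.Random(0)
dim, n_filters = 4, 64

# Untrained random filters: 64 each for window sizes 2, 3, and 4.
kernels_by_width = {
    width: [
        [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(width)]
        for _ in range(n_filters)
    ]
    for width in (2, 3, 4)
}
# A toy review of 6 words, each already mapped to a 4-dimensional vector.
embedded = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(6)]
features = extract_features(embedded, kernels_by_width)
```

The resulting 192-value vector (3 sizes × 64 filters) is what a dense output layer would then classify.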

Word Interpretation

To get insights from the negative-sentiment reviews, I use TF-IDF weights and Word2Vec to dig content out of the reviews.
As the picture shows, I picked the ten highest-ranked words. Some words, such as bad, time, and money, are keywords that might point to potential problems.
Next, I use Word2Vec to find the words most similar to these top 10 words, looking for correlations between the targeted words.
As a result, the problems found are as follows:

A lot of game issues
Serious delays caused by server lag
Calls for the devs (developers) to fix bugs
Vehicle issues in the game
A feeling of wasted money
A lot of cheaters
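The keyword-ranking step that feeds this list can be sketched with a small TF-IDF implementation. The smoothed IDF formula, log((1 + N) / (1 + df)) + 1, is one common variant and is an assumption here, as are the toy reviews and the function name.

```python
import math
from collections import Counter

def tfidf_top_terms(docs, topn=10):
    """Rank terms by their TF-IDF weight summed over all tokenized documents."""
    n_docs = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))
    scores = Counter()
    for doc in docs:
        if not doc:
            continue  # skip empty reviews to avoid dividing by zero
        for word, count in Counter(doc).items():
            idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
            scores[word] += (count / len(doc)) * idf
    return scores.most_common(topn)

# Toy negative reviews after cleaning (made up).
docs = [
    ["bad", "server", "lag"],
    ["bad", "cheaters"],
    ["bad", "money", "waste"],
    ["cheaters", "everywhere"],
]
top = tfidf_top_terms(docs, topn=3)
```

Words that appear in nearly every review are down-weighted by the IDF factor, so the ranking favours terms that are both frequent and distinctive.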

Conclusion

In unsupervised learning, I found that the labels in the dataset cannot accurately describe players’ attitudes. Since the purpose of my project is to extract valuable information from critics, it is very important to train a model that finds critics with a firm stand. I then trained several classification models, including Multinomial Naïve Bayes, SVM, Logistic Regression, and CNN, and found that CNN performs best among them. Finally, I interpreted the words after classification and, based on the results, make the following suggestions to the company:

Wipe out cheaters from the game, since they hugely impact the user’s game experience.
Give users more support when they run into game issues.
Put more effort into optimizing the game and fixing existing bugs.
Handle micro-transactions cautiously and keep them reasonable for users.

Future Improvement

There are still some improvements I can make in the future. For example, I can try the Steam platform API to collect the data, and I can improve VADER by updating its word list. Since VADER is a lexicon- and rule-based sentiment technique, I expect to improve its performance by extending the lexicon with game-specific terms and re-running the analysis.

Copyright: Copyright is owned by the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.