Problem Statement

As a data scientist for the marketing department at Reddit, I need to find the most predictive keywords and/or phrases to accurately classify the dating advice and relationship advice subreddit pages so we can use them to determine which ads should populate each page. Because this is a classification problem, we'll use Logistic Regression & Bayes models. Misclassifications in this case are fairly harmless, so I will use the accuracy score and a baseline of 63.3% to rate success. Using TfidfVectorizer, I'll get the feature importance to determine which words have the highest predictive power for the target variable. If successful, this model could also be used to target other pages that have a similar frequency of the same words and phrases.

Data Collection

See the dating-advice-scrape and relationship-advice-scrape notebooks for this part.

After turning all of the scrapes into DataFrames, I saved them as CSVs, which can be found in the dataset folder of this repo.

Data Cleaning and EDA

  • dropped rows with a null selftext column because those rows are useless to me.
  • combined the title and selftext columns into one new all_text column.
  • examined distributions of word counts for the title and selftext columns per post and compared the two subreddit pages.
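The cleaning steps above can be sketched in pandas roughly as follows. This is only an illustration on made-up rows: the column names `title`, `selftext`, and `subreddit` are assumed from Reddit's post fields, and the two word-count column names are hypothetical.

```python
import pandas as pd

# Toy stand-in for the scraped posts (made-up rows, assumed column names).
posts = pd.DataFrame({
    "subreddit": ["dating_advice", "relationship_advice",
                  "dating_advice", "relationship_advice"],
    "title": ["First date tips", "Trust issues",
              "Texting etiquette", "Moving in together"],
    "selftext": ["Where should we go?", None,
                 "How long should I wait to reply?", "Is six months too soon?"],
})

# Drop rows whose selftext is null.
posts = posts.dropna(subset=["selftext"]).reset_index(drop=True)

# Combine title and selftext into a single all_text column.
posts["all_text"] = posts["title"] + " " + posts["selftext"]

# Word counts per post (whitespace-separated tokens).
posts["title_word_count"] = posts["title"].str.split().str.len()
posts["selftext_word_count"] = posts["selftext"].str.split().str.len()

# Average word counts per subreddit, for comparing the two pages.
summary = posts.groupby("subreddit")[
    ["title_word_count", "selftext_word_count"]
].mean()
print(summary)
```

Histograms of the two count columns, split by subreddit, would give the distribution comparison described in the last bullet.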

Preprocessing and Modeling

Found the baseline accuracy score of 0.633, which means that if I always pick the value that occurs most often, I will be right 63.3% of the time.
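As a rough illustration of how a majority-class baseline is computed, here is a small sketch. The label counts are hypothetical and chosen only to reproduce the 63.3% figure; which subreddit is actually the majority class is not stated here, so that choice is an assumption too.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Hypothetical class balance (assumption, not the real post counts).
labels = ["relationship_advice"] * 633 + ["dating_advice"] * 367

label, score = baseline_accuracy(labels)
print(label, round(score, 3))
```

Any model has to beat this number on the test set to be worth using.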

First attempt: logistic regression model with default CountVectorizer parameters. Train score: 99 | test 75 | cross val 74

Second attempt: tried CountVectorizer with stemming preprocessing on the first set of scraping, pretty bad score with high variance. Train 99%, test 72%

  • tried to decrease max features and the score got a lot worse
  • tried with lemmatizer preprocessing instead and the test score went up to 74%

Simply increasing the data and stratifying y in my train/test split increased my cvec test score to 81 and cross val to 80. Adding 2 parameters to my CountVectorizer helped a lot. A min_df of 3 and ngram_range of (1,2) increased my test score to 83.2 and cross val to 82.3. However, these scores disappeared.
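A minimal sketch of that setup, assuming scikit-learn and a tiny made-up corpus in place of the scraped posts (so the scores it prints say nothing about the real data). The `random_state`, `max_iter`, and `cv=5` values are assumptions added for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Made-up posts standing in for the two subreddits (0 = dating, 1 = relationship).
dating = [
    "first date tonight what should i wear",
    "she matched with me on the app",
    "how do i ask her out on a date",
    "nervous about my first date",
] * 10
relationship = [
    "my boyfriend and i keep arguing",
    "my girlfriend of two years wants space",
    "husband forgot our anniversary again",
    "my boyfriend never texts back",
] * 10
texts = dating + relationship
y = [0] * len(dating) + [1] * len(relationship)

# Stratify on y so both splits keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    texts, y, stratify=y, random_state=42
)

# min_df=3 drops terms seen in fewer than 3 posts; ngram_range=(1, 2)
# counts both single words and bigrams.
model = make_pipeline(
    CountVectorizer(min_df=3, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
cv_score = cross_val_score(model, X_train, y_train, cv=5).mean()
print(train_score, test_score, cv_score)
```

Comparing the train score against the test and cross-val scores is what exposes the high-variance (overfitting) problem noted in the attempts above.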

I think Tfidf worked the best to decrease my overfitting (high variance) problem because I customized the stop words to take away only the ones that were really too frequent to be predictive. This was a success; however, with more time I probably could've tweaked them a bit more to raise all of the scores. Looking at both single words and words in groups of two (bigrams) was the best param that gridsearch suggested; however, all of my top most predictive words ended up being unigrams. My original list of features had plenty of gibberish words and typos. Setting the minimum # of times a word was required to show up to 2 helped get rid of these. Gridsearch also suggested a 90% max df rate, which helped to get rid of oversaturated words as well. Finally, setting max features to 5000 cut down my columns to about a quarter of what they were, to focus on just the most frequently used words of what was left.

Conclusion and Recommendations

Even though I would like to have higher train and test scores, I was able to successfully lower the variance, and there are definitely several words that have high predictive power, so I think the model is ready to launch a test. If advertising engagement increases, the same keywords could be used to find other potentially lucrative pages. I found it interesting that taking out the overly used words helped with overfitting, but brought the accuracy score down. I think there is probably still room to play around with the parameters of the Tfidf Vectorizer to see if different stop words make a different or


Used Reddit's API, the requests library, and BeautifulSoup to scrape posts from two subreddits, Dating Advice & Relationship Advice, and trained a binary classification model to predict which subreddit a given post came from.
