Global Trend Radar
Dev.to US tech 2026-05-08 16:05

What Reddit Can Teach Us About Women’s Watch Preferences (Python + NLP Project)

Open original article →

Analysis Results

Category
IT
Importance
62
Trend score
24
Summary
This article describes a project that uses Reddit data to analyze women's watch preferences. Using Python and natural language processing (NLP), it extracts the watch styles and features women favor from user posts and comments and surfaces trends. The aim is to produce insights useful for marketing strategy and product development in the watch industry.
Keywords
Most “what watch should I buy?” discussions online skew heavily male. A friend wanted to launch a women’s watch, so I helped with a small data analysis. In this post I’ll walk through a small but complete Python pipeline I built:

- Scrape relevant posts and comments from Reddit with no API keys
- Filter out irrelevant posts (e.g. men asking for themselves)
- Run NLP analysis: sentiment, brands, features, prices, keywords, clustering, topic modeling
- Generate visualizations and CSVs you can explore further

Everything here is powered by standard Python libraries: requests, pandas, nltk, scikit-learn, and wordcloud.

## 1. Collecting Reddit data without API keys

We didn’t use the official Reddit API; instead we hit the public JSON endpoints directly using requests. At the top of `reddit_json_scraper.py` we define search URLs across multiple subreddits. Each URL returns a JSON blob, which we fetch with a small helper.

Reddit’s listing JSON has a fairly nested structure, so we created `extract_post_data` to normalize it into a flat dictionary with the fields we actually care about (ID, subreddit, title, body, score, comment count, timestamps, etc.):

```python
def extract_post_data(post_json):
    """Extract relevant information from a Reddit post JSON."""
    try:
        data = post_json['data']
        return {
            'post_id': data.get('id', ''),
            'subreddit': data.get('subreddit', ''),
            'title': data.get('title', ''),
            'text': data.get('selftext', ''),
            'author': data.get('author', ''),
            'score': data.get('score', 0),
            'upvote_ratio': data.get('upvote_ratio', 0),
            'num_comments': data.get('num_comments', 0),
            'created_utc': data.get('created_utc', 0),
            'created_date': datetime.fromtimestamp(
                data.get('created_utc', 0)).strftime('%Y-%m-%d %H:%M:%S'),
            'url': f"https://reddit.com{data.get('permalink', '')}",
            'post_url': data.get('url', ''),
            'is_video': data.get('is_video', False),
            'over_18': data.get('over_18', False),
        }
    except (KeyError, TypeError):
        # except branch was truncated in the excerpt; skip malformed posts
        return None
```

The main collection loop simply iterates through all search URLs, fetches JSON, and appends normalized posts into a list. We also fetch comments for the most “interesting” posts, sorted by engagement (`score + num_comments`), by hitting each post’s `.json` endpoint and walking the comment tree. At the end of `main()` we save everything to CSV and run a quick text summary (brand and keyword counts, simple price stats).

## 2. Filtering: keeping posts that are really about women’s watches

Search results are noisy. Some posts mention “women” but are actually men asking for themselves. `filter_posts.py` applies a simple but effective regex filter. We flag posts that contain phrases like “as a man” or “for men”, but we keep posts that clearly talk about buying for a woman, e.g. “gift for my wife”:

```python
NON_FILTER_PATTERNS = r"(for|gift|buying|getting|choosing|help).*(mum|mom|mother|wife|girlfriend|partner|daughter|sister|woman|female|her|she)"
```

`filter_check` combines title and text, applies these patterns, and `filtered_posts_csv` writes a cleaned `filtered_posts.csv`. This becomes the starting point for our analysis.

## 3. Analyzing the conversations with WatchDataAnalyzer

The main analysis lives in `watch_analyzer.py` as a single class:

- Load the filtered posts and comments
- Combine titles, bodies, and comment text into `all_text`
- Set up NLTK and VADER sentiment

### 3.1. Light text cleaning

We remove URLs and normalize whitespace, then build a `combined_text` column per post.

### 3.2. Sentiment on posts and comments

Using VADER, we compute a compound score and label each post/comment as positive, neutral, or negative:

```python
self.posts_df['sentiment_scores'] = self.posts_df['combined_text'].apply(
    lambda x: self.sia.polarity_scores(x))
self.posts_df['sentiment_compound'] = self.posts_df['sentiment_scores'].apply(
    lambda x: x['compound'])
self.posts_df['sentiment_label'] = self.posts_df['sentiment_compound'].apply(
    lambda x: 'positive' if x > 0.05 else ('negative' if x < -0.05 else 'neutral'))
```

We do the same for comments and then plot the distribution, saving `sentiment_dist.png`.

### 3.3. Brands, price ranges, and features

We look at three practical angles:

**Brand mentions**: a curated list from Titan and Seiko to Rolex and Omega, counted across all text.

```python
def extract_brands(self):
    # Common watch brands
    brands = ['casio', 'seiko', 'citizen', 'timex', 'fossil', 'orient',
              'tissot', 'michael kors', 'daniel wellington', 'mvmt',
              'skagen', 'swatch', 'rolex', 'omega', 'cartier', 'tag heuer',
              'breitling', 'patek philippe', 'audemars piguet',
              'vacheron constantin', 'baume mercier', 'longines', 'hamilton',
              'bulova', 'invicta', 'bering', 'titan', 'fastrack', 'sonata',
              'maxima', 'hmt', 'raymond weil', 'zenith', 'iwc']
    brand_mentions = {}
    # ...
```

**Price**: regexes to capture Indian price patterns with ₹/rs/inr or “rupees”, then bucketed into budget/mid-range/premium/luxury ranges.

```python
def extract_prices(self):
    # Patterns for price extraction
    patterns = [
        r'(?:₹|rs\.?|inr)\s*(\d+(?:,\d{3})*(?:\.\d+)?)',
        r'(\d+(?:,\d{3})*(?:\.\d+)?)\s*(?:₹|rs\.?|inr)',
        r'(\d+(?:,\d{3})*(?:\.\d+)?)\s*(?:rupees|rupee)',
    ]
    all_prices = []
    for text in self.all_text:
        for pattern in patterns:
            matches = re.findall(pattern, str(text), re.IGNORECASE)
            for match in matches:
                # strip thousands separators only; int(float(...)) keeps
                # decimal amounts from being mangled
                price_str = match.replace(',', '')
                all_prices.append(int(float(price_str)))
    ranges = {
        'Budget (<₹5,000)': sum(1 for p in all_prices if p < 5000),
        'Mid-range (₹5,000-₹20,000)': sum(1 for p in all_prices if 5000 <= p < 20000),
        'Premium (₹20,000-₹1,00,000)': sum(1 for p in all_prices if 20000 <= p < 100000),
        'Luxury (>₹1,00,000)': sum(1 for p in all_prices if p >= 100000),
    }
```

**Features**: categories like size, material, movement, style, strap, and “features” (water resistance, sapphire, chronograph, etc.), each with their own keyword list.

This gives a quick picture of which brands dominate, what price bands people discuss, and which attributes come up most.

### 3.4. Keywords, clusters, and topics

Using scikit-learn:

**TF-IDF keywords**: we build a TfidfVectorizer over `combined_text` and save the top terms to `keywords_tfidf.csv`.

```python
def extract_keywords(self):
    self.preprocess_all_text()
    vectorizer = TfidfVectorizer(max_features=80, stop_words='english', min_df=2)
    texts = self.posts_df['combined_text'].fillna('').tolist()
    X = vectorizer.fit_transform(texts)
    feature_names = vectorizer.get_feature_names_out()
    scores = X.mean(axis=0).A1
    # Create keyword dataframe
    keywords_df = pd.DataFrame({
        'keyword': feature_names,
        'tfidf_score': scores,
    }).sort_values('tfidf_score', ascending=False)
```

**Clustering**: we cluster posts into 5 groups using K-Means over TF-IDF vectors, then inspect top words per cluster.

```python
def cluster_posts(self, n_clusters=5):
    """Cluster posts based on text similarity."""
    # ...
    vectorizer = TfidfVectorizer(max_features=50, stop_words='english', min_df=2)
    texts = self.posts_df['combined_text'].fillna('').tolist()
    X = vectorizer.fit_transform(texts)
    # K-means clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    clusters = kmeans.fit_predict(X)
    self.posts_df['cluster'] = clusters
```

**Topic modeling**: we run LDA/NMF over the same vectors to discover high-level themes (“budget gifts”, “small wrists and office wear”, “sporty/outdoor”, etc.).

```python
def topic_modeling(self, n_topics=5, method='lda'):
    """Perform topic modeling using LDA or NMF."""
    # ...
    vectorizer = TfidfVectorizer(max_features=100, stop_words='english',
                                 ngram_range=(1, 2), min_df=2, max_df=0.95)
    # ...
    if method.lower() == 'lda':
        model = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    else:  # NMF
        model = NMF(n_components=n_topics, random_state=42)
    # ...
    for idx, topic in enumerate(model.components_):
        top_words = [feature_names[i] for i in topic.argsort()[-10:][::-1]]
        print(f"\nTopic {idx + 1}: {', '.join(top_words)}")
```

## 4. Putting it all together

The `generate_report()` method runs the full pipeline:

- Preprocess text
- Run sentiment, brand/feature/price extraction
- Compute keywords, clusters, and topics
- Generate a word cloud and sentiment plot
- Save everything to CSVs you can open in Excel or a notebook

```python
def generate_report(self):
    # 1. Preprocess the text
    self.preprocess_all_text()
    # 2. Analyze sentiment
    sentiment_df = self.analyze_sentiment()
    # 3. Brand mentions
    brands = self.extract_brands()
    # 4. Features
    features = self.extract_features()
    # 5. Prices
    prices = self.extract_prices()
    # 6. Keywords
    keywords_df = self.extract_keywords()
    # 7. Clustering
    clusters = self.cluster_posts(n_clusters=5)
    # 8. Topic modeling
    print("\n🔍 Running topic modeling (this may take a moment)...")
    topic_model, vectorizer = self.topic_modeling(n_topics=5, method='lda')
    # 9. Visualizations
    print("\n🎨 Creating visualizations...")
    self.create_wordcloud('wordcloud.png')
    self.plot_sentiment_distribution('sentiment_dist.png')
    # Save results ...
```
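Section 1 above describes walking each search URL's listing JSON, but the flattening step itself isn't shown. As a rough, self-contained sketch of that idea (the helper name `posts_from_listing` and the reduced field set are my assumptions, not the post's actual code):

```python
def extract_post_data(post_json):
    """Trimmed-down version of the section 1 normalizer (subset of fields)."""
    data = post_json['data']
    return {
        'post_id': data.get('id', ''),
        'title': data.get('title', ''),
        'score': data.get('score', 0),
    }

def posts_from_listing(listing_json):
    """Flatten Reddit's nested listing JSON into a list of post dicts.
    Listing children of kind 't3' are link/self posts; other kinds
    (comments, subreddits) are skipped."""
    children = listing_json.get('data', {}).get('children', [])
    return [extract_post_data(c) for c in children if c.get('kind') == 't3']
```

Feeding this the JSON from any of the search URLs yields the flat rows the rest of the pipeline consumes.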
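The keep/drop logic in section 2 quotes only the NON_FILTER_PATTERNS regex. A runnable sketch of how `filter_check` could combine the two pattern sets (the FILTER_PATTERNS regex here is my reconstruction from the phrases quoted in the post, not the original code):

```python
import re

# Reconstructed from the phrases quoted in section 2; not the post's exact regex
FILTER_PATTERNS = r"\b(as a man|for men|for a man|for myself)\b"
NON_FILTER_PATTERNS = (
    r"(for|gift|buying|getting|choosing|help).*"
    r"(mum|mom|mother|wife|girlfriend|partner|daughter|sister|woman|female|her|she)"
)

def filter_check(title, text):
    """Return True if the post should be kept (really about women's watches)."""
    combined = f"{title} {text}".lower()
    if re.search(NON_FILTER_PATTERNS, combined):
        return True   # clearly buying for a woman: keep even if "for men" appears
    if re.search(FILTER_PATTERNS, combined):
        return False  # man shopping for himself: drop
    return True       # ambiguous posts survive to the analysis stage
```

Note the ordering: the keep-patterns are checked first, so “gift for men's watches for my wife” style posts aren't dropped by the broader drop-patterns.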
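The feature categories in section 3.3 are described but their keyword lists aren't shown. A hypothetical miniature of that counting step (the category names follow the post; the keyword lists here are illustrative guesses, and the real ones are surely longer):

```python
# Illustrative keyword lists; placeholders for the post's actual, longer lists
FEATURE_KEYWORDS = {
    'size': ['small', 'petite', '28mm', '32mm', '36mm'],
    'material': ['steel', 'gold', 'leather', 'ceramic'],
    'movement': ['quartz', 'automatic', 'mechanical', 'solar'],
    'features': ['water resistance', 'sapphire', 'chronograph', 'date'],
}

def count_feature_mentions(texts):
    """Count keyword hits per feature category across all text."""
    counts = {category: 0 for category in FEATURE_KEYWORDS}
    for text in texts:
        lowered = str(text).lower()
        for category, words in FEATURE_KEYWORDS.items():
            counts[category] += sum(lowered.count(word) for word in words)
    return counts
```

The same shape works for the brand counting in `extract_brands`: one flat keyword list instead of a dict of categories.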
It’s a compact example of how to go from raw Reddit JSON to structured insights about a very specific question: what are people really saying when they talk about women’s watches?