Welcome to WuJiGu Developer Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - How to get the exact frequency for trigram in text data?

I would like to know how to get the exact frequency of trigrams. I think the functions I used measure something closer to "importance": similar to frequency, but not the same.

To be clear, a trigram is 3 words in a row. Punctuation should not affect the trigram unit; at least, I don't want it to.

My definition of frequency is the number of comments in which the trigram appears at least once.
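To make this definition concrete, here is a minimal sketch using only the standard library. The function names `trigrams` and `document_frequency` and the sample comments are made up for illustration, and tokenizing with `\w+` is an assumption that simply ignores punctuation.

```python
import re
from collections import Counter

def trigrams(text):
    """Return the set of distinct word trigrams in one comment."""
    # \w+ keeps word characters only, so punctuation is ignored
    words = re.findall(r"\w+", text.lower())
    return set(zip(words, words[1:], words[2:]))

def document_frequency(comments):
    """Count, for each trigram, the number of comments containing it at least once."""
    counts = Counter()
    for comment in comments:
        # a set per comment means a repeated trigram still counts once
        counts.update(trigrams(comment))
    return counts

# hypothetical sample data
comments = [
    "Très bon état, très bon état !",
    "Colis reçu en très bon état.",
    "Bon rapport qualité prix",
]
doc_freq = document_frequency(comments)
print(doc_freq[("très", "bon", "état")])  # 2
```

The first comment repeats the trigram, yet it contributes only 1 to the count, which is what "at least once" asks for.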

Here's how I built my dataset by web scraping:

import re
import json
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import datetime
import time
import random


root_url = 'https://fr.trustpilot.com/review/www.gammvert.fr'
urls = [ '{root}?page={i}'.format(root=root_url, i=i) for i in range(1,807) ]

comms = []
notes = []
dates = []

for url in urls: 
    results = requests.get(url)

    time.sleep(20)

    soup = BeautifulSoup(results.text, "html.parser")

    commentary = soup.find_all('section', class_='review__content')

    for container in commentary:

        try:
            comm  = container.find('p', class_ = 'review-content__text').text.strip()

        except AttributeError:
            # find() returned None (no review text): fall back to the link element
            comm = container.find('a', class_ = 'link link--large link--dark').text.strip()

        comms.append(comm)

        note = container.find('div', class_ = 'star-rating star-rating--medium').find('img')['alt']
        notes.append(note)

        date_tag = container.div.div.find("div", class_="review-content-header__dates")
        date = json.loads(re.search(r"({.*})", str(date_tag)).group(1))["publishedDate"]

        dates.append(date)

data = pd.DataFrame({
    'comms' : comms,
    'notes' : notes,
    'dates' : dates
    })

# remove line breaks inside comments
data['comms'] = data['comms'].str.replace('\n', '', regex=False)

data['dates'] = pd.to_datetime(data['dates']).dt.date
data['dates'] = pd.to_datetime(data['dates'])

data.to_csv('file.csv', sep=';', index=False)

Here's the function I used to obtain comms_clean:

import nltk

# tokenizer and stop_words are defined in the EDIT at the end of the question
def clean_text(text):
    text = tokenizer.tokenize(text)
    text = nltk.pos_tag(text)
    # keep nouns only
    text = [word for word, pos in text if pos in ('NN', 'NNP', 'NNS', 'NNPS')]
    text = [word for word in text if word not in stop_words]
    # drop words of one or two letters
    final_text = ' '.join(w for w in text if len(w) > 2)
    return final_text

data['comms_clean'] = data['comms'].apply(lambda x : clean_text(x))

data['month'] = data.dates.dt.strftime('%Y-%m')

Here are some rows of my dataset:

[image: database]

And here is the function I used to obtain the trigram frequencies in my dataset:

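from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_gram(corpus, ngram_range, n=None):
    # stop_words is a set; recent scikit-learn versions expect a list here
    vec = CountVectorizer(ngram_range=ngram_range, stop_words=list(stop_words)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    # column sums: total occurrences of each n-gram over the whole corpus
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]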

def process(corpus):
    corpus = pd.DataFrame(corpus, columns= ['Text', 'count']).sort_values('count', ascending = True)
    return corpus

Here's the result of this code:

trigram = get_top_n_gram(data['comms_clean'], (3,3), 10)

trigram = process(trigram)
trigram.sort_values('count', ascending=False, inplace=True)

trigram.head(10)

[image: trigram]

Let me show how the results seem inconsistent, though only by a small amount. I will check the top trigrams from the picture above:

df = data[data['comms_clean'].str.contains('très bon état',regex=False, case=False, na=False)]

df.shape

(150, 5)



df = data[data['comms_clean'].str.contains('rapport qualité prix',regex=False, case=False, na=False)]

df.shape

(148, 5)



df = data[data['comms_clean'].str.contains('très bien passé',regex=False, case=False, na=False)]

df.shape

(129, 5)

So my function gives:

146
143
114

and when I count the comments that contain each trigram, I get:

150
148
129

That's not far off, but I would rather have the exact number.

So my question is: how can I get the exact frequency of a trigram, rather than some kind of importance score? The importance is fine, don't get me wrong, but I would also like to know the true count.
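One possible source of such small gaps (an assumption, not verified against this data) is that the two checks do not measure the same thing: `str.contains(..., regex=False)` is a substring test on the raw string, while `CountVectorizer` counts token n-grams after its own lowercasing, tokenization and stop-word removal, and it sums every occurrence rather than counting comments. The sketch below uses plain Python and made-up strings to show how a substring test and a token-level test can disagree; both helper names are hypothetical.

```python
import re

def contains_substring(text, phrase):
    # mimics a case-insensitive, non-regex substring test
    return phrase.lower() in text.lower()

def contains_token_trigram(text, phrase):
    # compares whole words: the phrase must match three consecutive tokens exactly
    words = re.findall(r"\w+", text.lower())
    target = tuple(re.findall(r"\w+", phrase.lower()))
    return target in set(zip(words, words[1:], words[2:]))

# hypothetical examples
phrase = "très bon état"
samples = ["produit en très bon état", "plantes en très bon états"]
for text in samples:
    print(contains_substring(text, phrase), contains_token_trigram(text, phrase))
# True True
# True False
```

If the goal is a per-comment count with scikit-learn itself, `CountVectorizer` accepts `binary=True`, which sets every non-zero count to 1 so that the column sums become the number of documents containing each n-gram.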

I tried this:

from collections import Counter
from nltk.util import ngrams

# note: the range starts at 1, so row 0 is skipped
for i in range(1,16120):
    Counter(ngrams(data['comms_clean'][i].split(), 3))

But I cannot work out how to combine all the counters produced in the loop.
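Counters can be merged as the loop runs with `Counter.update`, which adds counts instead of replacing them. A minimal sketch, with a made-up list standing in for `data['comms_clean']` and `zip` standing in for `nltk.util.ngrams` so that the example has no dependencies:

```python
from collections import Counter

# hypothetical stand-in for data['comms_clean']
cleaned = ["livraison rapide colis soigné", "livraison rapide colis abîmé", ""]

total = Counter()        # every occurrence of each trigram
per_comment = Counter()  # number of comments containing each trigram

for text in cleaned:
    words = text.split()
    grams = list(zip(words, words[1:], words[2:]))
    total.update(grams)             # adds counts to the running total
    per_comment.update(set(grams))  # set(): at most one count per comment

print(per_comment[("livraison", "rapide", "colis")])  # 2
```

Iterating over the values directly also avoids hard-coding the number of rows or skipping the first one.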

Thank you.

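EDIT:

import nltk
from nltk.corpus import stopwords
from spacy.lang.fr import French  # assumed source of French (spaCy v2 API)

stop_words = set(stopwords.words('french'))
stop_words.update(("Gamm", "gamm"))

# \w+ matches runs of word characters (the backslash was missing)
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
lemmatizer = French.Defaults.create_lemmatizer()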

