Analysis Module
This module provides different analysis functions for the AdLibAPI object, such as text and image analysis, and visualizations.
load_data Function
- AdDownloader.analysis.load_data(data_path)[source]
Load ads data from an Excel file into a dataframe given a valid path.
- Parameters:
data_path (str) – A valid path to an Excel file containing ad data.
- Returns:
A dataframe containing ads data and an additional campaign duration column.
- Return type:
pandas.DataFrame
Example:
>>> from AdDownloader.analysis import *
>>> data_path = "output/<project_name>/ads_data/<project_name>_processed_data.xlsx"
>>> data = load_data(data_path)
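The campaign duration column mentioned above can be pictured as the number of days between an ad's delivery start and stop dates. The following is a minimal sketch under that assumption, using only the standard library. The helper name `campaign_duration_days` and the `ad_delivery_stop_time` field are illustrative assumptions, not the library's implementation.

```python
from datetime import date

def campaign_duration_days(start: str, stop: str) -> int:
    """Return the number of days between two ISO dates (YYYY-MM-DD)."""
    return (date.fromisoformat(stop) - date.fromisoformat(start)).days

# Hypothetical ad record; the stop-time field name is an assumption.
ad = {"ad_delivery_start_time": "2020-10-01", "ad_delivery_stop_time": "2020-10-15"}
duration = campaign_duration_days(ad["ad_delivery_start_time"], ad["ad_delivery_stop_time"])
print(duration)  # 14
```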
preprocess Function
- AdDownloader.analysis.preprocess(text)[source]
Preprocesses the input text by tokenizing, lowercasing, removing stopwords, and lemmatizing.
- Parameters:
text (str) – The input text to be preprocessed.
- Returns:
A list of preprocessed words.
- Return type:
Example:
>>> tokens = data["ad_creative_bodies"].apply(preprocess)
>>> tokens.head(3)
0    person earli vote open soon georgia wait take ...
1    2020 help turn year around find vote earli person
2    person earli vote open soon georgia wait take ...
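To make the preprocessing steps concrete, here is a simplified standard-library sketch of tokenizing, lowercasing, and stopword removal. It is not the module's code: the function name `simple_preprocess` and the tiny `STOPWORDS` set are hypothetical, and the lemmatizing step is deliberately left out.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real pipeline would use a full one.
STOPWORDS = {"the", "a", "an", "is", "in", "to", "and", "of", "for"}

def simple_preprocess(text: str) -> list:
    """Lowercase, tokenize on letters/digits, and drop stopwords (no lemmatizing)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]

result = simple_preprocess("Early voting is open in Georgia!")
print(result)  # ['early', 'voting', 'open', 'georgia']
```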
get_word_freq Function
- AdDownloader.analysis.get_word_freq(tokens)[source]
Calculate word frequencies from a list of tokenized words using CountVectorizer.
- Parameters:
tokens (list) – A list of tokenized words, created using the preprocess function from this module.
- Returns:
A list of tuples, where each tuple contains a word and its corresponding frequency, sorted by frequency in descending order.
- Return type:
Example:
>>> freq_dist = get_word_freq(tokens)
>>> print(f"Most common 3 keywords: {freq_dist[0:3]}")
Most common 3 keywords: [('vote', 3273), ('elect', 1155), ('earli', 1125)]
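The shape of the return value (word and count pairs, most frequent first) can be reproduced with `collections.Counter`. This sketch swaps in `Counter` for CountVectorizer and assumes each preprocessed caption is a space-separated string. The name `word_frequencies` is illustrative.

```python
from collections import Counter

def word_frequencies(docs):
    """Count words across whitespace-tokenized documents, most frequent first."""
    counts = Counter(word for doc in docs for word in doc.split())
    return counts.most_common()

# Illustrative preprocessed captions (not real ad data).
docs = ["vote earli person", "vote elect", "earli vote"]
freq = word_frequencies(docs)
print(freq[:2])  # [('vote', 3), ('earli', 2)]
```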
get_sentiment Function
- AdDownloader.analysis.get_sentiment(captions)[source]
Retrieve the sentiment of the ad captions using two libraries: Vader from NLTK and TextBlob.
- Parameters:
captions (pandas.Series) – A pandas Series containing ad captions.
- Returns:
A tuple containing sentiment scores calculated using TextBlob and Vader.
- Return type:
Example:
>>> textblb_sent, nltk_sent = get_sentiment(data["ad_creative_bodies"])
>>> nltk_sent.head(3)
0    {'neg': 0.0, 'neu': 0.859, 'pos': 0.141, 'comp...
1    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
2    {'neg': 0.098, 'neu': 0.644, 'pos': 0.258, 'co...
>>> textblb_sent.head(3)
0    0.125000
1    0.112500
2    0.142857
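The Vader output is a dictionary per caption, so a common follow-up step is to turn its `compound` value into a label. The sketch below uses the commonly cited Vader cut-offs of ±0.05. Both the thresholds and the helper `label_compound` are assumptions for illustration and are not part of this module.

```python
def label_compound(scores, threshold=0.05):
    """Map a Vader-style score dict to 'positive', 'negative', or 'neutral'."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hand-written dicts in the same shape as the Vader output (values are made up).
examples = [
    {"neg": 0.0, "neu": 0.8, "pos": 0.2, "compound": 0.6},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
    {"neg": 0.3, "neu": 0.7, "pos": 0.0, "compound": -0.5},
]
labels = [label_compound(s) for s in examples]
print(labels)  # ['positive', 'neutral', 'negative']
```

With a pandas Series of such dictionaries, the same helper could be applied via `Series.apply`.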
get_topics Function
- AdDownloader.analysis.get_topics(tokens, nr_topics=3)[source]
Perform topic modeling on a given set of tokens using Latent Dirichlet Allocation (LDA). The coherence score of the model can be improved by adjusting the number of topics or the hyperparameters of LDA.
- Parameters:
- Returns:
A tuple containing the trained LDA model, the topics, the coherence score, perplexity, log-likelihood, similarity and a dataframe with a topic assigned to each ad.
- Return type:
Example:
>>> lda_model, topics, coherence_lda, perplexity, log_likelihood, avg_similarity, topics_df = get_topics(tokens, nr_topics=5)
Number of unique tokens: 435
Number of documents: 2000
Finished topic modeling for 5 topics.
Coherence: 0.71; Perplexity: 51.78; Log-Likelihood: -104762.56; Similarity: 0.07
Topic 0: ['vote', 'elect', 'paramount', 'give', 'need', 'win', 'novemb', '3rd']
Topic 1: ['vote', 'earli', 'find', 'year', 'person', 'click', 'easi', 'wait']
...
>>> topics_df.head(3)
   dom_topic  perc_contr                                     topic_keywords
0          1      0.6444  vote, earli, find, year, person, click, easi, ...
1          1      0.6567  vote, earli, find, year, person, click, easi, ...
2          4      0.9138  ballot, return, vote, today, click, home, demo...
get_topic_per_caption Function
- AdDownloader.analysis.get_topic_per_caption(lda_model, vect_text, tf_feature_names, captions=None)[source]
Extract the main topic per caption using a trained LDA model from scikit-learn.
- Parameters:
lda_model (sklearn.decomposition.LatentDirichletAllocation) – A trained LDA model from scikit-learn.
vect_text (scipy.sparse.csr_matrix) – The document-term matrix (output of the vectorizer).
tf_feature_names (numpy.ndarray or list of str) – The feature names (words) from the vectorizer.
captions (pandas.Series) – A Series containing the original captions (optional).
- Returns:
A DataFrame containing the dominant topic, percentage contribution, topic keywords, and the original caption for each document.
- Return type:
pandas.DataFrame
Example:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.decomposition import LatentDirichletAllocation
>>> vectorizer = CountVectorizer(stop_words = stop_words, max_features = 1000, min_df = 5, max_df = 0.95)
>>> vect_text = vectorizer.fit_transform(tokens) # assuming the tokens are already processed captions
>>> tf_feature_names = vectorizer.get_feature_names_out()
>>> lda_model = LatentDirichletAllocation(n_components=5, learning_method='online', random_state=0, max_iter=10, learning_decay=0.7, learning_offset=10).fit(vect_text)
>>> topics_df = get_topic_per_caption(lda_model, vect_text, tf_feature_names)
>>> topics_df.head(3)
   dom_topic  perc_contr                                     topic_keywords
0          1      0.6444  vote, earli, find, year, person, click, easi, ...
1          1      0.6567  vote, earli, find, year, person, click, easi, ...
2          4      0.9138  ballot, return, vote, today, click, home, demo...
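The dom_topic and perc_contr columns can be understood as the index and share of the largest entry in each document's topic distribution (in scikit-learn, `lda_model.transform(vect_text)` yields one such row per document). The sketch below illustrates that selection step with plain lists. The function name `dominant_topics` is hypothetical, and this is presumably, not verifiably, how the library derives the columns.

```python
def dominant_topics(doc_topic_rows):
    """Return (dominant topic index, its share) for each row of topic weights."""
    results = []
    for row in doc_topic_rows:
        total = sum(row)
        best = max(range(len(row)), key=lambda i: row[i])
        results.append((best, round(row[best] / total, 4)))
    return results

# Illustrative weights for 3 documents over 3 topics (not real model output).
doc_topic = [
    [0.125, 0.75, 0.125],
    [0.5, 0.25, 0.25],
    [1.0, 1.0, 2.0],  # unnormalized rows are normalized by their sum
]
dominant = dominant_topics(doc_topic)
print(dominant)  # [(1, 0.75), (0, 0.5), (2, 0.5)]
```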
start_text_analysis Function
- AdDownloader.analysis.start_text_analysis(text_data, column_name='ad_creative_bodies', topics=False)[source]
Perform text analysis including preprocessing, word frequency calculation, sentiment analysis, and topic modeling. If topics=False, the function returns only the tokens, frequency distribution, word cloud, and text sentiment; otherwise it additionally returns the LDA model, its coherence, and a dataframe assigning a topic to each ad.
- Parameters:
text_data (pandas.DataFrame) – A pandas DataFrame containing an ad_creative_bodies column with ad captions.
column_name (str) – The name of the column in the DataFrame that contains the text data (default is “ad_creative_bodies”).
topics (bool) – If True, topic modeling will be performed in addition to the text and sentiment analysis.
- Returns:
A tuple containing the tokens, word frequency distribution, sentiment scores using TextBlob and Vader, the LDA model and its metrics.
- Return type:
Example:
>>> # without topic modeling
>>> tokens, freq_dist, textblb_sent, nltk_sent = start_text_analysis(data)
>>> # with topic modeling
>>> tokens, freq_dist, textblb_sent, nltk_sent, lda_model, topics, coherence_lda, perplexity, log_likelihood, avg_similarity, topics_df = start_text_analysis(data, topics=True)
>>> # for output see all examples from above
transform_data_by_age Function
- AdDownloader.analysis.transform_data_by_age(data)[source]
Transform demographic data into long format, separating data by age ranges.
- Parameters:
data (pandas.DataFrame) – A pandas DataFrame containing demographic data.
- Returns:
A pandas DataFrame with columns ‘Reach’ and ‘Age Range’ in long format.
- Return type:
pandas.DataFrame
Example:
>>> import pandas as pd
>>> data_path = "output/<project_name>/ads_data/<project_name>_processed_data.xlsx"
>>> data = pd.read_excel(data_path)
>>> data_by_age = transform_data_by_age(data)
>>> data_by_age.head(3)
   Reach Age Range
0    7.0     18-24
1    0.0     18-24
3   23.0       65+
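Converting to long format means turning one column per age range into one row per (ad, age range) pair. The following standard-library sketch shows the idea. The wide-format column names and the helper `to_long_by_age` are hypothetical, and the real function may select and clean columns differently.

```python
def to_long_by_age(rows, age_ranges):
    """Turn one-column-per-age-range rows into (Reach, Age Range) records."""
    long_rows = []
    for age in age_ranges:
        for row in rows:
            value = row.get(age)
            if value is not None:  # skip ads with no data for this range
                long_rows.append({"Reach": value, "Age Range": age})
    return long_rows

# Hypothetical wide-format rows; the real column names may differ.
wide = [
    {"ad_id": "a1", "18-24": 7.0, "65+": 23.0},
    {"ad_id": "a2", "18-24": 0.0, "65+": None},
]
long_data = to_long_by_age(wide, ["18-24", "65+"])
print(long_data)
```

With pandas, `DataFrame.melt` performs the same kind of reshaping on a DataFrame.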
transform_data_by_gender Function
- AdDownloader.analysis.transform_data_by_gender(data)[source]
Transform demographic data into long format, separating data by gender.
- Parameters:
data (pandas.DataFrame) – A pandas DataFrame containing demographic data.
- Returns:
A pandas DataFrame with columns ‘Reach’ and ‘Gender’ in long format.
- Return type:
pandas.DataFrame
Example:
>>> # assuming data was already loaded
>>> data_by_gender = transform_data_by_gender(data)
>>> data_by_gender.head(3)
   Reach  Gender
0    NaN  female
1   68.0  female
2  243.0    male
get_graphs Function
- AdDownloader.analysis.get_graphs(data)[source]
Generate various graphs based on ad data. These include:
* Total reach by ad_delivery_start_time
* Total reach distribution (overall)
* Number of ads per page
* Top 20 pages with most ads
* Total reach by page
* Top 20 pages by reach
* Ad campaign duration distribution
* Ad campaign duration vs. total reach
* Reach across age ranges
* Reach across genders
- Parameters:
data (pandas.DataFrame) – A pandas DataFrame containing ad data.
- Returns:
A tuple containing multiple Plotly figures representing different visualizations.
- Return type:
Example:
>>> fig1, fig2, fig3, fig4, fig5, fig6, fig7, fig8, fig9, fig10 = get_graphs(data)
>>> fig1.show() # will open a webpage with the graph, which can also be saved locally
show_topics_top_pages Function
- AdDownloader.analysis.show_topics_top_pages(topics_df, original_data, n=20)[source]
Visualize the distribution of dominant topics for the top n pages by number of ads (20 by default).
- Parameters:
topics_df (pandas.DataFrame) – A pandas DataFrame containing data with dominant topics.
original_data (pandas.DataFrame) – A pandas DataFrame containing the original ad data.
n (int, optional) – The number of top pages to show topics for (default is 20).
- Returns:
A Plotly figure showing the distribution of dominant topics for the top n pages.
- Return type:
plotly.graph_objs._figure.Figure
Example:
>>> # using the output from `get_topics(tokens)`
>>> fig = show_topics_top_pages(topics_df, data)
>>> fig.show()
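The data behind such a figure is a count of dominant topics per page, restricted to the n pages with the most ads. The sketch below computes those counts with the standard library only. The helper `topic_counts_for_top_pages` and the flat list inputs are assumptions for illustration; the library works on DataFrames and produces a Plotly figure.

```python
from collections import Counter, defaultdict

def topic_counts_for_top_pages(pages, dom_topics, n=20):
    """Count dominant topics per page, restricted to the n pages with most ads."""
    top_pages = [page for page, _ in Counter(pages).most_common(n)]
    counts = defaultdict(Counter)
    for page, topic in zip(pages, dom_topics):
        if page in top_pages:
            counts[page][topic] += 1
    return {page: dict(counts[page]) for page in top_pages}

# Illustrative data: one page name and one dominant topic per ad.
pages = ["A", "A", "B", "A", "C", "B"]
dom_topics = [1, 1, 4, 0, 2, 4]
summary = topic_counts_for_top_pages(pages, dom_topics, n=2)
print(summary)  # {'A': {1: 2, 0: 1}, 'B': {4: 2}}
```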
blip_call Function
- AdDownloader.analysis.blip_call(images_path, task='image_captioning', nr_images=None, questions=None)[source]
Perform image captioning or visual question answering using the BLIP model on a set of images.
- Parameters:
images_path (str) – Path to the directory containing images.
task (str, optional) – The task to perform: “image_captioning” (default) or “visual_question_answering”.
nr_images (int, optional) – The number of images to process (default is None, which processes all images in the directory).
questions (str, optional) – A string containing one or more questions separated by question marks, used for visual question answering (default is None).
- Returns:
A pandas DataFrame containing image captions or answers to the provided questions.
- Return type:
pandas.DataFrame
Example:
>>> images_path = "output/<project_name>/ads_images"
>>> img_caption = blip_call(images_path, nr_images=20) # captioning
>>> img_caption.head(3)
             ad_id                                        img_caption
0  689539479274809            a group of people eating pizza together
1  352527490742823  a couple of people sitting at a table eating p...
2  891711935895560              a man and woman eating pizza together
>>> img_content = blip_call(images_path, task="visual_question_answering", nr_images=20, questions="Are there people in this ad?")
>>> img_content.head(5)
              ad_id Are there people in this ad?
0   723805182773873                          yes
1   871823271403675                           no
2  6398713840181656                          yes
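Because the questions parameter packs several questions into one string separated by question marks, it may help to see how such a string can be split back into individual questions. This is only a sketch of one plausible parsing approach. The helper `split_questions` is hypothetical, and the function's own parsing may differ.

```python
def split_questions(questions: str) -> list:
    """Split a string of questions on '?' and restore the trailing mark."""
    return [part.strip() + "?" for part in questions.split("?") if part.strip()]

parsed = split_questions("Are there people in this ad? Is there text in the image?")
print(parsed)  # ['Are there people in this ad?', 'Is there text in the image?']
```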
extract_dominant_colors Function
- AdDownloader.analysis.extract_dominant_colors(image_path, num_colors=3)[source]
Extracts the dominant colors from an image using KMeans clustering.
This function processes the image by resizing it for efficiency and reshaping it into a list of pixels. It then uses KMeans clustering to find the most common colors in the image. The dominant colors are returned as HEX codes along with their percentages in the image.
- Parameters:
- Returns:
A tuple of two lists: the first list contains the HEX codes of the dominant colors, and the second list contains the percentages of these colors within the image.
- Return type:
Example:
>>> import os
>>> image_files = [f for f in os.listdir(images_path) if f.endswith(('jpg', 'png', 'jpeg'))]
>>> dominant_colors, percentages = extract_dominant_colors(os.path.join(images_path, image_files[2]))
>>> for col, percentage in zip(dominant_colors, percentages):
...     print(f"Color: {col}, Percentage: {percentage:.2f}%")
...
Color: #3a2f28, Percentage: 41.99%
Color: #dfcbac, Percentage: 32.76%
Color: #817875, Percentage: 25.24%
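After clustering, each cluster center is an RGB triple and each pixel carries a cluster label, so the last step is to format the centers as HEX codes and compute each cluster's share of pixels. The sketch below covers only that final step, with hand-written centers and labels standing in for a KMeans result. The helper `summarize_clusters` is hypothetical.

```python
from collections import Counter

def summarize_clusters(centers, labels):
    """Convert RGB cluster centers to HEX codes and compute each cluster's pixel share."""
    counts = Counter(labels)
    total = len(labels)
    hex_codes = [
        "#{:02x}{:02x}{:02x}".format(*(int(round(c)) for c in center))
        for center in centers
    ]
    percentages = [100 * counts[i] / total for i in range(len(centers))]
    return hex_codes, percentages

# Illustrative clustering result: 2 centers and one label per pixel (8 pixels).
centers = [(58.0, 47.0, 40.0), (223.0, 203.0, 172.0)]
labels = [0, 0, 0, 0, 0, 1, 1, 1]
colors, shares = summarize_clusters(centers, labels)
print(colors, shares)  # ['#3a2f28', '#dfcbac'] [62.5, 37.5]
```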
assess_image_quality Function
- AdDownloader.analysis.assess_image_quality(image_path)[source]
Assesses the quality of an image based on its resolution, brightness, contrast, and sharpness.
This function calculates the resolution as the product of image width and height, evaluates brightness as the average value of the pixel intensity, measures contrast as the standard deviation of grayscale pixel intensities, and assesses sharpness by the variance of the Laplacian of the grayscale image. High values of sharpness and contrast generally indicate a higher-quality image.
- Parameters:
image_path (str) – The file path of the image to assess.
- Returns:
A tuple containing the resolution (as a single integer value of width multiplied by height), brightness (average pixel intensity), contrast (standard deviation of pixel intensities), and sharpness (variance of the Laplacian) of the image.
- Return type:
Example:
>>> resolution, brightness, contrast, sharpness = assess_image_quality(os.path.join(images_path, image_files[2]))
>>> print(f"Resolution: {resolution} pixels, Brightness: {brightness}, Contrast: {contrast}, Sharpness: {sharpness}")
Resolution: 188400 pixels, Brightness: 142.5308, Contrast: 71.3726, Sharpness: 3691.4007
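The four metrics described above can be illustrated on a tiny grayscale grid without any imaging library. The sketch below is a pure-Python approximation: it uses a 4-neighbour Laplacian on interior pixels only, so its sharpness values will not match those of an image library that handles borders differently. The name `quality_metrics` is hypothetical.

```python
from statistics import mean, pstdev, pvariance

def quality_metrics(gray):
    """Compute resolution, brightness, contrast, and sharpness of a grayscale grid."""
    height, width = len(gray), len(gray[0])
    pixels = [p for row in gray for p in row]
    # 4-neighbour Laplacian on interior pixels only (borders are skipped).
    laplacian = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1] - 4 * gray[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    return width * height, mean(pixels), pstdev(pixels), pvariance(laplacian)

# Tiny 4x4 synthetic "images": a flat one and one with a sharp vertical edge.
flat = [[100] * 4 for _ in range(4)]
edge = [[0, 0, 255, 255] for _ in range(4)]
flat_metrics = quality_metrics(flat)
edge_metrics = quality_metrics(edge)
print(flat_metrics)
print(edge_metrics)
```

The flat image has zero contrast and zero sharpness, while the image with an edge scores high on both, which matches the interpretation given above.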
analyse_image Function
- AdDownloader.analysis.analyse_image(image_path)[source]
Analyzes an image by extracting its dominant colors, assessing image quality (resolution, brightness, contrast, sharpness), and detecting edges and corners.
This comprehensive analysis function performs multiple assessments on an image. It extracts the dominant colors and their proportions, calculates quality metrics such as resolution, brightness, contrast, and sharpness, and performs edge detection and corner detection to analyze the image’s structure. Additionally, it extracts an advertisement ID from the image file path using a predefined pattern.
- Parameters:
image_path (str) – The file path of the image to be analyzed.
- Returns:
A dictionary containing the analysis results, including ad ID, dominant colors and their proportions, resolution, brightness, contrast, sharpness, and the number of corners detected.
- Return type:
Example:
>>> analysis_result = analyse_image(os.path.join(images_path, image_files[2]))
>>> print(analysis_result)
{'ad_id': '1043287820216670', 'resolution': 188400, 'brightness': 142.53080148619958, 'contrast': 71.3726801705792, 'sharpness': 3691.40007606529, 'ncorners': 17, 'dom_color_1': '#817875', 'dom_color_1_prop': 41.943359375, 'dom_color_2': '#dfcbab', 'dom_color_2_prop': 32.8369140625, 'dom_color_3': '#3a2f28', 'dom_color_3_prop': 25.2197265625}
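Extracting the ad ID from a file path with a predefined pattern typically comes down to a regular expression. The pattern is not documented here, so the sketch below assumes a hypothetical naming scheme in which the ID follows an `ad_` prefix. Both the pattern and the helper `extract_ad_id` are illustrative.

```python
import re

def extract_ad_id(image_path: str):
    """Return the digits following 'ad_' in the path, or None if absent."""
    match = re.search(r"ad_(\d+)", image_path)
    return match.group(1) if match else None

# Hypothetical file names; the library's actual naming pattern may differ.
found = extract_ad_id("output/demo/ads_images/ad_1043287820216670_img.png")
missing = extract_ad_id("output/demo/ads_images/banner.png")
print(found, missing)  # 1043287820216670 None
```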
analyse_image_folder Function
- AdDownloader.analysis.analyse_image_folder(folder_path, nr_images=None)[source]
Analyzes a set of images in a specified folder and exports the results to an Excel file.
This function iterates over image files in the specified folder, performing an analysis on each image using the analyse_image function. The analysis covers extracting dominant colors, assessing image quality (resolution, brightness, contrast, sharpness), and detecting edges and corners. The results are compiled into a pandas DataFrame and then exported to an Excel file.
- Parameters:
folder_path (str) – The path to the folder containing the image files to be analyzed. The folder can contain images in jpg, png, and jpeg formats.
nr_images (int, optional) – The number of images to analyze from the folder. If None, all images in the folder are analyzed. This parameter allows for limiting the analysis to a subset of images.
- Returns:
A pandas DataFrame containing the analysis results for each image.
- Return type:
pandas.DataFrame
Example:
>>> df = analyse_image_folder(images_path, nr_images=20)
>>> df.head(3)
              ad_id  resolution  brightness   contrast    sharpness  ncorners dom_color_1  dom_color_1_prop dom_color_2  dom_color_2_prop dom_color_3  dom_color_3_prop
0  1039719343827470      187800  172.399936  60.601719  1585.668739        21     #ced2ce         55.395508     #a48b7d         28.369141     #464347         16.235352
1  1043131113478341      187800  108.217066  73.420019   903.498253        18     #1b1c17         45.996094     #96603d         33.593750     #dcbea0         20.410156
2  1043287820216670      188400  142.530801  71.372680  3691.400076        17     #3a2f28         41.992188     #817875         32.763672     #dfcbac         25.244141