Analysis Module

This module provides analysis functions for ad data retrieved with the AdLibAPI object, including text analysis, image analysis, and visualizations.

load_data Function

AdDownloader.analysis.load_data(data_path)[source]

Load ads data from an Excel file into a dataframe given a valid path.

Parameters: data_path (str) – A valid path to an Excel file containing ad data.
Returns: A dataframe containing ads data and an additional campaign duration column.
Return type: pandas.DataFrame

Example:

>>> from AdDownloader.analysis import *
>>> data_path = "output/<project_name>/ads_data/<project_name>_processed_data.xlsx"
>>> data = load_data(data_path)
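
The campaign duration column can be thought of as the number of days between an ad's delivery start and stop dates. The following is a minimal sketch of that idea, not the module's own code: the helper `campaign_duration_days`, the field name `ad_delivery_stop_time`, and the sample dates are assumptions made for illustration.

```python
from datetime import date


def campaign_duration_days(start, stop):
    """Return the number of days between two ISO dates (YYYY-MM-DD)."""
    return (date.fromisoformat(stop) - date.fromisoformat(start)).days


# Hypothetical ad records; field names and dates are illustrative assumptions.
ads = [
    {"ad_delivery_start_time": "2020-10-01", "ad_delivery_stop_time": "2020-10-15"},
    {"ad_delivery_start_time": "2020-10-20", "ad_delivery_stop_time": "2020-11-04"},
]

durations = [
    campaign_duration_days(ad["ad_delivery_start_time"], ad["ad_delivery_stop_time"])
    for ad in ads
]
print(durations)
```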

preprocess Function

AdDownloader.analysis.preprocess(text)[source]

Preprocesses the input text by tokenizing, lowercasing, removing stopwords, and lemmatizing.

Parameters: text (str) – The input text to be preprocessed.
Returns: A list of preprocessed words.
Return type: list

Example:

>>> tokens = data["ad_creative_bodies"].apply(preprocess)
>>> tokens.head(3)
0    person earli vote open soon georgia wait take ...
1    2020 help turn year around find vote earli person
2    person earli vote open soon georgia wait take ...
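
To make the preprocessing steps concrete, here is a simplified, standard-library-only sketch covering tokenizing, lowercasing, and stopword removal. It is not the module's implementation: the function `simple_preprocess` and the tiny `STOPWORDS` set are hypothetical, and the lemmatizing step is left out entirely.

```python
import re

# A tiny illustrative stopword set (assumption); real pipelines use a full list.
STOPWORDS = {"the", "is", "in", "to", "a", "and", "for", "of"}


def simple_preprocess(text):
    """Lowercase the text, split it into alphanumeric tokens, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]


tokens_demo = simple_preprocess("In-person early voting is open in Georgia!")
print(tokens_demo)
```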

get_word_freq Function

AdDownloader.analysis.get_word_freq(tokens)[source]

Calculate word frequencies from a list of tokenized words using CountVectorizer.

Parameters: tokens (list) – A list of tokenized words, created using the preprocess function from this module.
Returns: A list of tuples, where each tuple contains a word and its corresponding frequency, sorted by frequency in descending order.
Return type: list of tuple

Example:

>>> freq_dist = get_word_freq(tokens)
>>> print(f"Most common 3 keywords: {freq_dist[0:3]}")
Most common 3 keywords: [('vote', 3273), ('elect', 1155), ('earli', 1125)]
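
The shape of the returned value (word and count pairs, most frequent first) can be reproduced with the standard library. The sketch below swaps in `collections.Counter` for CountVectorizer purely for illustration; `word_frequencies` and the sample documents are hypothetical.

```python
from collections import Counter


def word_frequencies(documents):
    """Count words across whitespace-separated documents, most frequent first."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.split())
    # most_common() returns (word, count) pairs sorted by descending count
    return counts.most_common()


docs = ["vote earli person", "vote elect", "vote earli"]
freq_demo = word_frequencies(docs)
print(freq_demo[:2])
```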

get_sentiment Function

AdDownloader.analysis.get_sentiment(captions)[source]

Compute the sentiment of the ad captions using two libraries: VADER (via NLTK) and TextBlob.

Parameters: captions (pandas.Series) – A pandas Series containing ad captions.
Returns: A tuple containing sentiment scores calculated using TextBlob and VADER.
Return type: tuple

Example:

>>> textblb_sent, nltk_sent = get_sentiment(data["ad_creative_bodies"])
>>> nltk_sent.head(3)
0     {'neg': 0.0, 'neu': 0.859, 'pos': 0.141, 'comp...
1     {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
2     {'neg': 0.098, 'neu': 0.644, 'pos': 0.258, 'co...
>>> textblb_sent.head(3)
0     0.125000
1     0.112500
2     0.142857
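
Each VADER result is a dictionary with `neg`, `neu`, `pos`, and `compound` keys. One possible way to turn the compound score into a label is sketched below. The 0.05 cutoff is a commonly used convention for VADER rather than something this module prescribes, and `label_from_compound` and the sample score values are hypothetical.

```python
def label_from_compound(scores, threshold=0.05):
    """Map a VADER-style score dict to 'positive', 'negative', or 'neutral'."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up score dicts in the same shape as the VADER output (assumption).
examples = [
    {"neg": 0.0, "neu": 0.8, "pos": 0.2, "compound": 0.4},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
    {"neg": 0.3, "neu": 0.7, "pos": 0.0, "compound": -0.5},
]
labels = [label_from_compound(s) for s in examples]
print(labels)
```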

get_topics Function

AdDownloader.analysis.get_topics(tokens, nr_topics=3)[source]

Perform topic modeling on a given set of tokens using Latent Dirichlet Allocation (LDA). The coherence score of the model can be improved by adjusting the number of topics or the hyperparameters of LDA.

Parameters:

tokens (list) – A list of tokenized words, created using the preprocess function from this module.
nr_topics (int, optional) – The number of topics to extract (default is 3).

Returns:

A tuple containing the trained LDA model, the topics, the coherence score, perplexity, log-likelihood, similarity and a dataframe with a topic assigned to each ad.

Return type:

tuple

Example:

>>> lda_model, topics, coherence_lda, perplexity, log_likelihood, avg_similarity, topics_df = get_topics(tokens, nr_topics=5)
Number of unique tokens: 435
Number of documents: 2000
Finished topic modeling for 5 topics.
Coherence: 0.71; Perplexity: 51.78; Log-Likelihood: -104762.56; Similarity: 0.07

Topic 0: ['vote', 'elect', 'paramount', 'give', 'need', 'win', 'novemb', '3rd']
Topic 1: ['vote', 'earli', 'find', 'year', 'person', 'click', 'easi', 'wait']
...
>>> topics_df.head(3)
   dom_topic  perc_contr                                     topic_keywords
0          1      0.6444  vote, earli, find, year, person, click, easi, ...
1          1      0.6567  vote, earli, find, year, person, click, easi, ...
2          4      0.9138  ballot, return, vote, today, click, home, demo...

get_topic_per_caption Function

AdDownloader.analysis.get_topic_per_caption(lda_model, vect_text, tf_feature_names, captions=None)[source]

Extract the main topic per caption using a trained LDA model from scikit-learn.

Parameters:

lda_model (sklearn.decomposition.LatentDirichletAllocation) – A trained LDA model from scikit-learn.
vect_text (scipy.sparse.csr_matrix) – The document-term matrix (output of the vectorizer).
tf_feature_names (numpy.ndarray or list of str) – The feature names (words) from the vectorizer.
captions (pandas.Series, optional) – A Series containing the original captions (default is None).

Returns:

A DataFrame containing the dominant topic, percentage contribution, and topic keywords for each document, plus the original caption when captions is provided.

Return type:

pandas.DataFrame

Example:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.decomposition import LatentDirichletAllocation
>>> # stop_words is assumed to be defined already, e.g. as a list of stopwords
>>> vectorizer = CountVectorizer(stop_words=stop_words, max_features=1000, min_df=5, max_df=0.95)
>>> vect_text = vectorizer.fit_transform(tokens) # assuming the tokens are already processed captions
>>> tf_feature_names = vectorizer.get_feature_names_out()

>>> lda_model = LatentDirichletAllocation(n_components=5, learning_method='online', random_state=0, max_iter=10, learning_decay=0.7, learning_offset=10).fit(vect_text)
>>> topics_df = get_topic_per_caption(lda_model, vect_text, tf_feature_names)
>>> topics_df.head(3)
   dom_topic  perc_contr                                     topic_keywords
0          1      0.6444  vote, earli, find, year, person, click, easi, ...
1          1      0.6567  vote, earli, find, year, person, click, easi, ...
2          4      0.9138  ballot, return, vote, today, click, home, demo...
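
In scikit-learn, `lda_model.transform(vect_text)` yields one row of topic weights per document. The dom_topic and perc_contr columns can plausibly be read as the index and share of the largest weight in each row. The sketch below illustrates that reading on a hand-written matrix; `dominant_topics` and the weights are hypothetical, and the module's actual computation may differ.

```python
def dominant_topics(doc_topic):
    """For each document, return (dominant topic index, rounded contribution)."""
    rows = []
    for weights in doc_topic:
        total = sum(weights)
        probs = [w / total for w in weights]  # normalize so each row sums to 1
        best = max(range(len(probs)), key=probs.__getitem__)
        rows.append((best, round(probs[best], 4)))
    return rows


# Made-up document-topic weights: 2 documents, 3 topics (assumption).
doc_topic = [
    [0.1, 0.7, 0.2],
    [0.5, 0.25, 0.25],
]
result = dominant_topics(doc_topic)
print(result)
```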

start_text_analysis Function

AdDownloader.analysis.start_text_analysis(text_data, column_name='ad_creative_bodies', topics=False)[source]

Perform text analysis including preprocessing, word frequency calculation, sentiment analysis, and, optionally, topic modeling. If topics is False, the function returns only the tokens, frequency distribution, and text sentiment; otherwise it additionally returns the LDA model, its topics and metrics, and a dataframe with a topic assigned to each ad.

Parameters:

text_data (pandas.DataFrame) – A pandas DataFrame containing a column with ad captions (ad_creative_bodies by default).
column_name (str, optional) – The name of the DataFrame column that contains the text data (default is “ad_creative_bodies”).
topics (bool, optional) – If True, topic modeling is performed in addition to the text and sentiment analysis (default is False).

Returns:

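A tuple containing the tokens, the word frequency distribution, and the sentiment scores from TextBlob and VADER. When topics is True, the tuple also contains the LDA model, the topics, the model metrics, and a dataframe with a topic assigned to each ad.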

Return type:

tuple

Example:

>>> # without topic modeling
>>> tokens, freq_dist, textblb_sent, nltk_sent = start_text_analysis(data)
>>> # with topic modeling
>>> tokens, freq_dist, textblb_sent, nltk_sent, lda_model, topics, coherence_lda, perplexity, log_likelihood, avg_similarity, topics_df = start_text_analysis(data, topics=True)
>>> # for output see all examples from above

transform_data_by_age Function

AdDownloader.analysis.transform_data_by_age(data)[source]

Transform demographic data into long format, separating data by age ranges.

Parameters: data (pandas.DataFrame) – A pandas DataFrame containing demographic data.
Returns: A pandas DataFrame with columns ‘Reach’ and ‘Age Range’ in long format.
Return type: pandas.DataFrame

Example:

>>> import pandas as pd
>>> data_path = "output/<project_name>/ads_data/<project_name>_processed_data.xlsx"
>>> data = pd.read_excel(data_path)
>>> data_by_age = transform_data_by_age(data)
>>> data_by_age.head(3)
      Reach Age Range
0     7.0       18-24
1     0.0       18-24
3     23.0       65+
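
Long format means one row per ad and age range rather than one column per age range. In pandas this reshaping is typically done with `DataFrame.melt`; the plain-Python sketch below shows the same idea without claiming to mirror the module's code. The helper `to_long_by_age`, the wide column names, and the sample values are assumptions.

```python
def to_long_by_age(rows, age_columns):
    """Turn one wide record per ad into one (Reach, Age Range) record per age column."""
    long_rows = []
    for row in rows:
        for col in age_columns:
            long_rows.append({"Reach": row.get(col), "Age Range": col})
    return long_rows


# Hypothetical wide-format records; the real column names may differ.
wide = [
    {"ad_id": "a1", "18-24": 7.0, "65+": 23.0},
    {"ad_id": "a2", "18-24": 0.0, "65+": 5.0},
]
long_rows = to_long_by_age(wide, ["18-24", "65+"])
print(long_rows[:2])
```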

transform_data_by_gender Function

AdDownloader.analysis.transform_data_by_gender(data)[source]

Transform demographic data into long format, separating data by gender.

Parameters: data (pandas.DataFrame) – A pandas DataFrame containing demographic data.
Returns: A pandas DataFrame with columns ‘Reach’ and ‘Gender’ in long format.
Return type: pandas.DataFrame

Example:

>>> # assuming data was already loaded
>>> data_by_gender = transform_data_by_gender(data)
>>> data_by_gender.head(3)
      Reach Gender
0      NaN  female
1     68.0  female
2    243.0    male
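
Once the data is in long format, totals per group are straightforward to compute, as long as missing values such as the NaN above are skipped. A minimal sketch, assuming records shaped like the rows above; `total_reach_by_group` is a hypothetical helper, not part of the module.

```python
import math
from collections import defaultdict


def total_reach_by_group(long_rows, group_key):
    """Sum 'Reach' per group, ignoring missing (None or NaN) values."""
    totals = defaultdict(float)
    for row in long_rows:
        reach = row["Reach"]
        if reach is None or math.isnan(reach):
            continue  # skip missing reach values
        totals[row[group_key]] += reach
    return dict(totals)


rows = [
    {"Reach": float("nan"), "Gender": "female"},
    {"Reach": 68.0, "Gender": "female"},
    {"Reach": 243.0, "Gender": "male"},
]
totals = total_reach_by_group(rows, "Gender")
print(totals)
```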

get_graphs Function

AdDownloader.analysis.get_graphs(data)[source]

Generate various graphs based on ad data. These include:

* Total reach by ad_delivery_start_time
* Total reach distribution (overall)
* Number of ads per page
* Top 20 pages with most ads
* Total reach by page
* Top 20 pages by reach
* Ad campaign duration distribution
* Ad campaign duration vs. total reach
* Reach across age ranges
* Reach across genders

Parameters: data (pandas.DataFrame) – A pandas DataFrame containing ad data.
Returns: A tuple containing multiple Plotly figures representing different visualizations.
Return type: tuple

Example:

>>> fig1, fig2, fig3, fig4, fig5, fig6, fig7, fig8, fig9, fig10 = get_graphs(data)
>>> fig1.show() # will open a webpage with the graph, which can also be saved locally

show_topics_top_pages Function

AdDownloader.analysis.show_topics_top_pages(topics_df, original_data, n=20)[source]

Visualize the distribution of dominant topics for the top n pages by number of ads (20 by default).

Parameters:

topics_df (pandas.DataFrame) – A pandas DataFrame containing data with dominant topics.
original_data (pandas.DataFrame) – A pandas DataFrame containing the original ad data.
n (int, optional) – The number of top pages to show topics for (default is 20).

Returns:

A Plotly figure showing the distribution of dominant topics for the top n pages.

Return type:

plotly.graph_objs._figure.Figure

Example:

>>> # using the output from `get_topics(tokens)`
>>> fig = show_topics_top_pages(topics_df, data)
>>> fig.show()

blip_call Function

AdDownloader.analysis.blip_call(images_path, task='image_captioning', nr_images=None, questions=None)[source]

Perform image captioning or visual question answering using the BLIP model on a set of images.

Parameters:

images_path (str) – Path to the directory containing images.
task (str, optional) – The task to perform, either “image_captioning” or “visual_question_answering” (default is “image_captioning”).
nr_images (int, optional) – The number of images to process (default is None, which processes all images in the directory).
questions (str, optional) – A string containing one or more questions separated by question marks, used for visual question answering (default is None).

Returns:

A pandas DataFrame containing image captions or answers to the provided questions.

Return type:

pandas.DataFrame

Example:

>>> images_path = "output/<project_name>/ads_images"
>>> img_caption = blip_call(images_path, nr_images=20) # captioning
>>> img_caption.head(3)
               ad_id                                        img_caption
0     689539479274809            a group of people eating pizza together
1     352527490742823  a couple of people sitting at a table eating p...
2     891711935895560              a man and woman eating pizza together
>>> img_content = blip_call(images_path, task="visual_question_answering", nr_images=20, questions="Are there people in this ad?")
>>> img_content.head(3)
            ad_id Are there people in this ad?
0   723805182773873                          yes
1   871823271403675                           no
2  6398713840181656                          yes
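
Because the questions parameter is a single string with questions separated by question marks, several questions can be passed at once. The sketch below shows one way such a string might be split into individual questions; `split_questions` and the second sample question are hypothetical, and the module may parse the string differently.

```python
def split_questions(questions):
    """Split a string on '?' and restore the question mark on each part."""
    parts = [part.strip() for part in questions.split("?")]
    return [part + "?" for part in parts if part]


qs = split_questions("Are there people in this ad? Is there text in the image?")
print(qs)
```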

extract_dominant_colors Function

AdDownloader.analysis.extract_dominant_colors(image_path, num_colors=3)[source]

Extracts the dominant colors from an image using KMeans clustering.

This function processes the image by resizing it for efficiency and reshaping it into a list of pixels. It then uses KMeans clustering to find the most common colors in the image. The dominant colors are returned as HEX codes along with their percentages in the image.

Parameters:

image_path (str) – The file path of the image to be analyzed.
num_colors (int, optional) – The number of dominant colors to extract from the image. Defaults to 3.

Returns:

A tuple of two lists: the first list contains the HEX codes of the dominant colors, and the second list contains the percentages of these colors within the image.

Return type:

tuple(list, list)

Example:

>>> import os
>>> image_files = [f for f in os.listdir(images_path) if f.endswith(('jpg', 'png', 'jpeg'))]
>>> dominant_colors, percentages = extract_dominant_colors(os.path.join(images_path, image_files[2]))
>>> for col, percentage in zip(dominant_colors, percentages):
...     print(f"Color: {col}, Percentage: {percentage:.2f}%")
...
Color: #3a2f28, Percentage: 41.99%
Color: #dfcbac, Percentage: 32.76%
Color: #817875, Percentage: 25.24%
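
After clustering, two small conversions remain: cluster centers (RGB values) become HEX codes, and cluster label counts become percentages. The sketch below illustrates both steps on made-up cluster output; `rgb_to_hex`, `cluster_percentages`, the centers, and the labels are assumptions and do not reproduce the module's code.

```python
from collections import Counter


def rgb_to_hex(rgb):
    """Convert an (R, G, B) triple with values in 0-255 to a HEX color code."""
    return "#{:02x}{:02x}{:02x}".format(*(int(round(c)) for c in rgb))


def cluster_percentages(labels):
    """Return the share of pixels (in percent) assigned to each cluster label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: 100.0 * count / total for label, count in counts.items()}


# Made-up clustering output: two cluster centers and one label per pixel.
centers = [(58.2, 47.1, 40.0), (223.0, 203.4, 172.0)]
pixel_labels = [0, 0, 0, 1]

hex_codes = [rgb_to_hex(c) for c in centers]
shares = cluster_percentages(pixel_labels)
print(hex_codes, shares)
```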

assess_image_quality Function

AdDownloader.analysis.assess_image_quality(image_path)[source]

Assesses the quality of an image based on its resolution, brightness, contrast, and sharpness.

This function calculates the resolution as the product of image width and height, evaluates brightness as the average value of the pixel intensity, measures contrast as the standard deviation of grayscale pixel intensities, and assesses sharpness by the variance of the Laplacian of the grayscale image. High values of sharpness and contrast generally indicate a higher-quality image.

Parameters: image_path (str) – The file path of the image to assess.
Returns: A tuple containing the resolution (as a single integer value of width multiplied by height), brightness (average pixel intensity), contrast (standard deviation of pixel intensities), and sharpness (variance of the Laplacian) of the image.
Return type: tuple(int, float, float, float)

Example:

>>> resolution, brightness, contrast, sharpness = assess_image_quality(os.path.join(images_path, image_files[2]))
>>> print(f"Resolution: {resolution} pixels, Brightness: {brightness}, Contrast: {contrast}, Sharpness: {sharpness}")
Resolution: 188400 pixels, Brightness: 142.5308, Contrast: 71.3726, Sharpness: 3691.4007
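
The three intensity-based metrics can be illustrated on a tiny grayscale grid using only the standard library. This is a sketch under stated assumptions, not the module's implementation: `quality_metrics` is hypothetical, it uses a 4-neighbour Laplacian, and it skips border pixels, so its values will not match an image library's output exactly.

```python
from statistics import mean, pstdev, pvariance


def quality_metrics(gray):
    """Return (brightness, contrast, sharpness) for a 2D grid of gray values."""
    pixels = [p for row in gray for p in row]
    brightness = mean(pixels)   # average pixel intensity
    contrast = pstdev(pixels)   # population standard deviation of intensities
    # 4-neighbour Laplacian, evaluated on interior pixels only
    laplacian = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, len(gray) - 1)
        for x in range(1, len(gray[0]) - 1)
    ]
    sharpness = pvariance(laplacian)  # variance of the Laplacian response
    return brightness, contrast, sharpness


# A uniform image versus a checkerboard: the latter has far more edge detail.
flat = [[100] * 4 for _ in range(4)]
checker = [[0 if (x + y) % 2 == 0 else 255 for x in range(4)] for y in range(4)]

flat_metrics = quality_metrics(flat)
checker_metrics = quality_metrics(checker)
print(flat_metrics, checker_metrics)
```

The uniform grid has zero contrast and zero sharpness, while the checkerboard scores high on both, which matches the intuition that higher values indicate more detail.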

analyse_image Function

AdDownloader.analysis.analyse_image(image_path)[source]

Analyzes an image by extracting its dominant colors, assessing image quality (resolution, brightness, contrast, sharpness), and detecting edges and corners.

This comprehensive analysis function performs multiple assessments on an image. It extracts the dominant colors and their proportions, calculates quality metrics such as resolution, brightness, contrast, and sharpness, and performs edge detection and corner detection to analyze the image’s structure. Additionally, it extracts an advertisement ID from the image file path using a predefined pattern.

Parameters: image_path (str) – The file path of the image to be analyzed.
Returns: A dictionary containing the analysis results, including ad ID, dominant colors and their proportions, resolution, brightness, contrast, sharpness, and the number of corners detected.
Return type: dict

Example:

>>> analysis_result = analyse_image(os.path.join(images_path, image_files[2]))
>>> print(analysis_result)
{'ad_id': '1043287820216670', 'resolution': 188400, 'brightness': 142.53080148619958, 'contrast': 71.3726801705792, 'sharpness': 3691.40007606529,
'ncorners': 17, 'dom_color_1': '#817875', 'dom_color_1_prop': 41.943359375, 'dom_color_2': '#dfcbab', 'dom_color_2_prop': 32.8369140625, 'dom_color_3': '#3a2f28', 'dom_color_3_prop': 25.2197265625}
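
The documentation does not spell out the pattern used to extract the ad ID from the file path. The sketch below assumes, purely for illustration, that file names embed the ID as `ad_<digits>`; the pattern, the helper `extract_ad_id`, and the sample file name are all hypothetical.

```python
import os
import re

# Assumed naming scheme: the file name contains "ad_" followed by the numeric ID.
AD_ID_PATTERN = re.compile(r"ad_(\d+)")


def extract_ad_id(image_path):
    """Return the numeric ad ID found in the file name, or None if absent."""
    match = AD_ID_PATTERN.search(os.path.basename(image_path))
    return match.group(1) if match else None


ad_id = extract_ad_id("some_folder/ad_1043287820216670_img.png")
print(ad_id)
```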

analyse_image_folder Function

AdDownloader.analysis.analyse_image_folder(folder_path, nr_images=None, project_name=None)[source]

Analyzes a set of images in a specified folder and returns a DataFrame with the results. When project_name is provided, the results are also exported to an Excel file at output/<project_name>/ads_data/<project_name>_image_analysis.xlsx.

This function iterates over image files in the specified folder, performing an analysis on each image using the analyse_image function. The analysis covers extracting dominant colors, assessing image quality (resolution, brightness, contrast, sharpness), and detecting edges and corners. The results are compiled into a pandas DataFrame.

Parameters:

folder_path (str) – The path to the folder containing the image files to be analyzed. The folder can contain images in jpg, png, and jpeg formats.
nr_images (int, optional) – The number of images to analyze from the folder. If None, all images in the folder are analyzed. This parameter allows for limiting the analysis to a subset of images.
project_name (str, optional) – The name of the current project. When provided, the analysis results are saved to an Excel file inside the project’s output folder.

Returns:

A pandas DataFrame containing the analysis results for each image.

Return type:

pandas.DataFrame

Example:

>>> df = analyse_image_folder(images_path, nr_images=20)
>>> # or, to also export the results to output/<project_name>/ads_data/<project_name>_image_analysis.xlsx:
>>> df = analyse_image_folder(images_path, nr_images=20, project_name="test1")
>>> df.head(3)
            ad_id  resolution  brightness   contrast    sharpness  ncorners dom_color_1  dom_color_1_prop dom_color_2  dom_color_2_prop dom_color_3  dom_color_3_prop
0  1039719343827470      187800  172.399936  60.601719  1585.668739        21     #ced2ce         55.395508     #a48b7d         28.369141     #464347         16.235352
1  1043131113478341      187800  108.217066  73.420019   903.498253        18     #1b1c17         45.996094     #96603d         33.593750     #dcbea0         20.410156
2  1043287820216670      188400  142.530801  71.372680  3691.400076        17     #3a2f28         41.992188     #817875         32.763672     #dfcbac         25.244141
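
The folder iteration described above boils down to filtering files by extension and optionally limiting how many are processed. A minimal sketch of that selection step follows; `list_image_files` is a hypothetical helper, and sorting the file names is an added assumption to make the result deterministic.

```python
import os
import tempfile


def list_image_files(folder_path, nr_images=None):
    """Return sorted jpg/jpeg/png file names, optionally limited to nr_images."""
    extensions = (".jpg", ".jpeg", ".png")
    files = sorted(
        name for name in os.listdir(folder_path)
        if name.lower().endswith(extensions)
    )
    return files if nr_images is None else files[:nr_images]


# Demonstrate on a temporary folder with a mix of image and non-image files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.png", "a.jpg", "notes.txt", "c.jpeg"):
        open(os.path.join(tmp, name), "w").close()
    all_images = list_image_files(tmp)
    first_two = list_image_files(tmp, nr_images=2)

print(all_images, first_two)
```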