How to Visualize and Customize Backlink Analysis with Python


Chances are you’ve used one of the more popular tools like Ahrefs or Semrush to analyze your site’s backlinks.

These tools crawl the web for a list of sites linking to your website along with a domain rating and other data describing the quality of your backlinks.

It’s no secret that backlinks play a big role in Google’s algorithm, so it makes sense to at least understand your own site before comparing it to the competition.

While using tools gives you insight into specific metrics, learning how to analyze backlinks yourself gives you more flexibility in what you’re measuring and how it’s presented.

And while you can do most analysis on a spreadsheet, Python has some advantages.

Besides the large number of rows it can handle, it can also more easily examine the statistical side, such as distributions.

In this column, you’ll find step-by-step instructions on how to perform basic backlink analysis and customize your reports for different link attributes using Python.

Take a seat

We’ll take a small furniture industry website in the UK as an example and perform some basic analysis using Python.

So what is the value of a site’s backlinks for SEO?

At the simplest, I would say quality and quantity.

Quality is subjective to the expert, yet measurable to Google through metrics such as authority and content relevance.

We will start by evaluating the quality of the links with the available data before evaluating the quantity.

It’s time to code.

import re
import time
import random
import pandas as pd
import numpy as np
import datetime
from datetime import timedelta
from plotnine import *
import matplotlib.pyplot as plt
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
import uritools  
pd.set_option('display.max_colwidth', None)
%matplotlib inline

root_domain = 'johnsankey.co.uk'
hostdomain = 'www.johnsankey.co.uk'
hostname = 'johnsankey'
full_domain = 'https://www.johnsankey.co.uk'
target_name = 'John Sankey'

We start by importing the data and cleaning up the column names to make it easier to manipulate and faster to type for later steps.

target_ahrefs_raw = pd.read_csv(
    'data/johnsankey.co.uk-refdomains-subdomains__2022-03-18_15-15-47.csv')

List comprehensions are a powerful, concise way to clean up column names.

target_ahrefs_raw.columns = [col.lower() for col in target_ahrefs_raw.columns]

The list comprehension instructs Python to convert the column name to lowercase for each column (“col”) in the columns of the dataframe.

target_ahrefs_raw.columns = [col.replace(' ','_') for col in target_ahrefs_raw.columns]
target_ahrefs_raw.columns = [col.replace('.','_') for col in target_ahrefs_raw.columns]
target_ahrefs_raw.columns = [col.replace('__','_') for col in target_ahrefs_raw.columns]
target_ahrefs_raw.columns = [col.replace('(','') for col in target_ahrefs_raw.columns]
target_ahrefs_raw.columns = [col.replace(')','') for col in target_ahrefs_raw.columns]
target_ahrefs_raw.columns = [col.replace('%','') for col in target_ahrefs_raw.columns]
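If you’d rather do the clean-up in a single pass, the same steps can be wrapped in a small helper using the re module imported earlier (a sketch that mirrors the chained replacements above):

def clean_col(col):
    col = col.lower()
    col = re.sub(r'[()%]', '', col)   # drop brackets and percent signs
    col = re.sub(r'[ .]', '_', col)   # spaces and dots become underscores
    return re.sub(r'_+', '_', col)    # collapse any doubled underscores

target_ahrefs_raw.columns = [clean_col(col) for col in target_ahrefs_raw.columns]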

Although not strictly necessary, I like to have a count column as standard for aggregations and a “project” single value column if I need to aggregate the entire table.

target_ahrefs_raw['rd_count'] = 1
target_ahrefs_raw['project'] = target_name
target_ahrefs_raw
[Image: backlink analysis in Python. Screenshot of Pandas, March 2022]

We now have a dataframe with clean column names.

The next step is to clean up the actual values in the table and make them more useful for analysis.

Make a copy of the previous dataframe and give it a new name. Note the explicit .copy() – without it, the new name would simply point at the same underlying data, and any edits would also change the raw dataframe.

target_ahrefs_clean_dtypes = target_ahrefs_raw.copy()

Clean up the dofollow_ref_domains column, which tells us the number of dofollow referring domains.

In this case, we’ll convert the dashes to zeros and then convert the entire column to an integer.

# referring_domains
target_ahrefs_clean_dtypes['dofollow_ref_domains'] = np.where(
    target_ahrefs_clean_dtypes['dofollow_ref_domains'] == '-',
    0, target_ahrefs_clean_dtypes['dofollow_ref_domains'])
target_ahrefs_clean_dtypes['dofollow_ref_domains'] = target_ahrefs_clean_dtypes['dofollow_ref_domains'].astype(int)


# linked_domains
target_ahrefs_clean_dtypes['dofollow_linked_domains'] = np.where(
    target_ahrefs_clean_dtypes['dofollow_linked_domains'] == '-',
    0, target_ahrefs_clean_dtypes['dofollow_linked_domains'])
target_ahrefs_clean_dtypes['dofollow_linked_domains'] = target_ahrefs_clean_dtypes['dofollow_linked_domains'].astype(int)
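The same dash-to-zero pattern applies to both columns, so if your export has further count columns to clean, a small helper avoids the repetition (a sketch using the same np.where logic as above):

def dashes_to_int(df, col):
    # replace Ahrefs' '-' placeholder with 0, then cast the column to integer
    df[col] = np.where(df[col] == '-', 0, df[col])
    df[col] = df[col].astype(int)
    return df

for col in ['dofollow_ref_domains', 'dofollow_linked_domains']:
    target_ahrefs_clean_dtypes = dashes_to_int(target_ahrefs_clean_dtypes, col)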

First_seen tells us the date when the link was first found.

We’ll convert the string to a date format that Python can process, and then use that to derive link ages later.

# first_seen
target_ahrefs_clean_dtypes['first_seen'] = pd.to_datetime(target_ahrefs_clean_dtypes['first_seen'], format="%d/%m/%Y %H:%M")

Converting first_seen to a date also means we can do time aggregations by month and year.

This is useful because links to a site aren’t always acquired daily – although it would be good for my own site if they were!

target_ahrefs_clean_dtypes['month_year'] = target_ahrefs_clean_dtypes['first_seen'].dt.to_period('M')

The age of the link is calculated by taking today’s date and subtracting the first_seen date.

The result is then cast to an integer (which Pandas stores as nanoseconds) and divided by the number of nanoseconds in a day (3600 * 24 * 1,000,000,000) to get the age in days.

# link age
target_ahrefs_clean_dtypes['link_age'] = datetime.datetime.now() - target_ahrefs_clean_dtypes['first_seen']
# cast the timedelta to its integer (nanosecond) representation
target_ahrefs_clean_dtypes['link_age'] = target_ahrefs_clean_dtypes['link_age'].astype(int)
# nanoseconds -> days
target_ahrefs_clean_dtypes['link_age'] = (target_ahrefs_clean_dtypes['link_age']/(3600 * 24 * 1000000000)).round(0)
target_ahrefs_clean_dtypes
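As an aside, Pandas can do that conversion directly: a timedelta column exposes a .dt.days accessor, so the nanosecond arithmetic collapses into a single line (a near-equivalent sketch; it floors rather than rounds the day count):

# one-line equivalent of the nanosecond arithmetic above
target_ahrefs_clean_dtypes['link_age'] = (
    datetime.datetime.now() - target_ahrefs_clean_dtypes['first_seen']).dt.days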

[Image: backlink analysis Ahrefs data. Screenshot of Pandas, March 2022]

Once the data types are cleaned up and new data features created, the fun can begin!

Link quality

The first part of our analysis evaluates link quality. We start by summarizing the entire dataframe with the describe() function to get descriptive statistics for all columns.

target_ahrefs_analysis = target_ahrefs_clean_dtypes
target_ahrefs_analysis.describe()

[Image: Python backlink data table. Screenshot of Pandas, March 2022]

So, from the table above, we can see the count of referring domains (107), the mean (average), and the spread (the 25th percentile and so on) for each column.

The average domain rating (equivalent to Moz’s Domain Authority) of referring domains is 27.

Is this a good thing?

In the absence of competitive data to compare in this market sector, it is difficult to know. This is where your experience as an SEO practitioner comes in.

However, I’m sure we could all agree that it could be higher.

How much higher to make a change is another question.

[Image: evaluation of the domain over the years. Screenshot of Pandas, March 2022]

The table above can be a bit dry and difficult to visualize, so we’ll plot a histogram to get an intuitive understanding of the referring domain authority.

dr_dist_plt = (
    ggplot(target_ahrefs_analysis, aes(x = 'dr')) + 
    geom_histogram(alpha = 0.6, fill="blue", bins = 100) +
    scale_y_continuous() +   
    theme(legend_position = 'right'))
dr_dist_plt
[Image: link data bar chart. Screenshot by author, March 2022]

The distribution is highly skewed, showing that most referring domains have an authority rating of zero.

Beyond zero, the distribution seems fairly even, with an equal number of domains at different levels of authority.
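Rather than eyeballing the histogram, you can also quantify the skew directly, for example the share of referring domains with a DR of zero alongside the median DR:

zero_dr_share = (target_ahrefs_analysis['dr'] == 0).mean()
print(f"Zero DR share: {zero_dr_share:.0%}, median DR: {target_ahrefs_analysis['dr'].median()}")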

The age of the links is another important factor for SEO.

Let’s see the distribution below.

linkage_dist_plt = (
    ggplot(target_ahrefs_analysis, 
           aes(x = 'link_age')) + 
    geom_histogram(alpha = 0.6, fill="blue", bins = 100) +
    scale_y_continuous() +   
    theme(legend_position = 'right'))
linkage_dist_plt
[Image: bar chart for link age. Screenshot by author, March 2022]

The distribution seems more normal although it is still skewed, with the majority of the links being new.

The most common link age appears to be around 200 days, or less than a year, suggesting that most links were acquired recently.
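That reading is easy to verify against the data itself; the median link age should sit near that peak if most links really are recent:

# median link age in days, as a check on the histogram reading
print(target_ahrefs_analysis['link_age'].median())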

For interest, let’s see how this correlates with domain rating.

dr_linkage_plt = (
    ggplot(target_ahrefs_analysis, 
           aes(x = 'dr', y = 'link_age')) + 
    geom_point(alpha = 0.4, colour="blue", size = 2) +
    geom_smooth(method = 'lm', se = False, colour="red", size = 3, alpha = 0.4)
)

print(target_ahrefs_analysis['dr'].corr(target_ahrefs_analysis['link_age']))
dr_linkage_plt

0.1941101232345909
[Image: scatter plot of DR vs. link age. Screenshot by author, March 2022]

The chart (as well as the 0.19 figure printed above) shows little to no correlation between the two.

And why should there be?

A correlation would only imply that the higher authority links were acquired in the early phase of the site’s history.

The reason for the non-correlation will become more apparent later.

We will now look at link quality over time.

If we were to literally plot the number of links per date, the time series would look rather messy and less useful, as the sketch below shows.
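For reference, a minimal sketch of that raw plot would aggregate the link count by day and draw it directly (assuming the same dataframe as above):

# raw, unsmoothed links-per-day time series
raw_count_df = target_ahrefs_analysis.groupby(
    target_ahrefs_analysis['first_seen'].dt.date)['rd_count'].sum().reset_index()

raw_count_plt = (
    ggplot(raw_count_df, aes(x = 'first_seen', y = 'rd_count', group = 1)) + 
    geom_line(alpha = 0.6, colour = 'blue', size = 1) +
    theme(legend_position = 'right', 
          axis_text_x=element_text(rotation=90, hjust=1)))
raw_count_plt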

To smooth this out, we will calculate an expanding (cumulative) average of the Domain Rating by month of the year.

Note the expanding() function, which instructs Pandas to include all previous rows with each new row.

target_rd_cummean_df = target_ahrefs_analysis
target_rd_mean_df = target_rd_cummean_df.groupby(['month_year'])['dr'].sum().reset_index()
target_rd_mean_df['dr_runavg'] = target_rd_mean_df['dr'].expanding().mean()
target_rd_mean_df
[Image: expanding average of the domain rating by month. Screenshot of Pandas, March 2022]

We now have a table that we can use to populate the graph and visualize it.

dr_cummean_smooth_plt = (
    ggplot(target_rd_mean_df, aes(x = 'month_year', y = 'dr_runavg', group = 1)) + 
    geom_line(alpha = 0.6, colour="blue", size = 2) +
    scale_y_continuous() +
    scale_x_date() +
    theme(legend_position = 'right', 
          axis_text_x=element_text(rotation=90, hjust=1)
         ))
dr_cummean_smooth_plt
[Image: cumulative average domain rating over time. Screenshot by author, March 2022]

This is quite interesting, as it seems the site started attracting high-authority links early in its history (likely a PR campaign launching the business).

It then faded for four years before resuming with newly acquired high-authority links.

Volume of links

Just writing that heading feels good!

Who wouldn’t want a large volume of (good) links to their site?

Quality is one thing; volume is another, and that is what we’ll analyze next.

Just like the previous operation, we will use the expanding() function to calculate a cumulative sum of the links acquired to date.

target_count_cumsum_df = target_ahrefs_analysis
target_count_cumsum_df = target_count_cumsum_df.groupby(['month_year'])['rd_count'].sum().reset_index()
target_count_cumsum_df['count_runsum'] = target_count_cumsum_df['rd_count'].expanding().sum()
target_count_cumsum_df
[Image: cumulative sum of links table. Screenshot of Pandas, March 2022]

That’s the data, now the chart.

target_count_cumsum_plt = (
    ggplot(target_count_cumsum_df, aes(x = 'month_year', y = 'count_runsum', group = 1)) + 
    geom_line(alpha = 0.6, colour="blue", size = 2) +
    scale_y_continuous() + 
    scale_x_date() +
    theme(legend_position = 'right', 
          axis_text_x=element_text(rotation=90, hjust=1)
         ))
target_count_cumsum_plt
[Image: line graph of cumulative sum of links. Screenshot by author, March 2022]

We see that link acquisition slowed in early 2017, continued steadily over the next four years, and then accelerated again around March 2021.

Again, it would be good to correlate this with search performance, as sketched below.
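As a sketch of that follow-up, assuming you export monthly organic clicks (say, from Google Search Console) into a hypothetical data/search_performance.csv with month_year and clicks columns, the join and correlation would look like this:

# hypothetical monthly performance export: month_year, clicks
performance_df = pd.read_csv('data/search_performance.csv')
performance_df['month_year'] = pd.to_datetime(performance_df['month_year']).dt.to_period('M')

# join on month and correlate cumulative links with clicks
links_vs_perf = target_count_cumsum_df.merge(performance_df, on = 'month_year', how = 'inner')
print(links_vs_perf['count_runsum'].corr(links_vs_perf['clicks']))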

Go further

Of course, the above is just the tip of the iceberg: it’s a simple exploratory analysis of a single site, so it’s hard to infer anything useful for improving rankings in competitive search spaces.

Below are a few areas for further data exploration and analysis.

  • Add social media sharing data for the destination URLs.
  • Correlate overall site visibility with the running average DR over time.
  • Plot the distribution of DR over time.
  • Add search volume data on the hostnames to see how many brand searches the referring domains receive, as a measure of true authority.
  • Measure link velocity – the rate at which new links from new sites are acquired (see the sketch after this list).
  • Join crawl data to the destination URLs to test for content relevance.
  • Combine all of the ideas above in your analysis to compare yourself to your competitors.
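As a starting point for the link velocity idea, the monthly count of newly acquired referring domains is one groupby away, and a rolling mean smooths it into a rate (a sketch on the same dataframe):

# new referring domains per month, smoothed with a 3-month rolling average
link_velocity_df = target_ahrefs_analysis.groupby(['month_year'])['rd_count'].sum().reset_index()
link_velocity_df['velocity_3m'] = link_velocity_df['rd_count'].rolling(3).mean()
link_velocity_df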

I’m sure there are plenty of ideas not listed above; feel free to share them below.


Featured image: metamorworks/Shutterstock
