Building Your Own Instagram Discovery Engine: A Step-By-Step Tutorial

Learn how to build your own Instagram discovery engine full of personalized, relevant content curated for you via neural networks and visual recognition by AI.

Isn't it great how Instagram's "Explore" section displays content that matches your interests? When you open the application, the content and recommendations shown are almost always relevant to your specific likes, interests, and connections. While it may be fun to think we're the center of the Instagram universe, the reality is that personalized, relevant content is also uniquely curated for 400 million other people every day. With 400M active users and 80M photos posted daily, how does Instagram decide what to put in your Explore section? Let's look at the key factors Instagram uses to score posts for your timeline and Explore section.

Before we get into the nitty-gritty, here are some features Instagram uses to determine what content to serve up:

  • Timing: The more recent the post, the higher the score.
  • Engagement: Could be determined by the number of likes, comments, and/or views. If a user engages with certain tags more often, such as #snowboarding, that user will be shown more images of snowboarding.
  • Previous interactions: How often you have interacted with this user in the past. The more you engage with a given user, the more relevant their posts are considered to be.
  • Affinity: How you are related to this person. A friend of a friend, a friend that you haven't connected with yet, or someone you don't know?
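
To make that concrete, here's a toy sketch of my own (not Instagram's actual formula, and simpler than the metric we'll build in Step 3) of how those four signals could roll up into a single score for a post:

import math
import time

def post_score(taken_at, likes, comments, past_likes_of_author, affinity):
    # recency decays exponentially; engagement and interaction history are log-damped
    recency = math.exp(-(time.time() - taken_at) / 1e5)
    engagement = math.log10(likes + 1) + math.log10(comments + 1)
    history = math.log(past_likes_of_author + 1)
    return recency * (1 + engagement) * (1 + history) * affinity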

Now, let's use these features to build our own Instagram discovery engine. In order to query data from Instagram, I am going to use the very cool, yet unofficial, Instagram API written by Pasha Lev. For Mac users, the following should get you up and running. All other libraries are pip-installable, and all Python code was run within a Jupyter notebook.

Step 1: Set Up Jupyter Notebook and Dependencies

To get up and running, run the following in your terminal:

brew install libxmlsec1 ffmpeg
pip install pandas tqdm jupyter networkx
pip install -e "git+https://github.com/LevPasha/Instagram-API-python.git#egg=InstagramAPI"

Then launch Jupyter from your terminal; the notebook will open in your default browser. I would also recommend verifying your Instagram phone number before continuing, as this will prevent some unexpected redirects.
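
In the terminal, that's just:

jupyter notebook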

Now on to the good stuff. Let's start with finding my social network and a bit of graph analysis.

from InstagramAPI import InstagramAPI
from tqdm import tqdm
import pandas as pd
api = InstagramAPI("username", "password")
api.login() # login

If all goes well, you should get a Login success! response.

We can now build a true social network by finding everyone I follow, as well as everyone they follow. For a quick intro on social network analysis and personalized page rank, take a look at this blog post.
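
If personalized page rank is new to you, here's a tiny NetworkX toy example (my own illustration, not part of the original post): putting all of the restart probability on 'me' biases the random walk, so nodes that are easier to reach from my position in the graph get higher scores.

import networkx as nx

# toy graph: me, two friends, and their connections
G = nx.Graph()
G.add_edges_from([('me', 'alice'), ('me', 'bob'),
                  ('alice', 'carol'), ('bob', 'carol'), ('carol', 'dave')])

# put all of the restart probability on 'me'
personalization = {node: 0 for node in G.nodes()}
personalization['me'] = 1
print(nx.pagerank(G, personalization=personalization))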

Before stepping into the code, let's take a look at my own profile to see what we're trying to analyze.

As you can see, I follow 42 people, who are considered my immediate network, which isn't too many. If we start to look at second-degree connections, that number quickly grows. In my case, if we look at second-degree connections, the number of nodes reaches over 24,000. A nice visualization of this can be seen in Step 2.

#first let's figure out who I am
api.getSelfUsernameInfo()
result = api.LastJson
user_id = result['user']['pk'] # my own personal user id
me = result['user']['full_name'] # my own personal username

api.getSelfUsersFollowing()
result = api.LastJson
follow_relationships = []
for user in tqdm(result['users']):
    followed_user_id = user['pk'] 
    followed_user_name = user['full_name']
    follow_relationships.append((user_id, followed_user_id, me, followed_user_name))
    api.getUserFollowings(followed_user_id)
    result2 = api.LastJson
    if result2.get('users') is not None:
        for user2 in result2['users']:
            follow_relationships.append((followed_user_id, user2['pk'],
                                         followed_user_name, user2['full_name']))

Cool! Now, let's get that into a nicely formatted Pandas DataFrame.

df = pd.DataFrame(follow_relationships,
                  columns=['src_id','dst_id', 'src_name', 'dst_name'])

Step 2: Network Visualization (Optional)

While it's not essential to visualize your network in order to build your own discovery engine, it is pretty interesting and may help with understanding personalized page ranks. I'm going to use one of my new favorite graph visualization libraries, Graphistry (check them out sometime). However, if you don't want to wait around for an API key (though I got a same-day response), there are lots of other good libraries, such as NetworkX.

import graphistry
graphistry.register(key='Email pygraphistry@graphistry.com for an API key!')
graphistry.bind(source='src_name', destination='dst_name').edges(df).plot()

For this example, I'm going to display the src_id and dst_id to give my friends a bit of privacy, though it is pretty fun to display usernames (which is what the code above will do). The first graph only displays edges that are sourced from me, filtered using the built-in tools in Graphistry.

The second graph shows all of my extended network.

Isn't that cool? You can already see a couple of interesting features, such as the few external centroids and how they interact with the rest of my social network.

Step 3: Finding Top Images From Social Network

It's now time to grab the most recent images from everyone and rate them by how relevant they are to me. Since there are about 24,000 nodes, it may take a while to download all the data.

Let's do a quick trial run of only the 42 people I immediately follow to make sure we're on the right track.

Based on what I thought might determine the relative score of Instagram posts, we need to grab the number of likes, the number of comments, and the time the photo was taken for all recent photos of people I follow (in this example, I considered "recent" to be one week and cut off photos older than that). It would also be useful to grab how many times I've "liked" that user's posts and how connected that person is to me. Everything except connectedness is a simple sum; to calculate the connectedness piece, we'll use a personalized page rank. Once we've compiled that information, we can define an importance metric like:
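
Written out (this is exactly what the scoring code below computes), the metric is:

time_score = exp(-(now - taken_at) / 1e5)
total_score = log10(num_comments + 2) * log10(num_likes + 1) * page_rank * time_score * log(weight + 1)

where page_rank is the poster's personalized page rank and weight is the number of times I've liked that user's posts (defaulting to 0.5 if I haven't liked any).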

Alright, now that we have that defined, let's see how it works! I apologize for the big chunk of code coming up, but don't you worry... there is a picture of my new puppy at the end!

from IPython.display import Image, display
import networkx as nx
import numpy as np
import time
import datetime


#get all users that I am directly following
api.getSelfUsersFollowing()
result = api.LastJson
follow_relationships = []
for user in tqdm(result['users']):
    followed_user_id = user['pk'] 
    followed_user_name = user['full_name']
    follow_relationships.append((user_id, followed_user_id, me, followed_user_name))
  
df_local = pd.DataFrame(follow_relationships, columns=['src_id','dst_id', 'src_name', 'dst_name'])
all_user_ids_local = np.unique(df_local[['src_id', 'dst_id']].values.reshape(1,-1))



#grab all my likes from the past year
last_year = datetime.datetime.now() - datetime.timedelta(days=365)
now = datetime.datetime.now()
last_result_time = now
all_likes = []
max_id = 0

while last_result_time > last_year:
    api.getLikedMedia(maxid=max_id)
    results = api.LastJson
    [all_likes.append(item) for item in results['items']]
    max_id = results['items'][-1]['pk']
    last_result_time = pd.to_datetime(results['items'][-1]['taken_at'], unit='s')

like_counts = pd.Series([i['user']['pk'] for i in all_likes]).value_counts()   


#calculate number of times I've liked each users post
for i in tqdm(like_counts.index):
    if i in df_local['dst_id'].values: # only count likes from people I follow (naive but simple)
        ind = df_local[(df_local['src_id'] == user_id) & (df_local['dst_id'] == i)].index[0]
        if like_counts[i] is not None:
            df_local = df_local.set_value(ind,'weight',like_counts[i])
ind = df_local[df_local['weight'].isnull()].index
df_local = df_local.set_value(ind,'weight',0.5)

#create social graph and calculate pagerank
G = nx.from_pandas_dataframe(df_local, 'src_id', 'dst_id')
#calculate personalized pagerank
perzonalization_dict = dict(zip(G.nodes(), [0]*len(G.nodes())))
perzonalization_dict[user_id] = 1
ppr = nx.pagerank(G, personalization=perzonalization_dict)

#this may take a while if you follow a lot of people
urls = []
taken_at = []
num_likes = []
num_comments = []
page_rank = []
users = []
weight = []
for user_id in tqdm(all_user_ids_local):
    api.getUserFeed(user_id)
    result = api.LastJson
    if 'items' in result.keys():
        for item in result['items']:
            if 'image_versions2' in item.keys(): #only grabbing pictures (no videos or carousels)
                # make sure we can grab keys before trying to append
                
                url = item['image_versions2']['candidates'][1]['url']
                taken = item['taken_at']
                try:
                    likes = item['like_count']
                except KeyError:
                    likes = 0
                try:
                    comments = item['comment_count']
                except KeyError:
                    comments = 0
                
                pr = ppr[item['user']['pk']]
                user = item['user']['full_name']
                if user != me: #don't count myself!
                    urls.append(url)
                    taken_at.append(taken)
                    num_likes.append(likes)
                    num_comments.append(comments)
                    page_rank.append(pr)
                    users.append(user)
                    weight.append(df_local[df_local['dst_name'] == user]['weight'].values[0])

    
#now we can make a dataframe with all of that information  
scores_df = pd.DataFrame(
    {'urls': urls,
     'taken_at': taken_at,
     'num_likes': num_likes,
     'num_comments': num_comments,
     'page_rank': page_rank,
     'users': users,
     'weight': weight
    })
#don't care about anything older than 1 week
oldest_time = int((datetime.datetime.now()
                   - datetime.timedelta(weeks = 1)).strftime('%s'))

scores_df = scores_df[scores_df['taken_at'] > oldest_time]

# /1e5 to help out with some machine precision (numbers get real small otherwise)
scores_df['time_score'] = np.exp(-(int(time.time()) - scores_df['taken_at'])/1e5)

scores_df['total_score'] = (np.log10(scores_df['num_comments']+2) * np.log10(scores_df['num_likes']+1)
                            * scores_df['page_rank'] * scores_df['time_score']
                            * np.log(scores_df['weight']+1))

        
# calculate top ten highest rated posts
top_ten = scores_df['total_score'].nlargest(10)
top_rows = scores_df.loc[top_ten.index]
top_personal_img = []


top_graph_img = []
#display the feed (access columns by name so this doesn't depend on column order)
for _, row in top_rows.iterrows():
    img = Image(row['urls'], format='jpeg')
    top_graph_img.append(img)
    display(img)
    top_personal_img.append(img)
    print('taken_at: %s' % time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(row['taken_at'])))
    print('number of likes: %s' % row['num_likes'])
    print('number of comments: %s' % row['num_comments'])
    print('page_rank: %s' % row['page_rank'])
    print(row['users'])

Which gives me:

This actually looks very similar to my personal timeline — cool! Now that we know we're onto something, let's tackle the discovery section.

Step 4: Extending User Base

We can take the same approach as before by calculating the relative score of each photo of friends of friends. To do this, we'd start with the first social graph that we calculated... but that has over 24K nodes and I'm too lazy to wait for all the data. Instead, let's grab photos of friends of friends whose posts I've "liked." This drops the number of nodes down to just over 1,500 which, depending on your internet speed, is the perfect amount of time for a coffee break.

There are a couple minor tweaks to the above code that are needed to deal with the extended user base, but most of the code is the same.

#get my info
api.getSelfUsernameInfo()
result = api.LastJson
user_id = result['user']['pk'] # my own personal user id
me = result['user']['full_name'] # my own personal username

#get photos that I've liked
api.getLikedMedia()
result = api.LastJson
users = [item['user'] for item in result['items']]

# build up extended social network
follow_relationships = []
for user in tqdm(users):
    followed_user_id = user['pk'] 
    followed_user_name = user['full_name']
    follow_relationships.append((user_id, followed_user_id, me, followed_user_name))
    api.getUserFollowings(followed_user_id)
    result2 = api.LastJson
    for user2 in result2['users']:
        follow_relationships.append((followed_user_id, user2['pk'], followed_user_name, user2['full_name']))

df_global = pd.DataFrame(follow_relationships, columns=['src_id','dst_id', 'src_name', 'dst_name'])
all_user_ids_global = np.unique(df_global[['src_id', 'dst_id']].values.reshape(1,-1))


#create social graph and calculate pagerank
G = nx.from_pandas_dataframe(df_global, 'src_id', 'dst_id')
#calculate personalized pagerank
perzonalization_dict = dict(zip(G.nodes(), [0]*len(G.nodes())))
perzonalization_dict[user_id] = 1
ppr = nx.pagerank(G, personalization=perzonalization_dict)

#this may take a while if you follow a lot of people
urls = []
taken_at = []
num_likes = []
num_comments = []
page_rank = []
users = []
for user_id in tqdm(all_user_ids_global):
    api.getUserFeed(user_id)
    result = api.LastJson
    if 'items' in result.keys():
        for item in result['items']:
            if 'image_versions2' in item.keys(): #only grabbing pictures (no videos or carousels)
                # make sure we can grab keys before trying to append
                
                url = item['image_versions2']['candidates'][1]['url']
                taken = item['taken_at']
                try:
                    likes = item['like_count']
                except KeyError:
                    likes = 0
                try:
                    comments = item['comment_count']
                except KeyError:
                    comments = 0
                
                pr = ppr[item['user']['pk']]
                user = item['user']['full_name']
                if user != me: #don't count myself!
                    urls.append(url)
                    taken_at.append(taken)
                    num_likes.append(likes)
                    num_comments.append(comments)
                    page_rank.append(pr)
                    users.append(user)
                    

    
#now we can make a dataframe with all of that information  
scores_df = pd.DataFrame(
    {'urls': urls,
     'taken_at': taken_at,
     'num_likes': num_likes,
     'num_comments': num_comments,
     'page_rank': page_rank,
     'users': users,
    })

#don't care about anything older than 1 week
oldest_time = int((datetime.datetime.now()
                   - datetime.timedelta(weeks = 1)).strftime('%s'))

# For a discovery feed, we don't want to show photos of people we already follow
scores_df = scores_df[~scores_df['users'].isin(df_local['dst_name'])]
scores_df = scores_df[scores_df['taken_at'] > oldest_time]

# /1e5 to help out with some machine precision (numbers get real small otherwise)
scores_df['time_score'] = np.exp(-(int(time.time()) - scores_df['taken_at'])/1e5)

scores_df['total_score'] = (np.log10(scores_df['num_comments']+2) * np.log10(scores_df['num_likes']+1)
                            * scores_df['page_rank'] * scores_df['time_score'])

        
# calculate top ten highest rated posts
top_ten = scores_df['total_score'].nlargest(10)
top_rows = scores_df.loc[top_ten.index]
top_personal_img = []

top_graph_img = []
#display the feed (access columns by name so this doesn't depend on column order)
for _, row in top_rows.iterrows():
    img = Image(row['urls'], format='jpeg')
    top_graph_img.append(img)
    display(img)
    top_personal_img.append(img)
    print('taken_at: %s' % time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(row['taken_at'])))
    print('number of likes: %s' % row['num_likes'])
    print('number of comments: %s' % row['num_comments'])
    print('page_rank: %s' % row['page_rank'])
    print(row['users'])

The results ended up showing a lot of images from National Geographic and Red Bull, which I currently don't follow, but might start to now!

Step 5: Interest-Based Analysis

Interests haven't been taken into account yet. A nice aspect of Instagram is its rich set of #hashtags used to describe photos. Let's see if we can discover my interests using the hashtags of photos I've "liked" and photos that I've been tagged in. While Instagram most likely uses click data alongside "like" data, we don't have access to clicks, so we're going to stick with likes only.

import re
def find_hastags(text):
    return re.findall(r'#(\w+)', text)

hashtags = []
#photos that I've recently liked
api.getLikedMedia()
result = api.LastJson
for r in tqdm(result['items']):
    if r['caption'] is not None:
        hashtag = find_hastags(r['caption']['text'])
        [hashtags.append(i.lower()) for i in hashtag]
        
#photos that i have been recently tagged in 
api.getSelfUserTags()
result = api.LastJson
for r in result['items']:
    if r['caption'] is not None:
        hashtag = find_hastags(r['caption']['text'])
        [hashtags.append(i.lower()) for i in hashtag]

top_hashtags = pd.Series(hashtags).value_counts()[:10]
print(top_hashtags.index)

Which gives:

Index(['adventure', 'climbing_pictures_of_Instagram', 'love', 'optoutside',
       'mtb', 'mountains', 'sunset', 'skiuphill', 'earnyourturns', 'colorado'],
      dtype='object')

Now, let's grab the most popular images for each of those tags:

interest_dict = {}
for tag in top_hashtags.index:    
    api.getHashtagFeed(tag)
    result = api.LastJson
    interest_dict[tag] = result

urls = []
num_likes = []
tags = []
for tag in interest_dict.keys():
    items = interest_dict[tag]['items']
    for i in items:
        if 'image_versions2' in i:
            urls.append(i['image_versions2']['candidates'][1]['url'])
            num_likes.append(i['like_count'])
            tags.append(tag)
    
df_tags = pd.DataFrame(
    {'urls': urls,
     'num_likes': num_likes,
     'tag': tags
    })

# keep the most-liked post per tag (idxmax keeps the url and like count from the same row)
top_tags = df_tags.loc[df_tags.groupby('tag')['num_likes'].idxmax()].set_index('tag')

Now that we have the most popular image from each hashtag feed, we can display them.

images_top_tags = []
for tag in top_tags.index:
    img = Image(top_tags.loc[tag]['urls'], format='jpeg')
    display(img)
    print(tag)
    images_top_tags.append(img)

Now let's combine these two techniques.

Step 6: Putting It All Together

You may have noticed I was saving all the collected image data to top_graph_img and images_top_tags. Let's combine them using a fairly naive technique, random sampling:

import random
all_images = top_graph_img + images_top_tags
for image in random.sample(all_images, 6):
    display(image)

That's not too shabby! I personally find some of those photos pretty cool, but it definitely could be better.

Ways to improve the discovery engine:

  • With access to the entire social graph, we could run a similar analysis with weights between nodes, determined by the number of likes and comments.
  • Combine click data alongside "like" data to take implicit feedback and engagement into account. This can be extremely useful for downgrading clickbait-style posts that don't have many likes and for surfacing the interests of users who don't tend to "like" very often.
  • Calculate image features using convolutional neural nets: remove the final dense layers, then find and display images similar to the ones the user has liked based on those features (see the sketch just after this list).
  • Integrate Facebook's social network to display images of people you're connected with.
  • Use Matrix Factorization to see if we can recommend content. You could even use image features and hashtags to construct feature vectors for hybrid techniques.
  • Use natural language processing (NLP) and clustering techniques to find similar hashtags (even ones with emojis!). This provides normalization of hashtags (bike vs. biking) and similarity metrics (nature vs. mountain).

This is by no means an exhaustive list, so if you have any other ideas, please let me know!

Happy building!

Also published on Medium.
