
A Delicious Analysis: Topic Modelling Using Recipes


A few months ago, I saw a link on Twitter to an awesome graph charting the similarities of different foods based on their flavour compounds, in addition to their prevalence in recipes (see the whole study, The Flavor Network and the Principles of Food Pairing). I thought this was really neat and became interested in using the data for something slightly different: figuring out which ingredients tend to correlate across recipes. I emailed one of the authors, Yong-Yeol Ahn, who is a real mensch by the way, and he let me know that the raw recipe data is readily available on his website!

Given my goal of looking for which ingredients correlate across recipes, I figured this would be the perfect opportunity to use topic modelling (here I use Latent Dirichlet Allocation or LDA).  Usually in topic modelling you have a lot of filtering to do.  Not so with these recipe data, where all the words (ingredients) involved in the corpus are of potential interest, and there aren’t even any punctuation marks!  The topics coming out of the analysis would represent clusters of ingredients that co-occur with one another across recipes, and would possibly teach me something about cooking (of which I know precious little!).

All my code is at the bottom, so what you'll find up here is mostly graphs and my textual summary. The first thing I did was to put the 3 raw recipe files together using Python. Each file consisted of one recipe per line, with the cuisine of the recipe as the first entry on the line, and all other entries (the ingredients) separated by tab characters. In my Python script, I separated out the cuisines from the ingredients and created two files: one for the recipes, and one for the cuisines of the recipes.
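To make the file layout concrete, here is a minimal sketch of splitting one such line into its cuisine and its ingredients. The helper name `split_recipe_line` and the sample lines are made up for illustration (they are not taken from the data set), and this is not the script used for the analysis, which appears at the bottom of the post.

```python
def split_recipe_line(line):
    """Split one tab-separated recipe line into (cuisine, ingredients)."""
    fields = line.rstrip("\n").split("\t")
    # The first field is the cuisine; every remaining non-empty field is an ingredient
    cuisine = fields[0]
    ingredients = [field for field in fields[1:] if field]
    return cuisine, ingredients


# Made-up sample lines in the format described above (an assumption, not real data)
sample_lines = [
    "Italian\ttomato\tgarlic\tbasil\n",
    "American\tegg\twheat\tbutter\tmilk\n",
]

cuisines, recipes = [], []
for line in sample_lines:
    cuisine, ingredients = split_recipe_line(line)
    cuisines.append(cuisine)
    recipes.append(ingredients)

print(cuisines)
print(recipes)
```

Keeping the cuisines in a list parallel to the recipes means the two can be written to separate files and still be matched up line by line later.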

Then I loaded up the recipes into R and got word/ingredient counts.  As you can see below, the 3 most popular ingredients were egg, wheat, and butter.  It makes sense, considering the fact that roughly 70% of all the recipes fall under the “American” cuisine.  I did this analysis for novelty’s sake, and so I figured I would take those ingredients out of the running before I continued on.  Egg makes me fart, wheat is not something I have at home in its raw form, and butter isn’t important to me for the purpose of this analysis!
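The counting and filtering step can be sketched in a few lines of Python as well. This is only an illustration on hypothetical toy recipes, assuming each recipe is already a list of ingredient names; the real counts in this post came from R's tm package.

```python
from collections import Counter

# Hypothetical toy recipes, each a list of ingredients (not the real data)
recipes = [
    ["egg", "wheat", "butter", "milk"],
    ["egg", "wheat", "vanilla"],
    ["tomato", "garlic", "basil"],
    ["egg", "butter"],
]

# Count how many recipes each ingredient appears in; set() avoids counting
# an ingredient twice within a single recipe
recipe_counts = Counter(ing for recipe in recipes for ing in set(recipe))

# Ingredients to leave out of the rest of the analysis, as in the post
excluded = {"egg", "wheat", "butter"}
filtered = [[ing for ing in recipe if ing not in excluded] for recipe in recipes]

# Drop any recipe that has no ingredients left after filtering
filtered = [recipe for recipe in filtered if recipe]

print(recipe_counts.most_common(3))
print(filtered)
```

Dropping the emptied recipes matters because a document with no terms carries no information for the topic model.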

[Figure: Recipe Popularity of Top 30 Ingredients]

Here are the top ingredients with those three filtered out:

[Figure: Recipe Popularity of Top 30 Ingredients - No Egg, Wheat, or Butter]

Finally, I ran the LDA, extracting 50 topics, and the top 5 most characteristic ingredients of each topic.  You can see the full complement of topics at the bottom of my post, but I thought I’d review some that I find intriguing.  You will, of course, find other topics intriguing, or some to be bizarre and inappropriate (feel free to tell me in the comment section).  First, topic 4:

[1] "tomato"  "garlic"  "oregano" "onion"   "basil"

Here’s a cluster of ingredients that seems decidedly Italian.  The ingredients seem to make perfect sense together, and so I think I’ll try them together next time I’m making pasta (although I don’t like tomatoes in their original form, just tomato sauce).

Next, topic 19:

[1] "vanilla" "cream"   "almond"  "coconut" "oat"

This one caught my attention, and I’m curious if the ingredients even make sense together. Vanilla and cream make sense… Adding coconut would seem to make sense as well. Almond would give it that extra crunch (unless it’s almond milk!). I don’t know whether it would be tasty, however, so I’ll probably pass this one by.

Next, topic 20:

[1] "onion"         "black_pepper"  "vegetable_oil" "bell_pepper"   "garlic"

This one looks tasty!  I like spicy foods and so putting black pepper in with onion, garlic and bell pepper sounds fun to me!

Next, topic 23:

[1] "vegetable_oil" "soy_sauce"     "sesame_oil"    "fish"          "chicken"

Now we’re into the meaty zone!  I’m all for putting sauces/oils onto meats, but putting vegetable oil, soy sauce, and sesame oil together does seem like overkill.  I wonder whether soy sauce shows up with vegetable oil or sesame oil separately in recipes, rather than linking them all together in the same recipes.  I’ve always liked the extra salty flavour of soy sauce, even though I know it’s horrible for you as it has MSG in it.  I wonder what vegetable oil, soy sauce, and chicken would taste like.  Something to try, for sure!
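The soy sauce question could be checked directly by counting how many recipes contain each pair versus all three ingredients. Here is a minimal sketch, assuming the recipes are already lists of ingredient names; the `cooccurrence` helper and the toy recipes are hypothetical and only show the idea.

```python
def cooccurrence(recipes, *ingredients):
    """Return how many recipes contain every one of the given ingredients."""
    wanted = set(ingredients)
    return sum(1 for recipe in recipes if wanted <= set(recipe))


# Hypothetical toy recipes (not the real data)
recipes = [
    ["soy_sauce", "vegetable_oil", "chicken"],
    ["soy_sauce", "sesame_oil", "fish"],
    ["soy_sauce", "sesame_oil", "vegetable_oil", "chicken"],
    ["vegetable_oil", "onion"],
]

print(cooccurrence(recipes, "soy_sauce", "vegetable_oil"))
print(cooccurrence(recipes, "soy_sauce", "sesame_oil"))
print(cooccurrence(recipes, "soy_sauce", "vegetable_oil", "sesame_oil"))
```

If the pair counts are much larger than the count for all three, the topic is probably blending two kinds of recipes rather than describing one recipe that uses both oils.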

Now, topic 26:

[1] "cumin"      "coriander"  "turmeric"   "fenugreek"  "lemongrass"

These are a whole lot of spices that I never use on my food.  Not for lack of wanting, but rather out of ignorance and laziness.  One of my co-workers recently commented that cumin adds a really nice flavour to food (I think she called it “middle eastern”).  I’ve never heard a thing about the other spices here, but why not try them out!

Next, topic 28:

[1] "onion"       "vinegar"     "garlic"      "lemon_juice" "ginger"

I tend to find that anything with an intense flavour can be very appetizing for me.  Spices, vinegar, and anything citric are what really register on my tongue.  So, this topic does look very interesting to me, probably as a topping or a sauce.  It’s interesting that ginger shows up here, as that neutralizes other flavours, so I wonder whether I’d include it in any sauce that I make?

Last one!  Topic 41:

[1] "vanilla"  "cocoa"    "milk"     "cinnamon" "walnut"

These look like the kinds of ingredients for a nice drink of some sort (would you crush the walnuts?  I’m not sure!)

Well, I hope you enjoyed this as much as I did!  It’s not a perfect analysis, but it definitely is a delicious one :)  Again, feel free to leave any comments about any of the ingredient combinations, or questions that you think could be answered with a different analysis!

import os
rfiles = os.listdir('.')
rc = []
for f in rfiles:
    if '.txt' in f:
        # The recipes come in 3 txt files consisting of 1 recipe per line, the
        # cuisine of the recipe as the first entry in the line, and all subsequent ingredient
        # entries separated by a tab
        infile = open(f, 'r')
        # Read the whole file and keep its text so the files can be joined below
        rc.append(infile.read())
        infile.close()
all_rs = '\n'.join(rc)
import re
line_pat = re.compile('[A-Za-z]+\t.+\n')
recipe_lines = line_pat.findall(all_rs)
new_recipe_lines = []
cuisine_lines = []
for n,r in enumerate(recipe_lines):
    # First we find the cuisine of the recipe
    cuisine = r[:r.find('\t')]
    # Then we append the ingredients without the cuisine (everything after the first tab)
    new_recipe_lines.append(r[r.find('\t') + 1:])
    # I saved the cuisines to a different list in case I want to do some 
    # cuisine analysis later
    cuisine_lines.append(cuisine + '\n')

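# Write the recipes and the cuisines to their own files (each entry already ends in a newline)
outfile1 = open('recipes combined.tsv', 'w')
outfile1.writelines(new_recipe_lines)
outfile1.close()

outfile2 = open('cuisines.csv', 'w')
outfile2.writelines(cuisine_lines)
outfile2.close()

# The rest of the code is R
library(tm)
library(topicmodels)
library(ggplot2)

recipes = readLines('recipes combined.tsv')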

# Once I read it into R, I have to get rid of the \t
# characters so that it's more acceptable to the tm package

recipes.new = apply(as.matrix(recipes), 1, function (x) gsub('\t',' ', x))

recipes.corpus = Corpus(VectorSource(recipes.new))
recipes.dtm = DocumentTermMatrix(recipes.corpus)

# Now I filter out any terms that have shown up in less than 10 documents

recipes.dict = Dictionary(findFreqTerms(recipes.dtm,10))
recipes.dtm.filtered = DocumentTermMatrix(recipes.corpus, list(dictionary = recipes.dict))

# Here I get a count of number of ingredients in each document
# with the intent of deleting any documents with 0 ingredients

ingredient.counts = apply(recipes.dtm.filtered, 1, function (x) sum(x))
recipes.dtm.filtered = recipes.dtm.filtered[ingredient.counts > 0, ]

# Here I get some simple ingredient frequencies so that I can plot them and decide
# which I'd like to filter out

recipes.m = as.matrix(recipes.dtm.filtered)
popularity.of.ingredients = sort(colSums(recipes.m), decreasing=TRUE)
popularity.of.ingredients = data.frame(ingredients = names(popularity.of.ingredients), num_recipes=popularity.of.ingredients)
popularity.of.ingredients$ingredients = reorder(popularity.of.ingredients$ingredients, popularity.of.ingredients$num_recipes)


ggplot(popularity.of.ingredients[1:30,], aes(x=ingredients, y=num_recipes)) + geom_point(size=5, colour="red") + coord_flip() +
ggtitle("Recipe Popularity of Top 30 Ingredients") +
theme(axis.text.x=element_text(size=13,face="bold", colour="black"), axis.text.y=element_text(size=13,colour="black",
face="bold"), axis.title.x=element_text(size=14, face="bold"), axis.title.y=element_text(size=14,face="bold"))

# Having found wheat, egg, and butter to be the three most frequent ingredients
# (and not caring too much about them as ingredients in general) I remove them 
# from the corpus and redo the document term matrix

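recipes.corpus = tm_map(recipes.corpus, removeWords, c("wheat","egg","butter"))
# Go back to line 6 of the original script (presumably the document-term matrix
# steps above) and re-run from there on the cleaned corpus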
recipes.dtm.final = DocumentTermMatrix(recipes.corpus, list(dictionary = recipes.dict))

# Finally, I run the LDA and extract the 5 most
# characteristic ingredients in each topic... yummy!

recipes.lda = LDA(recipes.dtm.filtered, 50)
t = terms(recipes.lda,5)

     Topic 1         Topic 2   Topic 3         Topic 4   Topic 5        Topic 6    Topic 7       Topic 8     Topic 9        
[1,] "onion"         "pepper"  "milk"          "tomato"  "olive_oil"    "milk"     "milk"        "tomato"    "garlic"       
[2,] "rice"          "vinegar" "vanilla"       "garlic"  "garlic"       "nutmeg"   "pepper"      "cayenne"   "cream"        
[3,] "cayenne"       "onion"   "cocoa"         "oregano" "onion"        "vanilla"  "yeast"       "olive_oil" "vegetable_oil"
[4,] "chicken_broth" "tomato"  "onion"         "onion"   "black_pepper" "cinnamon" "potato"      "garlic"    "pepper"       
[5,] "olive_oil"     "milk"    "cane_molasses" "basil"   "vinegar"      "cream"    "lemon_juice" "pepper"    "milk"         
     Topic 10        Topic 11              Topic 12        Topic 13       Topic 14    Topic 15   Topic 16        Topic 17       
[1,] "milk"          "soy_sauce"           "vegetable_oil" "onion"        "milk"      "tamarind" "milk"          "vegetable_oil"
[2,] "cream"         "scallion"            "milk"          "black_pepper" "cinnamon"  "onion"    "vanilla"       "pepper"       
[3,] "vanilla"       "sesame_oil"          "pepper"        "vinegar"      "onion"     "garlic"   "cream"         "cream"        
[4,] "cane_molasses" "cane_molasses"       "cane_molasses" "bell_pepper"  "cayenne"   "corn"     "vegetable_oil" "black_pepper" 
[5,] "cinnamon"      "roasted_sesame_seed" "cinnamon"      "bacon"        "olive_oil" "vinegar"  "garlic"        "mustard"      
     Topic 18        Topic 19  Topic 20        Topic 21        Topic 22    Topic 23        Topic 24        Topic 25       Topic 26    
[1,] "cane_molasses" "vanilla" "onion"         "garlic"        "onion"     "vegetable_oil" "onion"         "cream"        "cumin"     
[2,] "onion"         "cream"   "black_pepper"  "cane_molasses" "garlic"    "soy_sauce"     "garlic"        "tomato"       "coriander" 
[3,] "vinegar"       "almond"  "vegetable_oil" "vinegar"       "tomato"    "sesame_oil"    "cane_molasses" "chicken"      "turmeric"  
[4,] "olive_oil"     "coconut" "bell_pepper"   "black_pepper"  "olive_oil" "fish"          "tomato"        "lemon_juice"  "fenugreek" 
[5,] "pepper"        "oat"     "garlic"        "soy_sauce"     "basil"     "chicken"       "vegetable_oil" "black_pepper" "lemongrass"
     Topic 27       Topic 28      Topic 29        Topic 30    Topic 31   Topic 32        Topic 33       Topic 34        Topic 35   
[1,] "onion"        "onion"       "onion"         "onion"     "vanilla"  "garlic"        "onion"        "onion"         "garlic"   
[2,] "garlic"       "vinegar"     "celery"        "pepper"    "milk"     "onion"         "pepper"       "garlic"        "basil"    
[3,] "black_pepper" "garlic"      "chicken"       "garlic"    "garlic"   "vegetable_oil" "garlic"       "vegetable_oil" "pepper"   
[4,] "tomato"       "lemon_juice" "vegetable_oil" "parsley"   "cinnamon" "cayenne"       "black_pepper" "black_pepper"  "tomato"   
[5,] "olive_oil"    "ginger"      "carrot"        "olive_oil" "cream"    "beef"          "beef"         "chicken"       "olive_oil"
     Topic 36        Topic 37       Topic 38        Topic 39  Topic 40      Topic 41   Topic 42        Topic 43   Topic 44       
[1,] "onion"         "onion"        "onion"         "cayenne" "garlic"      "vanilla"  "vanilla"       "scallion" "milk"         
[2,] "garlic"        "garlic"       "cream"         "garlic"  "onion"       "cocoa"    "cane_molasses" "garlic"   "tomato"       
[3,] "cayenne"       "black_pepper" "tomato"        "ginger"  "bell_pepper" "milk"     "cocoa"         "ginger"   "garlic"       
[4,] "vegetable_oil" "lemon_juice"  "cane_molasses" "rice"    "olive_oil"   "cinnamon" "oat"           "soybean"  "vegetable_oil"
[5,] "oregano"       "scallion"     "milk"          "onion"   "milk"        "walnut"   "milk"          "pepper"   "cream"        
     Topic 45       Topic 46        Topic 47        Topic 48 Topic 49        Topic 50         
[1,] "onion"        "cream"         "pepper"        "cream"  "milk"          "olive_oil"      
[2,] "cream"        "black_pepper"  "vegetable_oil" "tomato" "vanilla"       "tomato"         
[3,] "black_pepper" "chicken_broth" "garlic"        "beef"   "lard"          "parmesan_cheese"
[4,] "milk"         "vegetable_oil" "onion"         "garlic" "cocoa"         "lemon_juice"    
[5,] "cinnamon"     "garlic"        "olive_oil"     "carrot" "cane_molasses" "garlic"   
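For anyone curious about what the LDA step is doing under the hood, here is a small pure-Python sketch that fits a topic model by collapsed Gibbs sampling and then lists the top terms per topic. Note the swap: the R code above calls LDA() from the topicmodels package, which as far as I know defaults to a variational EM fit, so this sketch illustrates the same model with a different fitting method and will not reproduce the table above. The function names, the hyperparameters (`alpha`, `beta`, `n_iter`) and the toy recipes are all assumptions made for illustration.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return (topic-word counts, vocab)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]            # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assignments = []

    # Start by giving every token a random topic
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Take the token's current assignment out of the counts
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalised probability of each topic for this token
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                # Sample a new topic and put the token back into the counts
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return topic_word, vocab


def top_terms(topic_word, vocab, n=5):
    """Top-n words in each topic, ranked by how many tokens the topic holds."""
    return [
        [vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
        for row in topic_word
    ]


# Hypothetical toy recipes (not the real data)
recipes = [
    ["tomato", "garlic", "basil", "oregano"],
    ["tomato", "garlic", "onion", "basil"],
    ["vanilla", "cocoa", "milk", "cinnamon"],
    ["vanilla", "milk", "cream", "cinnamon"],
]

topic_word, vocab = lda_gibbs(recipes, n_topics=2)
for k, terms in enumerate(top_terms(topic_word, vocab, n=3), start=1):
    print("Topic", k, terms)
```

With a corpus this tiny the topics are not guaranteed to split cleanly into savoury and sweet, and the result depends on the random seed. The point is only to show the mechanics: each ingredient occurrence is repeatedly reassigned to a topic based on what else is in its recipe and where that ingredient sits elsewhere.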



Published at DZone with permission of Matthew Dubins, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
