Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Taxonomy with R: Exploring the Taxize-Package

DZone's Guide to

Taxonomy with R: Exploring the Taxize-Package

· Big Data Zone ·
Free Resource

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

First off, I'd really like to give a shout-out to the brave people who have created and maintain this great package - the fame is yours!

So, while exploring the capabilities of the package some issues with the ITIS-Server arose and with large datasets things weren't working out quite well for me.

I then switched to the NCBI API and saw that the result is much better here (way quicker, on first glance also a higher coverage). 

At the time there is no taxize-function that will pull taxonomic details from a classification returned by NCBI, that's why I plugged together a little wrapper - see here:

# some species data:
spec <- data.frame("Species" = I(c("Bryum schleicheri", "Bryum capillare", "Bryum argentum", "Escherichia coli", "Glis glis")))
spl <- strsplit(spec$Species, " ")
spec$Genus <- as.character(sapply(spl, "[[", 1))
 
# for pulling taxonomic details we'd best submit higher rank taxons
# in this case Genera. Then we'll submit Genus Bryum only once and 
# save some computation time (might be an issue if you deal 
# with large datasets..)
 
gen_uniq <- unique(spec$Genus)
 
# function for pulling classification details ("phylum" in this case)
get_sys_level <- function(x){ require(taxize)
                              a <- classification(get_uid(x))
                              y <- data.frame(a[[1]])                                        # if there are multiple results, take the first..
                              z <- tryCatch(as.character(y[which(y[,2] == "phylum"), 1]),    # in case of any other errors put NA
                                            error = function(e) NA)
                              z <- ifelse(length(z) != 0, z, NA)                             # if the taxonomic detail is not covered return NA
                              return(data.frame(Taxon = x, Syslevel = z))
                             }
 
# call function and rbind the returned values 
result <- do.call(rbind, lapply(gen_uniq, get_sys_level))
print(result)
#         Taxon       Syslevel
# 1       Bryum   Streptophyta
# 2 Escherichia Proteobacteria
# 3        Glis       Chordata
 
# now merge back to the original data frame
spec_new <- merge(spec, result, by.x = "Genus", by.y = "Taxon")
print(spec_new)
#         Genus           Species       Syslevel
# 1       Bryum Bryum schleicheri   Streptophyta
# 2       Bryum   Bryum capillare   Streptophyta
# 3       Bryum    Bryum argentum   Streptophyta
# 4 Escherichia  Escherichia coli Proteobacteria
# 5        Glis         Glis glis       Chordata
#

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub.  Join the discussion.

Topics:

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}