Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Extracting Datasets from Excel Files in a Zipped Folder

DZone's Guide to

Extracting Datasets from Excel Files in a Zipped Folder

· Big Data Zone
Free Resource

Effortlessly power IoT, predictive analytics, and machine learning applications with an elastic, resilient data infrastructure. Learn how with Mesosphere DC/OS.

The title of the post is a bit long, but that’s the problem I was facing this morning: importing datasets from files, online. I mean, it was not a “problem” (since I can always download and extract the files manually), more a challenge (I should be able to do it in R, directly). The files are located on ressources-actuarielles.net, in a zip file. Those are mortality tables used in French-speaking African countries, and I guess that one problem came from special characters, such as “é” or “è”… When you open the zip file, you see a folder:

And in that folder, several files that I would like to import:

My first code was quite standard, unfortunately, it did not work

> library(xlsx)
> loc.url <- "http://ressources-a...CIMA.zip"
> temp <- tempfile()
> download.file(loc.url,temp)
trying URL 'http://ressources-a...CIMA.zip'
Content type 'application/x-zip' length 503786 bytes (491 Kb)
opened URL
downloaded 491 Kb

> list.files <- unzip(temp,list=TRUE)
> data <- read.xlsx(unz(temp,
+ list.files$Name[1]),sheetIndex=1)
Error in path.expand(file) : invalid 'path' argument

Indeed, there was a problem with the path,

> list.files$Name[1]
[1] "Tables de mortalit‚ CIMA/CIMA F.xlsx"

Thankfully, as usual, @3wen came to rescue me, and suggested the following,

> Sys.setlocale("LC_ALL", "C")
[1] "C"
> loc.url <- "http://ressources-a...CIMA.zip"
> td <- tempdir() 
> tf <- tempfile(tmpdir=td, fileext=".zip") 
> download.file(loc.url, tf) 
trying URL 'http://ressources-a...CIMA.zip' Content type 'application/x-zip' length 503786 bytes (491 Kb) opened URL 
============================================= downloaded 491 Kb   

> fname <- unzip(tf, list=TRUE)$Name[1] 
> unzip(tf, files=fname, exdir=td,
+ overwrite=TRUE) 
> fpath <- file.path(td, fname)

That was it… then @3wen suggested another package to read the xlsx file (that works well on my Linux laptop, not on Windows)

> data <- read.xls(fpath, sheet = 1)
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = "fr_CA:fr",
    LC_ALL = (unset),
    LANG = "fr_CA.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
> Sys.setlocale()
[1] ""

> head(data)
  Age Lx..CIMA.F. qx..CIMA.F. dx..CIMA.F.
1   0   1000000.0     0.2849%   2848.5034
2   1    997151.5     0.0387%    385.8644
3   2    996765.6     0.0296%    295.2717
4   3    996470.4     0.0252%    251.5163
5   4    996218.8     0.0216%    214.9381
6   5    996003.9     0.0197%    195.9929

That was more subtle than expected…

Learn to design and build better data-rich applications with this free eBook from O’Reilly. Brought to you by Mesosphere DC/OS.

Topics:

Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

THE DZONE NEWSLETTER

Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

X

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}