Over a million developers have joined DZone.

Using Python to Extract Excel Spreadsheet Into CSV Files

DZone's Guide to

Using Python to Extract Excel Spreadsheet Into CSV Files

I want to create a CSV file for each Excel sheet so that I can import the data set into Neo4j using the LOAD CSV command.

· Big Data Zone ·
Free Resource

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

I’ve been playing around with the Road Safety open data set and the download comes with several CSV files and an Excel spreadsheet containing the legend.

There are 45 sheets in total and each of them looks like this:

2015 08 17 23 33 19

I wanted to create a CSV file for each sheet so that I can import the data set into Neo4j using the LOAD CSV command.

I came across the Python Excel website which pointed me at the xlrd library since I’m working with a pre 2010 Excel file.

The main documentation is very extensive but I found the github example much easier to follow.

I ended up with the following script which iterates through all but the first two sheets in the spreadsheet – the first two sheets contain instructions rather than data:

from xlrd import open_workbook
import csv

wb = open_workbook('Road-Accident-Safety-Data-Guide-1979-2004.xls')

for i in range(2, wb.nsheets):
    sheet = wb.sheet_by_index(i)
    print sheet.name
    with open("data/%s.csv" %(sheet.name.replace(" ","")), "w") as file:
        writer = csv.writer(file, delimiter = ",")
        print sheet, sheet.name, sheet.ncols, sheet.nrows

        header = [cell.value for cell in sheet.row(0)]

        for row_idx in range(1, sheet.nrows):
            row = [int(cell.value) if isinstance(cell.value, float) else cell.value
                   for cell in sheet.row(row_idx)]

I’ve replaced spaces in the sheet name so that the file name on a disk is a bit easier to work with. For some reason the numeric values were all floats whereas I wanted them as ints so I had to explicitly apply that transformation.

Here are a few examples of what the CSV files look like:

$ cat data/1stPointofImpact.csv
0,Did not impact
-1,Data missing or out of range

$ cat data/RoadType.csv
2,One way street
3,Dual carriageway
6,Single carriageway
7,Slip road
12,One way street/Slip road
-1,Data missing or out of range

$ cat data/Weather.csv
1,Fine no high winds
2,Raining no high winds
3,Snowing no high winds
4,Fine + high winds
5,Raining + high winds
6,Snowing + high winds
7,Fog or mist
-1,Data missing or out of range

And that’s it. Not too difficult!

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub.  Join the discussion.

python ,big data ,excel spreadsheet ,csv file ,load csv

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}