
How to Import Some Parts of a Large Database


In the introduction of Computational Actuarial Science with R, there is a short paragraph on how we can import only some parts of a large database, by selecting specific variables. The trick is to use the following function:

read.table.select.columns = function(datatablename, I, sep = ";") {
  # read only the header row, to get the column names
  datanc = read.table(datatablename, header = TRUE, sep = sep,
                      skip = 0, nrows = 1)
  # "NULL" (the string) in colClasses tells read.table to skip a column
  mycols = rep("NULL", ncol(datanc))
  names(mycols) = names(datanc)
  # NA lets read.table guess the class of the columns we keep
  mycols[I] = NA
  datat = read.table(datatablename, header = TRUE, sep = sep,
                     colClasses = mycols)
  return(datat)
}
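The function relies on a documented feature of read.table: any column whose colClasses entry is the string "NULL" is skipped entirely while the file is parsed. A minimal sketch of that mechanism, using a small made-up inline table (read from a string rather than a file) instead of a real database:

```r
# hypothetical three-column table, passed via the text= argument
txt = "a;b;c
1;x;10
2;y;20"
# keep columns "a" and "c"; "NULL" drops column "b" at parsing time
d = read.table(text = txt, header = TRUE, sep = ";",
               colClasses = c(a = NA, b = "NULL", c = NA))
print(d)       # only columns a and c
print(ncol(d)) # 2
```

Since the dropped columns are never stored, this saves memory on wide files, not just post-processing.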

For instance, if we use the same dataset as in the introduction, we can import only two variables of interest,

> loc="http://myweb.fsu.edu/jelsner/extspace/extremedatasince1899.csv"
> dt1=read.table.select.columns(loc,c("Region","Wmax"),sep=",")
> head(dt1,10)
    Region      Wmax
1    Basin 105.56342
2    Basin  40.00000
3    Basin  35.41822
4    Basin  51.06743
5  Florida  87.34328
6    Basin  96.64138
7     Gulf  35.41822
8       US  35.41822
9       US  87.34328
10      US 106.35318
> dim(dt1)
[1] 2100    2

In other cases, it might be interesting to select only some rows, or to skip some of them (e.g. because of typos in the original dataset). If we want to drop specific rows, we can use

read.table.drop.rows = function(datatablename, I, sep = ";") {
  I = sort(I)
  # if row 1 is kept, the first chunk starts right after the header
  if (min(I) > 1) minI = 1
  if (min(I) == 1) minI = NULL
  # index1: number of lines to skip before each chunk of kept rows
  # index2: last line of each chunk (the header counts as one line)
  index1 = c(minI, I[c(which(diff(I) > 1), length(I))] + 1)
  index2 = c(I[c(minI, which(diff(I) > 1) + 1)], max(index1) - 1)
  # read the header row only, then empty the data frame
  datat = read.table(datatablename, header = TRUE, sep = sep,
                     skip = 0, nrows = 1)
  datat = datat[-1, ]
  # read the file chunk by chunk, jumping over the rows to drop
  # (the last chunk has nrows < 0, i.e. read until the end of the file)
  for (i in 1:length(index1)) {
    datat0 = read.table(datatablename, header = FALSE, sep = sep,
                        skip = index1[i], nrows = index2[i] - index1[i])
    names(datat0) = names(datat)
    datat = rbind(datat, datat0)
  }
  return(datat)
}
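To see how the chunk boundaries work, take the rows dropped in the call just below, I = c(3, 6:8): the function should read data rows 1-2, then 4-5, then 9 to the end of the file. Running the same index arithmetic as in the function on this small example:

```r
I = sort(c(3, 6:8))
if (min(I) > 1) minI = 1
if (min(I) == 1) minI = NULL
index1 = c(minI, I[c(which(diff(I) > 1), length(I))] + 1)
index2 = c(I[c(minI, which(diff(I) > 1) + 1)], max(index1) - 1)
# lines to skip before each chunk (the header counts as one line)
print(index1)           # 1 4 9
# nrows for each chunk; the last one is negative, i.e. "read to the end"
print(index2 - index1)  # 2 2 -1
```

So the file is opened three times, each pass skipping over one block of unwanted rows.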

On the same dataset, assume that we have trouble reading some lines, or that we know some values are not valid:

> dt2=read.table.drop.rows(loc,c(3,6:8),sep=",")
> head(dt2[,1:5],10)
     Yr  Region      Wmax      sst sun
1  1899   Basin 105.56342 0.046596 8.4
2  1899   Basin  40.00000 0.046596 8.4
3  1899   Basin  51.06743 0.046596 8.4
4  1899 Florida  87.34328 0.046596 8.4
5  1899      US  87.34328 0.046596 8.4
6  1899      US 106.35318 0.046596 8.4
7  1899      US  51.06743 0.046596 8.4
8  1899      US  90.19791 0.046596 8.4
9  1899   Basin  56.48593 0.046596 8.4
10 1899   Basin 131.26902 0.046596 8.4
> dim(dt2)
[1] 2096   11

Now, we can try to do both at the same time (more or less): for some specific variables, we want to import a subpart of the database, while also skipping some specific lines,

read.table.select.columns.rows = function(datatablename, Ic, Ir, sep = ";") {
  # column selection, as before: "NULL" drops a column, NA keeps it
  datanc = read.table(datatablename, header = TRUE, sep = sep,
                      skip = 0, nrows = 1)
  mycols = rep("NULL", ncol(datanc))
  names(mycols) = names(datanc)
  mycols[Ic] = NA
  # row selection, as before: boundaries of the chunks of kept rows
  I = sort(Ir)
  if (min(I) > 1) minI = 1
  if (min(I) == 1) minI = NULL
  index1 = c(minI, I[c(which(diff(I) > 1), length(I))] + 1)
  index2 = c(I[c(minI, which(diff(I) > 1) + 1)], max(index1) - 1)
  datat = read.table(datatablename, header = TRUE, sep = sep,
                     skip = 0, nrows = 1, colClasses = mycols)
  datat = datat[-1, ]
  for (i in 1:length(index1)) {
    datat0 = read.table(datatablename, header = FALSE, sep = sep,
                        skip = index1[i], nrows = index2[i] - index1[i],
                        colClasses = mycols)
    names(datat0) = names(datat)
    datat = rbind(datat, datat0)
  }
  return(datat)
}

> dt3=read.table.select.columns.rows(loc,c("Wmax","Region"),c(3,6:8),sep=",")
> head(dt3)
   Region      Wmax
1   Basin 105.56342
2   Basin  40.00000
3   Basin  51.06743
4 Florida  87.34328
5      US  87.34328
6      US 106.35318
> dim(dt3)
[1] 2096    2

One can check that we obtain the same dataset here (up to the order of the columns and the names of the rows) as with

> df=read.table(loc,header=TRUE,sep=",")
> sdf=df[-c(3,6:8),c("Wmax","Region")]
> head(sdf)
        Wmax  Region
1  105.56342   Basin
2   40.00000   Basin
4   51.06743   Basin
5   87.34328 Florida
9   87.34328      US
10 106.35318      US
> dim(sdf)
[1] 2096    2

But if the dataset were much larger (with thousands of variables), and with problems on some specific lines, we now have some handy code to import the database.
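For the column-selection part specifically, note that some packages provide this out of the box: to my knowledge, data.table's fread has a select argument that keeps only the named columns while parsing. A small sketch, assuming the data.table package is installed, again on an inline table:

```r
library(data.table)  # assumption: the data.table package is available

# same idea as read.table.select.columns, on a made-up inline table
txt = "Yr,Region,Wmax
1899,Basin,105.6
1899,Gulf,35.4"
dt = fread(text = txt, select = c("Region", "Wmax"))
print(dt)  # only the two requested columns
```

On a real file, one would pass the file path (or the URL) instead of text=.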



Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.

