# From a Random Generator to a Random Function

# From a Random Generator to a Random Function

Join the DZone community and get the full member experience.

Join For Free**The open source HPCC Systems platform is a proven, easy to use solution for managing data at scale. Visit our Easy Guide to learn more about this completely free platform, test drive some code in the online Playground, and get started today.**

This week-end, I wrote a post since I had some trouble to generate a sample random sample with R, to reproduce one obtained by a co-author, with SAS (generated using Fishman and Moore (1982) used in function RANUNI). I was lucky since another contributor for that book, Christrophe Dutang, got the anwer to the last question I asked: is it possible to reproduce the random generator? Yes, we can. And it is quite simple, if you use the appropriate library and the appropriate function.

> library(randtoolbox) > a <- 397204094 > b <- 0 > m <- 2^(31)-1 > set.generator(name="congruRand", mod=m, mult=a, incr=b, seed=123) > runif(10) [1] 0.7503961 0.3209120 0.1783896 0.9060334 0.3571171 [6] 0.2211140 0.7864383 0.3980819 0.1246652 0.1876858

If you check in the previous post this is exactly what SAS gave us (and that I could not reproduce by myself). But that was only one part of my problems, since the goal was actually to reproduce indices for a training subsample for credit scoring issues.

I have to admit that I had never though about it before: how should we write a sample function? If values can be replaced, that is fine, we just have to split the unit interval correctly. Like

> set.seed(95) > (U=runif(10)) [1] 0.15171155 0.57584087 0.05309844 0.07044485 0.48887914 0.15276707 [7] 0.37405684 0.30006096 0.96997126 0.30373498 > set.seed(95) > (R=sample(0:99,size=10,replace=TRUE)) [1] 15 57 5 7 48 15 37 30 96 30

Here, we just truncate from the values obtained from the random generator. And that is just fine. But how do we write a code to sample *without replacements*? I mean, how do you get that :

> (S=sample(0:99,size=10,replace=FALSE)) [1] 15 57 5 6 46 14 35 27 89 92

My initial idea was to use the following technique. The first value is easy to get: we just split the unit interval into 100 subdivision (as before since for the first value, replacement or not, we don’t care) and see in which interval the random value is. And we remove that value from our sample. Then, we split the unit intervall into 99 subdivision, and see in which interval the random value is. It is the 10th? Fine, then our second value is the 10th from our sample (the first value has been removed). Then we split the unit interval in 98 subdivision, etc. The code I wrote to produce that algorithm was the following,

> mysample1=function(N,unif){ + n=length(N) + size=length(unif) + V0=N[trunc(unif[1]*n)+1] + N=N[-which(N==V0)] + V=V0 + for(i in 2:length(unif)){ + V0=N[trunc(unif[i]*(n-i+1))+1] + N=N[-which(N==V0)] + V=c(V,V0)} + return(V)}

Unfortunetely, I could not reproduce the sample obtained with the R function,

> mysample1(0:99,unif=U) [1] 15 58 5 7 49 17 39 31 97 32 > S [1] 15 57 5 6 46 14 35 27 89 92

Since Christrophe is an expert on random generators, I did ask him, one more time. And he came up with the following code,

> mysample2=function(N,unif){ + integerset=1:length(N) + result=rep(NA,length(unif)) + for(i in 1:length(unif)){ + intchosen=integerset[ceiling(U[i]*(length(N)-i+1))] + integerset[intchosen]=length(integerset) + integerset=integerset[-length(integerset)] + result[i]=intchosen} + return(N[result])}

which works just fine !

> mysample2(0:99,unif=U) [1] 15 57 5 6 46 14 35 27 89 92 > S [1] 15 57 5 6 46 14 35 27 89 92

So now, not only can we reproduce random numbers obtained with other software, we can also obtain the same samples indices, with or without replacement ! Thanks Christophe !

**[May, 15th]** Note that this is note the generator used in SAS, actually. In order to reproduce the sample function of SAS, the algorithm is much more simple, by clearly not that efficicient since we generate a random sample of size 100 (if we *have* 100 observations), and then, we keep the values associated to the indices of the 10 smallest (if we want a sample of size 10). The code could be

> mysample3=function(N,unif,size){ + q=sort(unif)[size] + return(N[U<=q])} > library(randtoolbox) > a <- 397204094 > b <- 0 > m <- 2^(31)-1 > set.generator(name="congruRand", mod=m, mult=a, incr=b, seed=123) #OK > U=runif(100) > mysample3(1:100,U,size=10) [1] 27 37 47 59 60 71 75 82 84 87

Thanks Jean-Philippe for the idea (which works).

**Managing data at scale doesn’t have to be hard. Find out how the completely free, open source HPCC Systems platform makes it easier to update, easier to program, easier to integrate data, and easier to manage clusters. Download and get started today.**

Published at DZone with permission of Arthur Charpentier , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

## {{ parent.title || parent.header.title}}

## {{ parent.tldr }}

## {{ parent.linkDescription }}

{{ parent.urlSource.name }}