1. The function apply() is certainly one of the most useful functions in R. I was scared of it for a while and refused to use it. But it makes code so much faster to write, and so efficient, that we can't afford not to use it. If you are like me and avoid apply() because it is scary, the following lines will help you. You want to know how to use apply() in general, with a home-made function, or with several parameters? Then have a look at the following examples.




    The function apply() is known to be extremely useful for improving the speed of our code when we want to apply a function to a matrix or a vector. I use it, but I did not really know how it works.

    What is it?

    apply() is an R function which performs quick operations on a matrix, vector or array. The operations can be done on the rows, the columns, or even both of them.

    How does it work?

    The pattern is really simple: apply(variable, margin, function).
    -variable is the variable you want to apply the function to.
    -margin specifies whether you want to apply the function by row (margin = 1), by column (margin = 2), or to each element (margin = 1:2). margin can even be greater than 2 if we work with variables of dimension greater than two.
    -function is the function you want to apply to the elements of your variable.

    Because I think an example is clearer than anything else, here is the most important example for understanding the function apply().


    Introduction:

    #the matrix we will work on:
    a = matrix(1:15, nrow = 5, ncol = 3)

    #will apply the function mean to all the elements of each row
    apply(a, 1, mean)
    # [1]  6  7  8  9 10

    #will apply the function mean to all the elements of each column
    apply(a, 2, mean)
    # [1]  3  8 13

    #will apply the function mean to each element, i.e. to every row-column combination
    apply(a, 1:2, mean)
    #     [,1] [,2] [,3]
    # [1,]    1    6   11
    # [2,]    2    7   12
    # [3,]    3    8   13
    # [4,]    4    9   14
    # [5,]    5   10   15
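    As a side note, for row and column means specifically, base R also provides the optimized shortcuts rowMeans() and colMeans(), which return the same results as the first two apply() calls above:

    rowMeans(a)
    # [1]  6  7  8  9 10
    colMeans(a)
    # [1]  3  8 13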

    We have just worked on the different margins to show the basic possibilities. But, as I said, we can also work on other variables, such as an array of dimension 3:

    #apply() for an array of higher dimension:
    a = array(1:8, dim = c(2,2,2))
    a

    # , , 1
    #
    #      [,1] [,2]
    # [1,]    1    3
    # [2,]    2    4
    #
    # , , 2
    #
    #      [,1] [,2]
    # [1,]    5    7
    # [2,]    6    8

    #sum over the third margin, ie one sum per 2 x 2 slice
    apply(a, 3, sum)
    # [1] 10 26


    Use a home-made function:

    We can also use our own function. For example, we reproduce the function sum (absolutely useless but let's keep it simple!).

    f1 = function(x){
      return(sum(x))
    }

    #back to the 5 x 3 matrix from the introduction (a was still the 3D array)
    a = matrix(1:15, nrow = 5, ncol = 3)

    apply(a, 1, f1)
    # [1] 18 21 24 27 30
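    The same computation can be written inline with an anonymous function, which avoids naming a one-off helper:

    #same result, without defining f1
    apply(a, 1, function(x) sum(x))
    # [1] 18 21 24 27 30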

    Several parameters:
    A function with several parameters. Here is the main reason why I never used apply(): I did not know what to do when my function had several parameters. It is actually quite simple: we just have to pass the parameter which stays constant as an extra named argument.


    f2 = function(x1,x2){
      x1-x2
    }
    b = 2

    #with the second parameter held fixed
    apply(a, 1, f2, x2 = b)

    #      [,1] [,2] [,3] [,4] [,5]
    # [1,]   -1    0    1    2    3
    # [2,]    4    5    6    7    8
    # [3,]    9   10   11   12   13

    #with the first parameter held fixed
    apply(a, 1, f2, x1 = b)

    # [,1] [,2] [,3] [,4] [,5]
    # [1,]    1    0   -1   -2   -3
    # [2,]   -4   -5   -6   -7   -8
    # [3,]   -9  -10  -11  -12  -13
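    Note that the constant argument can also be passed positionally: apply() forwards any additional arguments to each call of the function.

    #equivalent to apply(a, 1, f2, x2 = b), since x2 is the second argument of f2
    apply(a, 1, f2, b)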

    I hope these examples will help you. Personally, I have been using apply() ever since I went through these few examples. Of course there are many other details which can be useful, but these first examples cover the most useful possibilities of apply().

  2. Have you ever played the board game "Guess Who?"? For those who have not experienced childhood (which might be the only excuse for not knowing this board game), it is a game in which you try to guess which character the opposing player is thinking of among a list of characters - we will call the one he chooses the "chosen character". These characters have several characteristics, such as gender, having brown hair or wearing glasses. To find out, you are only allowed to ask questions expecting a yes-no answer.



    This game has been taken to a whole new level of complexity by the funny and impressive website Akinator. The software tries to guess who we are thinking of by asking yes-no questions.

    With a friend/colleague/classmate of mine, Pierre Cordier, we wondered how it works and what the fastest way to find the answer would be. In particular, is it better to ask about a very balanced characteristic, such as "Is it a female?", or about a very unbalanced one, such as "Is it an alien?"? The first question is very likely to remove fifty percent of the population, while the second is very likely to eliminate only a very small part of it; on the other hand, if we are lucky, we can get the answer "Yes, it is an alien" and, in this case, we will immediately know who the "chosen character" is. In other words, the mantra of the first strategy is "No risk, no reward", while the second strategy's is "High risk, high reward".

    To answer this mysterious question, we took a network approach. More specifically, we used a bipartite network linking individuals to groups, also known as an affiliation network. We used as a reference the book by David Easley and Jon Kleinberg, Networks, Crowds, and Markets: Reasoning About a Highly Connected World, introduced earlier in Spatial segregation in cities - An explanation by a neural network model. The idea is that every character belongs, or does not belong, to different institutions/groups such as "wearing glasses". In mathematical terms, we can represent this network as a matrix where every column is an institution and every row is an individual. If individual i belongs to institution j, then the entry at row i and column j is 1; otherwise it is 0.

    We first generated this matrix randomly, and then implemented two strategies. The first one looks for the characteristic with the most balanced distribution of "yes" and "no". The second strategy finds the characteristic whose distribution is closest to 90% of "yes" for 10% of "no", or 10% of "yes" for 90% of "no".
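    To make the selection rules concrete, here is a minimal sketch (independent of the full code below) of how each strategy picks the next characteristic to ask about, given the 0/1 matrix mat of the remaining candidates:

    #proportion of "yes" for each characteristic among the remaining candidates
    p = colMeans(mat)

    #50/50 strategy: the most balanced characteristic
    which.min(abs(p - 0.5))

    #90/10 strategy: the characteristic closest to a 90/10 (or 10/90) split
    which.min(pmin(abs(p - 0.9), abs(p - 0.1)))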

    We ran this simulation many times (with 100 individuals and 10 institutions) to estimate the number of questions needed to find the "chosen character". We made two observations. On average, the fifty/fifty strategy is faster than the ninety/ten strategy. Besides, the number of questions needed varies less for the first one. Therefore, the first strategy is both more efficient on average and more reliable (since its variance is lower).
    The dark blue line stands for the "90/10" strategy, and the other one for the "50/50" strategy. The horizontal lines represent the mean of each strategy. The "50/50" strategy almost always beats the "90/10" strategy.

    After 50 simulations, the probability distributions of the number of questions give a good insight into the gap between the two strategies.

    Probability distributions

    The code (R):

    # Who's who

    # install.packages("shape")

    #functions
    # creation of the individuals x institutions 0/1 matrix:
    # entry (i, j) is 1 if individual i belongs to institution j
    createMatrix = function(myNumberPeople, myNumberInstitution){
      matrix(rbinom(myNumberPeople * myNumberInstitution, 1, prob = 0.5),
             nrow = myNumberPeople, ncol = myNumberInstitution)
    }

    #calculation of the proportion of members of each institution
    proportion = function(myMatrix){
      return(colMeans(myMatrix))
    }

    #strategy based on proportion: among the institutions still available in a,
    #pick the one whose proportion is closest to p (or to 1 - p)
    closest = function(myMatrix, a, p){
      vect = proportion(myMatrix)
      for(i in 1:length(vect)){
        vect[i] = min(abs(p - vect[i]), abs((1 - p) - vect[i]))
      }
      return(a[which(vect[a] == min(vect[a]))[1]])
    }

    #does the chosen character belong to the institution we ask about?
    belong = function(myMatrix, myGuy, a, p){
      return(myGuy[closest(myMatrix, a, p)] == 1)
    }

    #keep only the candidates who answer the question like the chosen character
    elimination = function(myMatrix, myGuy, a, p){
      myMatrix = myMatrix[(belong(myMatrix, myGuy, a, p) == myMatrix[, closest(myMatrix, a, p)]), ]
      return(myMatrix)
    }

    #play a full game with the proportion-p strategy; stop when no institution
    #is left to ask about or when a single candidate profile remains
    strategy = function(myMatrix, myGuy, p){
      k = 0
      a = 1:numberInstitution
      while(length(a) > 0 & if(is.vector(myMatrix) != 1){length(myMatrix[!duplicated(myMatrix),])/numberInstitution != 1}else{FALSE}){
        b = a[-which(a == closest(myMatrix, a, p))]
        myMatrix = elimination(as.matrix(myMatrix), myGuy, a, p)
        a = b
        k = k + 1
      }
      res = list(NumberSteps = k, Candidates = myMatrix)
      return(res)
    }

    #random strategy: ask about the institutions in a random order
    createOrder = function(){
      return(sample(1:numberInstitution, numberInstitution))
    }

    belongRandom = function(myMatrix, myGuy, myVariable){
      return(myGuy[myVariable] == 1)
    }

    #keep only the candidates who answer like the chosen character
    eliminationRandom = function(myMatrix, myGuy, a){
      myMatrix = myMatrix[(belongRandom(myMatrix, myGuy, a[1]) == myMatrix[, a[1]]), ]
      return(myMatrix)
    }

    #play a full game asking the questions in a random order
    strategyRandom = function(myMatrix, myGuy){
      k = 0
      a = createOrder()
      while(length(a) > 0 & (if(is.vector(myMatrix) != 1){length(myMatrix[!duplicated(myMatrix),])/numberInstitution != 1}else{FALSE})){
        myMatrix = eliminationRandom(as.matrix(myMatrix), myGuy, a)
        a = a[-1]
        k = k + 1
      }
      res = list(NumberSteps = k, Candidates = myMatrix)
      return(res)
    }

    #dispatcher; only the "random" strategy goes through here, the proportion
    #strategies are called directly via strategy()
    application = function(myMatrix, myGuy, myStrategy){
      if(myStrategy == "random"){return(strategyRandom(myMatrix, myGuy))}
    }

    # initialization
    numberPeople = 100
    numberInstitution = 10
    memory = list("fifty" = c(), "random" = c(), "ninety" = c())
    for(i in 1:50){
      mat = createMatrix(numberPeople, numberInstitution)
      #pick the "chosen character" at random among the rows
      guy = mat[sample(1:nrow(mat), 1), ]
      memory$ninety = c(memory$ninety, strategy(mat, guy, 0.1)$NumberSteps)
      memory$random = c(memory$random, application(mat, guy, "random")$NumberSteps)
      memory$fifty = c(memory$fifty, strategy(mat, guy, 0.5)$NumberSteps)
    }
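    To read the comparison off numerically (a small addition using the memory list filled above), the mean and the variance of the number of questions can be printed for each strategy:

    #average number of questions and its variance, per strategy
    sapply(memory, mean)
    sapply(memory, var)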


    #plotting
    library(shape)
    png(filename="~/Chr10/post9_figure1.png", bg="white")
    col <- shadepalette(9, "cyan", "blue")
    plot(c(), ylim=c(4.2,10), xlim = c(3, 48), xlab ="Simulations", ylab ="Number of questions")
    #shade the area between the two curves
    polygon(c(1:50,50:1), c(memory$fifty,rev(memory$ninety)), col="#EFDECD")
    lines(memory$ninety, type = 'l', col=col[3], lwd = 2)
    abline(h=mean(memory$fifty), col=col[7], lwd = 2)
    abline(h=mean(memory$ninety), col=col[3], lwd = 2)
    lines(memory$fifty, col = col[7], lwd = 2)
    title("Difference between the two strategies")
    dev.off()

    png(filename="~/Chr10/post9_figure2.png", bg="white")
    plot(table(memory$ninety)/50, ylim=c(0,0.7), type="p",col = col[3], lwd = 4,
         main="Distribution of the number of questions
    for both strategies",
         xlab="Number of questions",
         ylab="Probability")
    lines(table(memory$fifty)/50, type="p",col = col[7], lwd = 4)
    legend(x=4, y= 0.7,legend =c("50/50", "90/10"), col=c(col[7],col[3]), pch=  1, pt.lwd = 4)
    dev.off()


  3. If you want to choose your next holiday destination randomly, you are likely to proceed in a way which is certainly biased, especially if you pick the latitude and the longitude uniformly at random. A bit like in this lovely advertisement (for those of you who do not speak French, it is about a couple who have won the national lottery and have to decide on their next trip. The husband randomly picks Australia and the wife complains: "Not again!"). So let me help you choose your next destination!


    If we are able to generate uniformly distributed variables on [0 ; 1], we can easily generate variables on other spaces, such as [0 ; 1] x [0 ; 1], the square of side 1. This can simply be done by taking two independent variables X1 and X2, each uniformly distributed on [0 ; 1], and considering X = (X1 , X2). Then X is a random variable evenly distributed on the square [0 ; 1] x [0 ; 1].

    However, the generation may be a bit more complicated for other shapes, such as a sphere. Indeed, for a sphere of a given radius, say R = 1, the natural idea is to map variables generated on a square onto the sphere.

    The first method I thought of is wrong. I was tempted to generate two variables Y1 and Y2, where Y1 follows a uniform distribution on [0 ; Pi] and Y2 a uniform distribution on [-Pi ; Pi]. In other words:
    - Y1 = Pi * X1
    - Y2 = 2 * Pi * (X2 - 0.5)
    Then I wanted to consider (Y1, Y2) as the spherical coordinates of a point on the sphere.

    However, this method does NOT generate a uniform distribution on the sphere. Indeed, it tends to over-sample the regions around the North and South Poles. The reason is simple: this method generates, on average, the same number of points at each latitude. But the circles of latitude near the poles are much shorter than the Equator, so the same number of points is crammed into a much smaller area; equivalently, the area element of the sphere shrinks to zero at the poles.

    On the following graph, we observe a high density of points in the "middle" of the sphere: this is the North Pole. It shows that this method does not give a uniform distribution. By the way, the graph is computed with the library rgl, which provides a display device for three-dimensional objects. A little code snippet provided in the rgl help then rotates the view automatically and takes a snapshot of each position. You can then assemble the snapshots into a GIF on gifmake (like for the map in Spatial segregation in cities - An explanation by a neural network model).
    A wrong method to generate evenly distributed variables on the sphere. The North Pole and the South Pole are over represented.

    Many methods have been proposed to generate evenly distributed random variables on a sphere. We present one of them here. We consider the couple z = (u , v) defined as:

    - u = 2 * Pi * X1
    - v = arccos(2 * X2 - 1)

    In this case, a theorem shows that z = (u , v) is evenly distributed on the sphere. Intuitively, cos(v) = 2 * X2 - 1 is uniform on [-1 ; 1], and a uniform z-coordinate is exactly what compensates for the shrinking circles of latitude near the poles (this is Archimedes' hat-box theorem). It can be observed on the next graph: there is no irregularity in the distribution of the random variables.
    A correct simulation of a uniform distribution on a sphere. There is no over represented area.

    Other methods exist. I like this one since it is really simple and starts from a plain uniform distribution. The idea of this post is to show that generating a uniform distribution can be adapted to many shapes and cases. However, to do so, a preliminary analytical study has to be done to find the correct transformation.

    So this method should help you avoid ending up too often in chilly places such as the North Pole or the South Pole, since it does not over-represent the extreme latitudes.

    The program (R) :

    # import package to plot in 3D
    install.packages("rgl", dependencies = TRUE)
    library(rgl)

    ################################################################
    # Uniform distribution in a square
    ################################################################

    size = 10000
    x1 = runif(size)
    x2 = runif(size)

    # the option pch = '.' changes the plotting symbol into a dot
    # cex = 2 doubles the size of the dots
    plot(x1, x2, col = 'blue', pch = '.', cex = 2)

    ################################################################
    # Wrong solution for the sphere
    ################################################################

    y1 = pi * x1
    y2 = 2* pi * (x2-0.5)
    y = matrix(0, nrow = 2, ncol = size)
    y[1,] = y1
    y[2,] = y2
    plot(y1, y2)

    # This function transforms spherical coordinates into Cartesian coordinates
    sphereToCartesian = function(matrice){
      x= matrix(0,nrow = 3, ncol = length(matrice[1,]))
      x[1,] = sin(matrice[2,]) * cos(matrice[1,])
      x[2,] = sin(matrice[2,]) * sin(matrice[1,])
      x[3,] = cos(matrice[2,])
      return(x)
    }

    a = sphereToCartesian(y)
    plot3d(x = a[1,], y = a[2,], z = a[3,])

    # you should enlarge the device window before running this if you want a meaningful graph
    rgl.bringtotop()
    rgl.viewpoint(0,20)

    for (i in 1:45) {
      rgl.viewpoint(i,20)
      filename <- paste("pic",i,".png",sep="")
      rgl.snapshot(filename, fmt="png")
    }


    ################################################################
    # Correct solution for the sphere
    ################################################################

    # generate spherical coordinates (u, v) evenly distributed on the sphere
    uniformSphere = function(length){
      x1 = runif(length, 0, 1)
      x2 = runif(length, 0, 1)
      u = 2*pi*x1
      v = acos(2*x2 - 1)
      z = matrix(0, ncol = length, nrow = 2)
      z[1,] = u
      z[2,] = v
      return(z)
    }

    z = uniformSphere(size)
    b = sphereToCartesian(z)
    plot3d(x = b[1,], y = b[2,], z = b[3,])
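
    As a quick sanity check (a minimal sketch using the matrices a and b computed above): for a uniform distribution on the sphere, the z-coordinate is itself uniform on [-1 ; 1], so its histogram should be roughly flat for the correct method and not for the wrong one.

    # z-coordinate under the naive method: piles up near -1 and 1
    hist(a[3,])
    # z-coordinate under the correct method: approximately flat
    hist(b[3,])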



  4. My previous post was about a method to simulate a Brownian motion. A friend of mine emailed me yesterday to point out that this is useless if we do not know how to simulate a normally distributed variable.

    My first remark is: use the rnorm() function if the quality of your simulation is not too important (later, I'll try to explain why the default random generation functions in R are not perfect). However, it may be fun to generate a normal distribution from a simple uniform distribution. So, yes, I lied: I won't create the variable from scratch, but from a uniform distribution.

    The method presented here is really easy to implement, and this is why I think it is a really good one. Besides, the result is far from trivial and really unexpected. It is called the Box-Muller method. You can find the proof of this method here. The proof is not very complicated, but you will need some mathematical background to understand it.

    Let u and v be two independent variables, uniformly distributed on [0 ; 1]. Then we can define:

    x = sqrt(-2 * log(u)) * cos(2 * Pi * v)
    y = sqrt(-2 * log(u)) * sin(2 * Pi * v)

    x and y are two independent, normally distributed variables. The interest of this method is its extreme simplicity in terms of programming (we only need a few lines if we neither test the normality of the new variables nor plot the estimation of their density).

    We obtain a vector of normally distributed variables. The Lilliefors test does not reject the null hypothesis of a normal distribution. Besides, we can plot the estimation of the density of the variables. We obtain the following plot, which indeed looks similar to the Gaussian density.



    The program (R):

    # import the library to test the normality of the distribution
    library(nortest)

    size = 100000

    u = runif(size)
    v = runif(size)

    # Box-Muller transform (vectorized, no loop needed in R)
    x = sqrt(-2*log(u))*cos(2*pi*v)
    y = sqrt(-2*log(u))*sin(2*pi*v)

    #a test for normality
    lillie.test(c(x,y))

    #plot the estimation of the density
    plot(density(c(x,y)))
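
    As an extra visual check (a small addition using only base R), a normal Q-Q plot of the simulated values should lie close to the reference line:

    #quick visual check of normality
    qqnorm(c(x, y))
    qqline(c(x, y))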


