I am currently doing an internship in England, so I keep alternating between French and English in my emails and other online communication. I was surprised to see that some websites can recognize whether I am writing in French or in English. For example, Facebook automatically offers to translate. I was really amazed by this ability: how can a computer know what language I am using? This is all the more surprising because I use a QWERTY keyboard, which has no accented keys, so I never type accents and my French is not properly written.

It reminded me of some data analysis and computer science courses where the problem was to determine which group an individual belongs to.


The problem:

Given a text, how can I make my computer tell whether it is an English text or a French text without accents? We deliberately ignore accents, since they would make the task far too easy. In particular, how can I do this without access to French and English dictionaries?

The model:

We assume we have a sample of French and English texts which will be used as benchmarks. We call frenchText(i) the i-th French text we have, and englishText(j) the j-th English text we have. The aim is to determine if the text called TEXT is French or English.

For every text, we compute the proportion of each letter. We only take into account the 26 letters of the alphabet in order to keep the program as simple as possible. Then we average these proportions over all the French texts and over all the English texts. This gives the typical proportion of each letter in an “average” French (or English) text.
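As a minimal sketch of this step, the snippet below computes letter proportions for two short texts and averages them. The helper name letterProportion and the two sample sentences are made up for illustration; they are not the benchmark texts used in this post.

```r
# Proportion of each of the 26 letters in a text, ignoring case and
# every character that is not a plain letter (spaces, digits, punctuation).
letterProportion <- function(text) {
  chars <- strsplit(tolower(text), NULL)[[1]]
  counts <- table(factor(chars, levels = letters))
  as.numeric(counts) / sum(counts)
}

# Two made-up French sentences standing in for frenchText(1) and frenchText(2).
frenchSamples <- c("le chat est sur la table",
                   "je ne sais pas ou il est alle")

# One row per text, one column per letter.
profiles <- t(sapply(frenchSamples, letterProportion))

# The "average" French text: mean proportion of each letter.
averageF <- colMeans(profiles)
names(averageF) <- letters
round(averageF[c("a", "e", "s")], 3)
```

Using factor(chars, levels = letters) guarantees a vector of length 26 even when some letters never appear in the text.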

Then, we compute the proportion of each letter in TEXT, and the Euclidean distance between TEXT and the average French text (we call this distance d(TEXT, averageF)) and between TEXT and the average English text (we call this distance d(TEXT, averageE)). If d(TEXT, averageF) > d(TEXT, averageE), TEXT is closer to the English texts and we consider it to be English. Conversely, if d(TEXT, averageF) < d(TEXT, averageE), we consider the text to be French. The code at the end of this post is slightly more cautious: it only decides when one distance exceeds the other by more than 10%.

The Euclidean distance between two vectors p and q of dimension n is $d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$.
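To make the distance and the decision rule concrete, here is a small sketch on three-letter profiles. The numbers are invented for illustration only (they are not measured letter frequencies), and the function name euclidean is an assumption of this example.

```r
# Euclidean distance between two numeric vectors of the same length.
euclidean <- function(p, q) sqrt(sum((p - q)^2))

# Made-up profiles restricted to three letters, for illustration only.
averageF <- c(e = 0.80, h = 0.15, w = 0.05)
averageE <- c(e = 0.60, h = 0.25, w = 0.15)
text     <- c(e = 0.65, h = 0.22, w = 0.13)

dF <- euclidean(text, averageF)  # distance to the average French text
dE <- euclidean(text, averageE)  # distance to the average English text

# The text is assigned to the language whose average is closest.
decision <- if (dF > dE) "English" else "French"
decision
```

With these invented numbers the text is closer to the English average, so the rule answers "English".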

What is interesting is that, once we have classified a first TEXT, we can add it to the calculation of the average letter proportions of the language it was assigned to (French or English). We can then use this updated average to determine the language of the next text. This is why it is called statistical learning: as long as the earlier decisions are correct, the more texts we process, the more reliable the averages become and the more likely the next decision is to be right.

It would be interesting to use a weighted average, where each classified text is weighted by the probability of the event [TEXT is French] (resp. [TEXT is English]).
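The update of the average can be sketched as a running mean. The function name updateAverage and the three-component vectors are assumptions of this example; the code at the end of the post obtains the same result by appending the new profile to a matrix and recomputing the column means.

```r
# Update the mean of n profiles after adding one new profile.
updateAverage <- function(average, n, newProfile) {
  (n * average + newProfile) / (n + 1)
}

# Made-up average of n = 2 earlier texts, on three letters only.
average <- c(0.5, 0.3, 0.2)
n <- 2
newProfile <- c(0.2, 0.6, 0.2)

updated <- updateAverage(average, n, newProfile)
updated
```

The running mean avoids storing every past profile, at the cost of having to keep track of n.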
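One possible way to sketch such a weighted average is shown below. The confidence formula is purely an assumption made for this example (the post does not specify how the probability would be estimated), and all the numbers are invented.

```r
# Hypothetical confidence that a text is French, built from the two distances:
# close to 1 when the text is much nearer to the French average.
# This formula is an assumption of the example, not a method from the post.
confidenceFrench <- function(dF, dE) dE / (dF + dE)

# Weighted mean of the rows of a matrix of profiles.
weightedAverage <- function(profiles, weights) {
  colSums(profiles * weights) / sum(weights)
}

# Two benchmark texts (weight 1) and one classified text (weight = confidence),
# on two letters only, with made-up values.
profiles <- rbind(c(0.5, 0.5),
                  c(0.3, 0.7),
                  c(0.1, 0.9))
weights <- c(1, 1, confidenceFrench(dF = 0.02, dE = 0.06))

averageF <- weightedAverage(profiles, weights)
averageF
```

A text classified with low confidence then moves the average less than a benchmark text does, which limits the damage of a wrong decision.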

The results:

For the initialization, I used two French texts and two English texts. Each text is short (about 100 words).

Then I tested the program on several French and English texts. The results are consistent with the actual languages of the texts.

If you want to run this program, you will need the .txt files. You can find them on my Google page: https://sites.google.com/site/probaperceptionstock/, in a ZIP file called textPost5. You can unzip it and save the files on your computer in any folder you want.

Warnings:

This method must be used carefully: if the first texts are misclassified, the later results are likely to be unreliable, because a wrong classification distorts the average letter proportions used for every subsequent decision.

Besides, we use the Euclidean distance in order to keep the presentation clean and simple. However, it is far from being the best distance for such a problem, and other data analysis methods may well be more relevant. For example, as I explain in a previous post (http://probaperception.blogspot.co.uk/2012/09/v-behaviorurldefaultvmlo.html), the Mahalanobis distance could be interesting in this context.
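As a rough sketch of what the Mahalanobis distance would look like here, the snippet below uses the mahalanobis() function from R's stats package on made-up two-letter profiles. Note that with 26 letters and only a handful of benchmark texts, the sample covariance matrix cannot be inverted, so this approach would need many more texts (or some form of regularization) to work on full profiles.

```r
# Made-up profiles of four texts of one language, on two letters only.
group <- rbind(c(0.10, 0.02),
               c(0.12, 0.03),
               c(0.11, 0.01),
               c(0.13, 0.04))

center <- colMeans(group)   # average profile of the group
covariance <- cov(group)    # sample covariance between the two letters

# Profile of the text to classify (also made up).
x <- c(0.12, 0.01)

# mahalanobis() returns the squared distance, hence the square root.
dM <- sqrt(mahalanobis(x, center, covariance))
dM
```

Unlike the Euclidean distance, this distance takes into account how much each letter proportion usually varies within the group and how the letters vary together.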

The code (R):


setwd("U:/Blog/Post5")
#you will have to change this directory according to your own folder

# Return the proportion of each of the 26 letters in myText.
# Upper and lower case are merged; every other character is ignored.
proportion = function(myText){
            myTextSplit = strsplit(tolower(myText), NULL)[[1]]
            # factor() with fixed levels gives a count (possibly 0) for every letter
            counts = table(factor(myTextSplit, levels = letters))
            total = sum(counts)
            # a text without any letter gets a vector of zeros
            if(total == 0){return(rep(0, 26))}
            return(as.numeric(counts)/total)
}


# Euclidean distance between myVector and the average of the rows of myGroup
distance = function(myVector, myGroup){
            groupMean = colMeans(myGroup)
            return(sqrt(sum((groupMean - myVector)^2)))
}

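# Classify myVector and add it to the benchmark of the language it is assigned to.
# A decision is only made when one distance exceeds the other by more than 10%.
# Returns a list: the updated French profiles, then the updated English profiles.
choice = function(myVector, mypropF, mypropE){
            dF = distance(myVector, mypropF)
            dE = distance(myVector, mypropE)
            if(dF > 1.1*dE){
                        cat("the text is certainly English")
                        mypropE = rbind(mypropE, myVector)
            }
            else if(dE > 1.1*dF){
                        cat("the text is certainly French")
                        mypropF = rbind(mypropF, myVector)
            }
            else{
                        # the two distances are too close: no decision, benchmarks unchanged
                        cat("the language of the text is undetermined")
            }
            return(list(mypropF, mypropE))
}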

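# Benchmark texts: two French (tf1, tf2) and two English (te1, te2)
fileName = 'tf1.txt'
tf1 = readChar(fileName, file.info(fileName)$size)
fileName = 'tf2.txt'
tf2 = readChar(fileName, file.info(fileName)$size)
fileName = 'te1.txt'
te1 = readChar(fileName, file.info(fileName)$size)
fileName = 'te2.txt'
te2 = readChar(fileName, file.info(fileName)$size)

# Initial benchmarks: one row of letter proportions per text
propF = rbind(proportion(tf1), proportion(tf2))
propE = rbind(proportion(te1), proportion(te2))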

# Texts to classify: tm1 to tm3 are French, tm4 to tm6 are English
testFiles = c('tm1.txt', 'tm2.txt', 'tm3.txt', 'tm4.txt', 'tm5.txt', 'tm6.txt')

for (fileName in testFiles){
            myText = readChar(fileName, file.info(fileName)$size)
            cat(fileName, ": ", sep = "")
            result = choice(proportion(myText), propF, propE)
            cat("\n")
            # keep the updated benchmarks for the next text
            propF = result[[1]]
            propE = result[[2]]
}


The financial market is not made up only of stock options. Other financial products let market participants pursue specific goals. For example, an oil buyer such as an airline may want to hedge against a rise in the price of oil. In this case it can buy on the financial market what is known as a "Call" or a "Call Option".

A Call Option is a contract between two counterparties (the airline and a financial actor). The buyer of the Call has the right, but not the obligation, to buy a certain quantity of a certain product (called the underlying) at a certain date (the maturity) for a certain price (the strike).

I found a gem of a website: the blog of Esteban Moro. He uses R to work on networks. In particular, he has written some really nice code to make great videos of networks. This post is purely a copy of his code; I just changed a few arguments to change the colors and to build my own network.

To create the network, I used the Barabási-Albert algorithm, which you can find at the end of the post on the different algorithms for networks. The library used is igraph.

As you have certainly noticed by now, I like working on artificial neural networks. I have written a few posts about models with neural networks (Models to generate networks, Want to win to Guess Who and Study of spatial segregation).

Unfortunately, I have so far missed a nice and pleasant aspect of networks: their graphical representation. Indeed, plots of neural networks are often really nice and really useful for understanding the network.

Sometimes such a graph can point out some characteristics of the network.

I have already talked about networks a few times in this blog. In particular, I used this approach to explain spatial segregation in a city and to solve the Guess Who? problem. However, one of the questions is how to generate a good network. Indeed, I aim to study strategies to split a network, but I first need to work with a realistic neural network. I could have downloaded data from a real network, but I would rather study the different models proposed to generate neural networks.

The function apply() is certainly one of the most useful functions. I was scared of it for a while and refused to use it. But it makes the code so much faster to write and so efficient that we can't afford not to use it. If, like me, you refuse to use apply() because it is scary, read the following lines: they will help you. Do you want to know how to use apply() in general, with a home-made function or with several parameters? Then have a look at the following examples.

Have you ever played the board game "Guess Who?"? For those who have not experienced childhood (because it might be the only reason not to know this board game), the game consists in trying to guess which character the opponent is thinking of among a list of characters. We will call the one he chooses the "chosen character". These characters have several characteristics such as gender, having brown hair or wearing glasses.

If you want to choose your next holiday destination randomly, you are likely to proceed in a way that is biased, especially if you choose the latitude and the longitude randomly. A bit like they do in this lovely advertisement (for those of you who do not speak French, it is about a couple who have won the national lottery and have to decide on their next trip. The husband randomly picks Australia and the wife complains: "Not again!").

My previous post is about a method to simulate a Brownian motion. A friend of mine emailed me yesterday to tell me that this is useless if we do not know how to simulate a normally distributed variable.

My first remark is: use the rnorm() function if the quality of your simulation is not too important (later, I'll try to explain why the R "default random generation" functions are not perfect). However, it may be fun to generate a normal distribution from a simple uniform distribution.

The Brownian motion is certainly the most famous stochastic process (a random variable evolving over time). It was the first way to model a stock option price (Louis Bachelier's thesis in 1900).

The reason is easy to understand: a Brownian motion is graphically very similar to the historical price of a stock option.

The merger of two insurance companies makes it possible to curb the probability of ruin by sharing the risk and the capital of the two companies.

For example, we can consider two insurance companies, A and B. A is a well-known insurance company with a large capital and is dealing with a risk with a low variance. We will assume that the global risk of all its customers follows a chi-square distribution with one degree of freedom.