I am currently doing an internship in England, so I keep switching between French and English in my emails and other online communication. I have been surprised to see that some websites can tell whether I am writing in French or in English. For example, Facebook automatically offers to translate. I was really amazed by this ability: how can a computer know which language I am using? It is all the more surprising because I use a QWERTY keyboard, which has no accent keys, so I never type accents and my French is not written properly.
This reminded me of some data analysis and computer science courses, where the problem was to determine which group an individual belongs to.
The problem:
Given a text, how can I make my computer decide whether it is an
English text or a French text without accents? We consider unaccented text
because accents would make the task far too easy. In particular, how can I
do this without access to French and English dictionaries?
The model:
We assume we have a sample of French and English texts which
will be used as benchmarks. We call frenchText(i) the i-th French text we have,
and englishText(j) the j-th English text we have. The aim is to determine if
the text called TEXT is French or English.
For every text, we compute the proportion of each letter. We only take into account the 26 letters of the alphabet in order to keep
the program as simple as possible. Then we average these proportions over all the
French texts and over all the English texts. In this way we obtain the typical
proportion of each letter in an “average” French (or English) text.
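The counting and averaging step can be sketched in a few lines of R. This is only an illustration: the helper name letterProportions and the two toy sentences are assumptions made for the example, not part of the program given at the end of the post.

```r
# Sketch: proportion of each of the 26 letters in a text (case-insensitive).
# letterProportions is a hypothetical helper used only for this illustration.
letterProportions <- function(text) {
  chars <- strsplit(tolower(text), "")[[1]]
  # Counting against a fixed set of levels keeps absent letters at 0
  counts <- table(factor(chars, levels = letters))
  total <- sum(counts)
  if (total == 0) return(setNames(rep(0, 26), letters))
  setNames(as.numeric(counts) / total, letters)
}

# Two toy unaccented French samples, one row of proportions per text
frenchSamples <- rbind(
  letterProportions("le chat est sur la table"),
  letterProportions("je ne sais pas ou il est")
)

# Column means give the "average" French letter profile
averageF <- colMeans(frenchSamples)
```

Spaces, digits and punctuation are ignored because they are not among the 26 levels, so each profile sums to 1 as long as the text contains at least one letter.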
Then, we compute the proportion of each letter in TEXT and
the Euclidean distance between TEXT and the
average French text (we call this distance d(TEXT, averageF)) and between TEXT
and the average English text (we call this distance d(TEXT, averageE)). If
d(TEXT, averageF) is greater than d(TEXT, averageE), we consider that TEXT is
more similar to the English texts and is therefore most likely English.
Conversely, if d(TEXT, averageF) < d(TEXT, averageE), we consider the text
to be French. (In the code below, a decision is only made when one distance
exceeds the other by more than 10%.)
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
The Euclidean distance between two vectors p and q of dimension n.
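As a small illustration of the decision rule, here is a sketch in R. The three-component profiles are made-up numbers standing in for the 26 letter proportions, and the function name euclidean is an assumption for the example.

```r
# Euclidean distance between two vectors of the same length
euclidean <- function(p, q) sqrt(sum((p - q)^2))

# Made-up 3-letter profiles standing in for the 26-letter ones
averageF    <- c(0.50, 0.30, 0.20)
averageE    <- c(0.30, 0.30, 0.40)
textProfile <- c(0.45, 0.30, 0.25)

dF <- euclidean(textProfile, averageF)
dE <- euclidean(textProfile, averageE)

# The text is assigned to the language whose average profile is closer
guess <- if (dF < dE) "French" else "English"
```

Here the profile is closer to the French average, so guess is "French".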
What is interesting is that, once a first TEXT has been classified,
we can add it to the calculation of the average letter proportions
of the French texts if it has been classified as French
(resp. of the English texts if it has been classified as English).
We can then use this more accurate average to determine the language
of the next text. This is why it is called statistical learning. The more texts we
process, the more accurate the averages become, provided the earlier decisions were right.
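The update step can be sketched as follows. The profiles are again made-up three-component vectors, and addToGroup is a hypothetical helper name used only here.

```r
# Hypothetical helper: append a newly classified profile to its reference group
addToGroup <- function(group, profile) rbind(group, profile)

# Reference group: one row per French text (made-up 3-letter profiles)
frenchGroup <- rbind(c(0.50, 0.30, 0.20),
                     c(0.40, 0.40, 0.20))
before <- colMeans(frenchGroup)

# A new text that has just been classified as French
newText <- c(0.48, 0.30, 0.22)
frenchGroup <- addToGroup(frenchGroup, newText)

# The average profile now reflects three texts instead of two
after <- colMeans(frenchGroup)
```

Each classified text shifts the average a little, which is also why an early mistake can pull the reference profile in the wrong direction.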
It would be interesting to use a weighted average, so that each text
counts according to the
probability of the event [TEXT is French] (resp. [TEXT is English]).
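One possible way to do this is sketched below. The confidence formula dE / (dF + dE) is only an assumption chosen for the illustration (it is not a rigorous probability), and the profiles and distances are made-up values.

```r
# Hypothetical confidence that a text is French, derived from the two distances:
# it is close to 1 when the text is much nearer to the French average.
# It is undefined (NaN) when both distances are 0.
confidenceFrench <- function(dF, dE) dE / (dF + dE)

# Reference group (made-up 3-letter profiles); benchmark texts get full weight
frenchGroup <- rbind(c(0.50, 0.30, 0.20),
                     c(0.40, 0.40, 0.20))
weights <- c(1, 1)

# A new text classified as French, with made-up distances 0.1 and 0.3
newText <- c(0.48, 0.30, 0.22)
w <- confidenceFrench(dF = 0.1, dE = 0.3)

frenchGroup <- rbind(frenchGroup, newText)
weights <- c(weights, w)

# Weighted average profile: uncertain texts move the average less
averageF <- apply(frenchGroup, 2, weighted.mean, w = weights)
```

With such a scheme, a text that was classified with little confidence would have less influence on later decisions than the benchmark texts.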
The results:
For the initialization, I used two French texts and two
English texts. Each text is short (about 100 words).
I then tested the program on different French and
English texts. The results are consistent with the actual languages of the texts.
If you want to run this program, you will need the .txt
files. You can find them on my Google page: https://sites.google.com/site/probaperceptionstock/,
in a ZIP file called textPost5. You can unzip it and save the files in any folder on your computer.
Warnings:
This method must be used carefully: if the first texts are misclassified, the following results are likely to be inconsistent, because wrong letter proportions in the averages affect the classification of every later text.
Besides, we use the Euclidean distance to keep the presentation clean and simple. However, it is far from being the best distance for such a problem, and other data analysis methods may well be more relevant. For example, as I explain in a previous post (http://probaperception.blogspot.co.uk/2012/09/v-behaviorurldefaultvmlo.html), the Mahalanobis distance could be interesting in this context.
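To give an idea of what this could look like, here is a sketch using the mahalanobis function from R's stats package, which returns the squared distance. With 26 letters and only a few texts the full covariance matrix cannot be inverted, so this sketch assumes a diagonal covariance (per-letter variances plus a small constant). That simplification, like the made-up three-component profiles, is an assumption of the example.

```r
# Reference group: one row per text (made-up 3-letter profiles)
group <- rbind(c(0.50, 0.30, 0.20),
               c(0.40, 0.35, 0.25),
               c(0.45, 0.25, 0.30))

center <- colMeans(group)

# Assumption: diagonal covariance, with a small constant to avoid zero variances
vars <- apply(group, 2, var) + 1e-6
covDiag <- diag(vars)

x <- c(0.42, 0.33, 0.25)

# mahalanobis() returns the squared distance, hence the square root
d <- sqrt(mahalanobis(x, center, covDiag))
```

Compared with the Euclidean distance, a deviation on a letter whose proportion varies little between texts counts for more than the same deviation on a letter whose proportion varies a lot.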
The code (R):
setwd("U:/Blog/Post5")
#you will have to change this directory according to your own folder
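#count of one letter in the table, adding lower and upper case
#a case that does not appear in the table is NA and is ignored
countLetter = function(lowerCase, upperCase, myTable){
  if(is.na(myTable[lowerCase] + myTable[upperCase])){
    if(is.na(myTable[lowerCase])){return(myTable[upperCase])}
    else{return(myTable[lowerCase])}
  }
  else{return(myTable[lowerCase] + myTable[upperCase])}
}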
#proportion of each of the 26 letters in a text (upper and lower case merged)
proportion = function(myText){
  myTextSplit = strsplit(myText, NULL)
  letterTable = table(myTextSplit)
  #counts for a to z; a letter that is absent from the text gives NA
  counts = sapply(1:26, function(k) countLetter(letters[k], LETTERS[k], letterTable))
  total = sum(counts, na.rm = TRUE)
  props = as.numeric(counts)/total
  #absent letters get a proportion of 0
  props[which(is.na(props))] = 0
  return(props)
}
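#Euclidean distance between a text and the average of a group of texts
#myGroup is a matrix with one row of letter proportions per text
distance = function(myVector, myGroup){
  groupMean = apply(myGroup, 2, mean)
  dist = 0
  for (i in 1:length(myVector)){
    dist = dist + (groupMean[i] - myVector[i])**2
  }
  return(sqrt(dist))
}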
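#classify a text and add it to the matching reference group
#a decision is made only if one distance exceeds the other by more than 10%
#otherwise nothing is printed and both groups are returned unchanged
choice = function(myVector, mypropF, mypropE){
  if(distance(myVector, mypropF) > 1.1*distance(myVector, mypropE)){
    cat("the text is certainly English\n")
    mypropE = rbind(mypropE, myVector)
  }
  else if(distance(myVector, mypropE) > 1.1*distance(myVector, mypropF)){
    cat("the text is certainly French\n")
    mypropF = rbind(mypropF, myVector)
  }
  #return the (possibly updated) French and English groups
  x = vector("list", 2)
  x[[1]] = mypropF
  x[[2]] = mypropE
  return(x)
}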
fileName = 'tf1.txt'
tf1 = readChar(fileName, file.info(fileName)$size)
fileName = 'tf2.txt'
tf2 = readChar(fileName, file.info(fileName)$size)
fileName = 'te1.txt'
te1 = readChar(fileName, file.info(fileName)$size)
fileName = 'te2.txt'
te2 = readChar(fileName, file.info(fileName)$size)
#reference groups: one row of letter proportions per benchmark text
propF = rbind(proportion(tf1), proportion(tf2))
propE = rbind(proportion(te1), proportion(te2))
fileName = 'tm1.txt'
tm1 = readChar(fileName, file.info(fileName)$size)
propM1 = proportion(tm1)
#French Text
fileName = 'tm2.txt'
tm2 = readChar(fileName, file.info(fileName)$size)
propM2 = proportion(tm2)
#French Text
fileName = 'tm3.txt'
tm3 = readChar(fileName, file.info(fileName)$size)
propM3 = proportion(tm3)
#French Text
fileName = 'tm4.txt'
tm4 = readChar(fileName, file.info(fileName)$size)
propM4 = proportion(tm4)
#English Text
fileName = 'tm5.txt'
tm5 = readChar(fileName, file.info(fileName)$size)
propM5 = proportion(tm5)
#English Text
fileName = 'tm6.txt'
tm6 = readChar(fileName, file.info(fileName)$size)
propM6 = proportion(tm6)
#English Text
list = choice(propM1, propF, propE)
propF = list[[1]]
propE = list[[2]]
list = choice(propM2, propF, propE)
propF = list[[1]]
propE = list[[2]]
list = choice(propM3, propF, propE)
propF = list[[1]]
propE = list[[2]]
list = choice(propM4, propF, propE)
propF = list[[1]]
propE = list[[2]]
list = choice(propM5, propF, propE)
propF = list[[1]]
propE = list[[2]]
list = choice(propM6, propF, propE)
propF = list[[1]]
propE = list[[2]]