Last edited 15 December 2010 by sanoica1
Find this document at:

Leve Roots: A history of Indo-European languages using Levenshtein distance

Abstract

This project will apply the Levenshtein Distance Algorithm to calculate the minimum edit distance between four words. The words are semantically the same but each word is representative of four distinct languages. The resultant distances will be used to compare how close one language is from another.

Background

Edit distance is defined as the number of operations required to transform one string into another. Operations include substitutions, insertions, deletions, and, in some cases, swaps. Substitutions involve replacing one character for another. Insertions/deletions add or subtract characters, and swaps switch the position of two adjacent characters. For example, the distance between "golf" and "glove" could be 3: g olf and glove; gol f and g-lo-ve; and g-ol-f and g-lo-ve. It is also possible to say the edit distance between golf and glove is 4, by substituting l-o, o-l, f-v, and inserting an e. The actual total value can vary depending on the assigned value of each operation. A program can assign one value to insertions/deletions, and a higher value to swaps and substitutions, for example.

The Levenshtein Edit Distance Algorithm calculates the minimum number of operations necessary to transform one string into another by displaying all possible combinations in a matrix. Because of the algorithm’s string flexibility, words do not need to have an identical number of letters in each word. The matrix is set up by assigning two strings: a source string (s) and a target string (t). The source string and the target string index the rows and columns of the two-dimensional array. In the uppermost, left-hand corner cell, the value is zero. The algorithm travels diagonally and throughout the matrix, and the bottom right-hand corner cell displays the minimum edit distance between the source string and target string. Each cell value is determined by the values of the immediate surrounding cell.

The top horizontal row is always an increasing total of insertions. The cell below the first letter has a value of 1, the second has 2... until value n, the total number of characters in the source string. These are also the cost of insertions. A similar pattern for the first column exists for the vertical target string. To fill in a cell, take the minimum value of either the value of the cell directly above plus the value of a deletion, the value of the diagonal cell plus the value of a substitution (if necessary), or the value of the left cell plus the value of an insertion.

A good demo of the Levenshtein Algorithm was developed by Peter Kleiweg. You can find it at http://odur.let.rug.nl/%7Ekleiweg/lev/

In linguistics, languages are categorized into language families. This project will specifically compare words from Indo-European languages. Indo-European languages include Italic, Celtic, Germanic, Slavic, Baltic, Hellenic, Albanian, Armenian, and Indo-Iranian languages. The Italic languages are no longer spoken, but the resultant Romance languages descendent of the Italic languages have a strong presence in the world today. The major Romance languages include Italian, French, Spanish, Portuguese, Catalan, and Romanian.

Application

This project will graphically demonstrate the relationship of words from language families by mapping the Levenshtein distances between each word on a 3D plane. Insertions, deletions, and substitutions will all maintain a value of 1. The smaller the edit distance, the closer the two languages, and thus the closer the two vertices, will be to each other. Similarly, the larger the edit distance, the further apart the two languages (two vertexes) will be. Closer languages will be in similar language families, while more distant languages will be from different language families.