4. The Language Frequencies Utility

Ganzúa includes files with the standard relative frequencies of some languages, but probably not for the language and/or alphabet you want to use. The language frequencies utility langFreq.jar can get the standard relative frequencies of a language from a text file for an arbitrary alphabet. It is a command line utility, which means that it does not provide a graphical user interface and in order to use it you'll need a terminal^4.1.

For the relative frequencies to be representative of the language, the text files you get them from should be as large as possible. The language frequencies included with Ganzúa were obtained from novels downloaded from Project Gutenberg's site (http://www.promo.net/pg/). Specifically, the statistics for the English language were obtained from David Copperfield by Charles Dickens and those of the Spanish language from El Ingenioso Hidalgo Don Quijote de la Mancha by Miguel de Cervantes Saavedra. Unlike Ganzúa, which is not meant to be used with large documents, this utility can handle big text files.
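To make the term concrete, here is a minimal Python sketch of what a relative frequency is: the count of each alphabet character divided by the total count of alphabet characters. It is only an illustration. The function name relative_frequencies and the choice of denominator are assumptions, not a description of how langFreq.jar computes its values.

```python
from collections import Counter

def relative_frequencies(text, alphabet):
    """Return {char: count / total}, counting only characters of `alphabet`."""
    counts = Counter(ch for ch in text if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        # No alphabet character appears in the text at all.
        return {ch: 0.0 for ch in alphabet}
    return {ch: counts[ch] / total for ch in alphabet}

# Tiny sample; a real source text should be far larger to be representative.
sample = "ABBA CAB"
freqs = relative_frequencies(sample, "ABC")
```

With a sample this small the values say little about any language, which is why a long text such as a novel is a better source.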

The information about the text file and the alphabet you want to use should be placed in an XML file like the ones in Ganzúa's examples/alphabetRules/en directory. These XML documents are instances of the AlphabetRules.xsd schema. They are called alphabet rules because they specify the characters to be included in the alphabet and how those that will not be included should be handled.

If you find the next section difficult to follow or would like to know a bit about XML before you start writing your own XML files, skim through one of the many XML tutorials available on the Internet, like the one at http://www.w3schools.com.

4.1 Alphabet Rules XML Files

Open one of the example alphabet rules files (examples/alphabetRules/en) with your favorite text editor (or XML editor) so you can see a complete example as the different sections of this kind of XML file are explained.

<?xml version="1.0" encoding="UTF-8"?>

The first line must indicate the encoding used by the XML file. If you don't know which encoding you are using, it is probably your system's default. You can use Ganzúa to find out what your platform's default encoding is (see Open Ciphertext in section 3.2.1).

<alphabetRules xmlns="http://ganzua.sourceforge.net/rules" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://ganzua.sourceforge.net/rules ../schemata/AlphabetRules.xsd" language="en" country="GB" source="/home/user/cprfd10.txt" sourceEncoding="ISO-8859-15">

The next few lines include the opening tag of the alphabetRules element and its attributes. The only attributes you should modify^4.2 are those that specify characteristics of the text file you'll get the relative frequencies from:

language: two character ISO 639 code of the language the text file is in.
country: two character ISO 3166 code of the country the text in the file originated from. This attribute is used to identify the differences in the use of a language by different countries, for example the English spoken in the United Kingdom and the United States of America.
This attribute is not required, so you may omit it entirely.
source: the text file's full path.
sourceEncoding: the encoding used in the text file.
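Before running the utility, it can help to confirm that these four attributes hold the values you intended. The following Python sketch reads them with the standard library's xml.etree.ElementTree. The helper name read_source_attributes is hypothetical, and the sample document is trimmed to the opening and closing tags for illustration, so it is not a complete alphabet rules file.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute of the alphabetRules element.
RULES_NS = "http://ganzua.sourceforge.net/rules"
SOURCE_ATTRIBUTES = ("language", "country", "source", "sourceEncoding")

def read_source_attributes(xml_bytes):
    """Return the attributes that describe the source text file."""
    root = ET.fromstring(xml_bytes)
    expected_tag = "{%s}alphabetRules" % RULES_NS
    if root.tag != expected_tag:
        raise ValueError("root element is %r, expected %r" % (root.tag, expected_tag))
    # root.get() returns None for an omitted attribute such as country.
    return {name: root.get(name) for name in SOURCE_ATTRIBUTES}

sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<alphabetRules xmlns="http://ganzua.sourceforge.net/rules"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://ganzua.sourceforge.net/rules ../schemata/AlphabetRules.xsd"
    language="en" country="GB"
    source="/home/user/cprfd10.txt" sourceEncoding="ISO-8859-15">
</alphabetRules>"""

attributes = read_source_attributes(sample)
```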

The next elements specify the characters that make up the alphabet. This can be done in two different ways:

List all of the characters in the alphabet
Specify a set of characters to include (even if they do not appear in the text file) and a set of characters to ignore.
Any character in the text file that is not among the ignored characters will be considered part of the alphabet.

If you want to simply list all of the characters you want in the alphabet, use the includeExclusively element.

<includeExclusively> <character char="A" /> <character char="B" /> <character char="C" /> <character char="D" /> </includeExclusively>

Each character element specifies a character to be included in the alphabet in its char attribute. Any character in the text file that is not in the includeExclusively element will be ignored when the program gets the relative frequencies. GB_lowercase.xml and GB_uppercase.xml are examples of alphabet rules files that use this tag.

Remember: New line, space and control characters are considered special by Ganzúa and may not be part of the cipher or plain alphabets. Any other character is valid.
Do not put special characters inside includeExclusively.
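A quick check like the following Python sketch can flag special characters before you list them in an alphabet. It approximates "new line, space and control characters" as anything for which str.isspace() is true or whose Unicode general category is Cc. That approximation is an assumption and may not match Ganzúa's own definition exactly. The function names are hypothetical.

```python
import unicodedata

def is_special(ch):
    """Approximate test for new line, space and control characters."""
    return ch.isspace() or unicodedata.category(ch) == "Cc"

def special_characters(alphabet):
    """Return the characters of `alphabet` that should not be in it."""
    return [ch for ch in alphabet if is_special(ch)]

# The space and the line feed are reported; the letters are not.
offending = special_characters("AB C\n")
```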

To let langFreq.jar add characters to the cipher alphabet as it finds them in the text file, use the include and ignore elements. In the following example, some characters are specified using references to Unicode Standard characters (e.g. &#10; for the line feed character) or entities (e.g. &quot; for the character "). The entities used to reference Unicode characters are of the form &#NUM; where NUM is the decimal number (not hexadecimal) of the character in the Unicode charts. If you wish to learn more about Unicode, visit http://www.unicode.org.
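If you are unsure of the decimal number for a character, Python's built-in ord() gives it directly, and html.unescape() from the standard library decodes a reference back to the character. The helper name char_reference is hypothetical. This is only a convenience sketch for writing the &#NUM; form described above.

```python
import html

def char_reference(ch):
    """Return the decimal character reference (&#NUM;) for one character."""
    return "&#%d;" % ord(ch)

# Characters that are awkward to type literally in an XML attribute.
line_feed_ref = char_reference("\n")
tab_ref = char_reference("\t")
carriage_return_ref = char_reference("\r")
space_ref = char_reference(" ")

# Decoding goes the other way, for both numeric references and named entities.
decoded_line_feed = html.unescape("&#10;")
decoded_quote = html.unescape("&quot;")
```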

All of the characters in the include tag will be in the alphabet even if they do not appear in the text file.

The characters in the ignore element will not be added to the alphabet even if they appear in the text file.

Important: Do not put characters that appear in the include element inside ignore. Doing so will make those characters appear in the alphabet but be ignored when the relative frequencies are obtained. They will have a frequency of 0 and will not appear in bigrams or trigrams.

Since new line, space and control characters are considered special by Ganzúa, they should not appear in the include element and should always be ignored. In the example above the space, tab, line feed, carriage return and new line characters are ignored. If you use the include tag, at the very least these characters should appear in the ignore element.

GB_lowerIgnr.xml and GB_uppeIgnr.xml are examples of alphabet rules files that use the include and ignore tags.
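The interaction between include and ignore can be sketched in Python as follows. This models the behaviour described in this section, not langFreq.jar's actual code: the function names and the ordering of the resulting alphabet are assumptions made for illustration.

```python
from collections import Counter

def build_alphabet(text, include, ignore):
    """Included characters, plus every non-ignored character found in text."""
    alphabet = list(dict.fromkeys(include))  # keep order, drop duplicates
    seen = set(alphabet)
    for ch in text:
        if ch not in ignore and ch not in seen:
            alphabet.append(ch)
            seen.add(ch)
    return alphabet

def count_occurrences(text, alphabet, ignore):
    """Count alphabet characters, skipping anything listed in ignore."""
    counts = Counter(ch for ch in text if ch not in ignore)
    return {ch: counts[ch] for ch in alphabet}

text = "AB AZ C\n"
include = "AQZ"            # Q never appears in the text; Z is also ignored
ignore = " \t\n\rZ"        # special characters, plus the conflicting Z

alphabet = build_alphabet(text, include, ignore)
counts = count_occurrences(text, alphabet, ignore)
```

Here Q is in the alphabet although the text never uses it, and Z is in the alphabet with a count of 0 because it is listed in both include and ignore, which is the situation the warning above tells you to avoid.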

The next elements tell langFreq.jar to handle occurrences of a given character as a different character. The utility does not do this automatically. If your alphabet contains uppercase characters exclusively (as in an includeExclusively element), only those characters will be considered and any occurrence of a lowercase character will be handled as if it did not exist. That is, unless the replace element specifies that the lowercase character should be handled like a character in the alphabet.

<replace> <occurrences ofChar="a" byChar="A" /> <occurrences ofChar="b" byChar="B" /> <occurrences ofChar="c" byChar="C" /> <occurrences ofChar="d" byChar="D" /> </replace>

This way you could make the utility consider the character Ñ as N, etc.

Note that if replace maps a character to one that is ignored, the occurrences of that character will be ignored as well. You should also remember that langFreq.jar does not replace the characters recursively. If the following is inside a replace element:

<occurrences ofChar="a" byChar="A" /> <occurrences ofChar="A" byChar="X" />

The occurrences of the character a in the text file will count as A, and those of A as X, but a will not count as X.
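The non-recursive behaviour amounts to looking each character up exactly once. A minimal Python sketch, assuming the rules are held in a dictionary that maps ofChar to byChar (the function name apply_replacements is hypothetical):

```python
def apply_replacements(text, rules):
    """Replace each character once; the result is never looked up again."""
    return "".join(rules.get(ch, ch) for ch in text)

# Same two rules as above: a counts as A, and A counts as X.
rules = {"a": "A", "A": "X"}
result = apply_replacements("aAb", rules)
```

The a becomes A and stays A, the original A becomes X, and b is left untouched because no rule mentions it.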

All of the examples of alphabet rules files provided with Ganzúa use the replace element.

4.2 Using langFreq.jar

Once you have a text file to get relative frequencies from and an alphabet rules XML file for it, you'll be able to use langFreq.jar.

As mentioned earlier, langFreq.jar is a command line utility. Open a command line terminal and change to the directory that contains your alphabet rules file. Now use the command:

java -jar GANZÚA_HOME/langFreq.jar alphabetRules.xml

Here GANZÚA_HOME is the directory that contains langFreq.jar and alphabetRules.xml is the name of your alphabet rules XML file. langFreq.jar will parse your XML file and report any errors it finds in its construction. If there are none, it will generate a language frequencies XML file and then report the directory the file was written to and its name. By default the utility tries to place the new language frequencies file in Ganzúa's language frequencies directory, but if you are not allowed to write to that directory, the file is placed in your home directory. The file is named using the two character code of the language and a number.

If you want to specify the directory and name of the file langFreq.jar should write to, use the -o option. For example:

java -jar GANZÚA_HOME/langFreq.jar -o frequencies.xml alphabetRules.xml

This will make langFreq.jar put the language frequencies in the file frequencies.xml in the current directory.
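If you process several alphabet rules files, you may prefer to assemble the command in a script. The Python sketch below only builds the argument list for the two invocations shown above. The function name langfreq_command and the directory /opt/ganzua are hypothetical, and the list would still have to be passed to something like subprocess.run to actually start the utility.

```python
import os

def langfreq_command(ganzua_home, rules_file, output=None):
    """Build the argument list for running langFreq.jar."""
    command = ["java", "-jar", os.path.join(ganzua_home, "langFreq.jar")]
    if output is not None:
        # -o names the language frequencies file to write.
        command += ["-o", output]
    command.append(rules_file)
    return command

# /opt/ganzua stands in for GANZÚA_HOME; use your own installation directory.
default_run = langfreq_command("/opt/ganzua", "alphabetRules.xml")
named_run = langfreq_command("/opt/ganzua", "alphabetRules.xml", "frequencies.xml")
```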

If you make new language frequencies files, please consider donating them to the Ganzúa project, especially if they are for languages for which none are provided. Chapter 5 explains how you may contribute to Ganzúa.

Footnotes

... terminal ^4.1: In Mac OS X the terminal can be found in Applications/Utilities.
... modify ^4.2: Change the value in quotation marks, but keep the marks.

Jesús Adolfo García Pasquel 2004-10-04