Ganzúa includes files with the standard relative frequencies of some languages, but probably not for the language or alphabet you want to use. The language frequencies utility langFreq.jar can compute the standard relative frequencies of a language from a text file for an arbitrary alphabet. It is a command line utility: it does not provide a graphical user interface, so you'll need a terminal to use it.
For the relative frequencies to be representative of the language, the text files you get them from should be as large as possible. The language frequencies included with Ganzúa were obtained from novels downloaded from Project Gutenberg's site (http://www.promo.net/pg/ ). Specifically, the statistics for the English language were obtained from David Copperfield by Charles Dickens and those of the Spanish language from El Ingenioso Hidalgo Don Quijote de la Mancha by Miguel de Cervantes Saavedra. Unlike Ganzúa, which is not meant to be used with large documents, this utility can handle big text files.
The information about the text file and the alphabet you want to use should be placed in an XML file like the ones in Ganzúa's examples/alphabetRules/en directory. These XML documents are instances of the AlphabetRules.xsd schema. They are called alphabet rules because they specify the characters to be included in the alphabet and how those that will not be included should be handled.
If you find the next section difficult to follow or would like to know a bit about XML before you start writing your own XML files, skim through one of the many XML tutorials available on the Internet, like the one at http://www.w3schools.com.
Open one of the example alphabet rules files (examples/alphabetRules/en) with your favorite text editor (or XML editor) so you can see a complete example as the different sections of this kind of XML file are explained.
The first line must indicate the encoding used by the XML file. If you don't know which encoding you are using, it is probably your system's default. You can use Ganzúa to find out your platform's default encoding (see Open Ciphertext in section 3.2.1).
The next few lines include the opening tag of the alphabetRules element and its attributes. The only attributes you should modify are those that specify characteristics of the text file you'll get the relative frequencies from.
The next elements specify the characters that make up the alphabet. This can be done in two different ways:
If you want to simply list all of the characters you want in the alphabet, use the includeExclusively element.
Each character element specifies a character to be included in the alphabet in its char attribute. Any character in the text file that is not in the includeExclusively element will be ignored when the program gets the relative frequencies. GB_lowercase.xml and GB_uppercase.xml are examples of alphabet rules files that use this tag.
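The effect of an exclusive alphabet on the counts can be sketched in Python. This is only an illustration and not how langFreq.jar is implemented: the function name `relative_frequencies` is made up, and dividing by the number of counted characters (rather than by all characters in the file) is an assumption.

```python
from collections import Counter


def relative_frequencies(text, alphabet):
    """Return {char: relative frequency}, counting only characters in `alphabet`.

    Hypothetical helper for illustration; characters outside the
    alphabet are skipped as if they were not in the text at all.
    """
    counts = Counter(ch for ch in text if ch in alphabet)
    total = sum(counts.values())
    # Every alphabet character gets an entry, even if it never occurs.
    return {ch: (counts[ch] / total if total else 0.0) for ch in alphabet}


# The comma, space and exclamation mark are not in the alphabet,
# so only the seven letters are counted.
freqs = relative_frequencies("abba, cab!", "abc")
```

Here `freqs["a"]` is 3/7, because three of the seven counted letters are `a`.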
To let langFreq.jar add characters to the cipher alphabet as it finds them in the text file, use the include and ignore elements. In the following example, some characters are specified using references to Unicode Standard characters (e.g. &#10; for the line feed character) or entities (e.g. &quot; for the character "). The entities used to reference Unicode characters are of the form &#NUM; where NUM is the decimal number (not hexadecimal) of the character in the Unicode charts. If you wish to learn more about Unicode, visit http://www.unicode.org.
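To find the decimal number NUM for a character without consulting the charts, a small Python sketch such as the following may help. It is not part of Ganzúa, and the helper name `char_reference` is made up for illustration.

```python
import html


def char_reference(ch):
    """Build a decimal reference of the form &#NUM; for one character."""
    # ord() returns the Unicode code point as a decimal integer.
    return f"&#{ord(ch)};"


line_feed_ref = char_reference("\n")   # "&#10;"
enye_ref = char_reference("Ñ")         # "&#209;"

# html.unescape() goes the other way and also understands named
# entities such as &quot;, which makes it handy for checking a reference.
decoded_line_feed = html.unescape("&#10;")
decoded_quote = html.unescape("&quot;")
```

The Unicode charts list code points in hexadecimal (U+000A, U+00D1), so converting to decimal this way avoids writing `&#0A;` by mistake.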
All of the characters in the include tag will be in the alphabet even if they do not appear in the text file.
The characters in the ignore element will not be added to the alphabet even if they appear in the text file.
Since new line, space and control characters are considered special by Ganzúa, they should not appear in the include element and should always be ignored. In the example above the space, tab, line feed, carriage return and new line characters are ignored. If you use the include tag, at the very least these characters should appear in the ignore element.
GB_lowerIgnr.xml and GB_uppeIgnr.xml are examples of alphabet rules files that use the include and ignore tags.
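The interplay of include and ignore described above can be sketched as follows. This is a Python illustration under stated assumptions, not the utility's actual code: `build_alphabet` is a made-up name, and the order in which characters end up in the alphabet is a guess.

```python
def build_alphabet(text, include, ignore):
    """Grow an alphabet from `text`, honouring include and ignore lists.

    Hypothetical helper: included characters are always present,
    ignored characters are never added, and every other character is
    added the first time it is seen.
    """
    alphabet = list(include)   # present even if absent from the text
    seen = set(alphabet)
    skipped = set(ignore)
    for ch in text:
        if ch in skipped or ch in seen:
            continue
        alphabet.append(ch)
        seen.add(ch)
    return alphabet


# Space, tab, line feed and carriage return are ignored, as recommended.
alphabet = build_alphabet("hi ho\n", include="z", ignore=" \t\n\r")
```

The result is `['z', 'h', 'i', 'o']`: `z` is kept although it never occurs, and the space and line feed never enter the alphabet.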
The next elements tell langFreq.jar to handle occurrences of a given character as a different character. The utility does not do this automatically. If your alphabet contains uppercase characters exclusively (as in an includeExclusively element), only those characters will be considered and any occurrence of a lowercase character will be handled as if it did not exist. That is, unless the replace element specifies that the lowercase character should be handled like a character in the alphabet.
This way you could make the utility consider the character Ñ as N, etc.
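A rough Python sketch of how such replacements change the counts for an uppercase-only alphabet is shown below. The function `count_with_replace` and the dictionary representation of the replace rules are assumptions made for illustration; they do not reflect the XML syntax or the utility's internals.

```python
from collections import Counter
import string


def count_with_replace(text, alphabet, replace):
    """Count alphabet characters after mapping each one through `replace`."""
    # Characters without a rule are left unchanged.
    mapped = (replace.get(ch, ch) for ch in text)
    return Counter(ch for ch in mapped if ch in alphabet)


alphabet = set(string.ascii_uppercase)

# Lowercase letters count as their uppercase forms; Ñ and ñ count as N.
replace = {lower: lower.upper() for lower in string.ascii_lowercase}
replace["Ñ"] = "N"
replace["ñ"] = "N"

counts = count_with_replace("Año NUEVO", alphabet, replace)
```

Without the replace rules, the `ñ` and `o` of "Año" would be dropped; with them, `counts["N"]` and `counts["O"]` are both 2.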
Note that if a replace element maps a character to one that is ignored, that character will be ignored as well. You should also remember that langFreq.jar does not replace the characters recursively. If the following is inside a replace element:
The occurrences of the character a in the text file will count as A, and those of A as X, but a will not count as X.
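The non-recursive behaviour can be mimicked in Python with a single lookup per character. This is a sketch of the described behaviour only; `apply_replacements` and the dictionary of rules are hypothetical names, not part of langFreq.jar.

```python
def apply_replacements(text, rules):
    """Apply each replacement rule at most once per character."""
    # The replaced character is not looked up again, so a -> A
    # does not go on to become X.
    return "".join(rules.get(ch, ch) for ch in text)


rules = {"a": "A", "A": "X"}
result = apply_replacements("aAb", rules)
```

Here `result` is `"AXb"`: the original `a` stops at `A`, while the original `A` becomes `X`.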
All of the examples of alphabet rules files provided with Ganzúa use the replace element.
Once you have a text file to get relative frequencies from and an alphabet rules XML file for it, you'll be able to use langFreq.jar.
As mentioned earlier, langFreq.jar is a command line utility. Open a command line terminal and change to the directory that contains your alphabet rules file. Now use the command:
In this command, GANZÚA_HOME is the directory that contains langFreq.jar and alphabetRules.xml is the name of your alphabet rules XML file. langFreq.jar will parse your XML file and report any errors it finds in its construction. If there are none, it will generate a language frequencies XML file and report the directory the file was written to and its name. By default the utility tries to place the new language frequencies file in Ganzúa's language frequencies directory; if you are not allowed to write to that directory, the file is placed in your home directory instead. The file is named using the two-character code of the language and a number.
If you want to specify the directory and name of the file langFreq.jar should write to, use the -o option. For example
This will make langFreq.jar put the language frequencies in the file frequencies.xml in the current directory.
If you make new language frequencies files, please consider donating them to the Ganzúa project, especially if they are for languages for which none are provided. Chapter 5 explains how you may contribute to Ganzúa.