Transforming the OYA

The first step was to extract a workable table from the OCRed version of the PDF. I copied the text of the Sezen PDF and pasted it into a word processor document. This was initially challenging, because the multiple lines of the tables did not conform to the column arrangement when imported into a spreadsheet. It took me a while to figure out a method, as I couldn't seem to parse the text by all caps or by the alphabet switch. Luckily, I realized that the font size of the Arabic-alphabet text differed. I used changes in font size to split the table into three columns: new alphabet name, old alphabet name, and a single field consisting of a long string describing the administrative classification of the place as well as the larger district containing it (Unvan ve Bağlı Olduğu Nahiye, Kaza, Sancak, Eyâlet veya Vilâyet). I used find and replace with the following regular expressions to create these tab-delimited lines:


1. replace <code>$</code> with <code>#</code> (to remove all line breaks)
2. delete <code>##[0-9]*</code> (to remove page numbers)
3. replace <code>[[:::CharHeight=11::]]</code> with <code>\n&</code> (to create a line break between each "new alphabet" administrative unit name, which are all given in 11pt font).
4. replace <code>[[:::CharHeight=12::]]</code> with <code>\t&\t</code> (to create tabs before and after every "old alphabet" administrative unit name, which are given in 12pt font).
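The same sequence of substitutions can be sketched in a script. This is a minimal illustration, not the workflow I actually used: it assumes the font-size information has been exported as hypothetical literal markers <code><11>…</11></code> and <code><12>…</12></code>, standing in for the character-height attribute matching that LibreOffice performs.

```python
import re

def split_table(text: str) -> str:
    """Apply the four find-and-replace steps to OCR text in which
    font-size runs are wrapped in hypothetical <11>/<12> markers."""
    # 1. Replace every line break with '#' so each record becomes one string
    #    (the LibreOffice version replaces the '$' end-of-line anchor).
    text = text.replace("\n", "#")
    # 2. Delete page numbers: '##' followed by digits.
    text = re.sub(r"##[0-9]*", "", text)
    # 3. '\n&': start a new line before each 11 pt (new alphabet) name.
    text = re.sub(r"<11>(.*?)</11>", r"\n\1", text)
    # 4. '\t&\t': surround each 12 pt (old alphabet) name with tabs.
    text = re.sub(r"<12>(.*?)</12>", r"\t\1\t", text)
    return text
```

Each record then occupies one line, with tabs marking the three-column split, ready to paste into a spreadsheet.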


For some reason these find and replace operations took ''forever'' in LibreOffice. When I broke the document into 8,000-word chunks, it went a lot faster. I pasted the results into a spreadsheet and cleaned up a few errors manually. The resulting table contained about 6,400 records, and is archived as '''[https://github.com/whanley/Ottoman-Gazetteer/blob/master/data/archived-versions/ottgaz-data-1.csv Ottgaz-data-1]'''.


=== 6. Translation of headings ===
The sixth iteration converts the headings into English, and continues the cleaning and standardization. I reversed the numbering of the containing units, so now <code>Belongs_to_1-1</code> designates the largest unit. This version, comprising 454 edits, is archived as '''[https://github.com/whanley/Ottoman-Gazetteer/blob/master/data/archived-versions/ottgaz-data-6.tsv Ottgaz-data-6]'''.


=== 7. Introduction of database elements ===
The seventh iteration begins to implement some internal references, by adding a field for ID numbers, and beginning to reference IDs in the hierarchy. It also adds a <code>Place_type</code> field, differentiating between cities and regions, and an <code>Ott_or_not</code> field, a boolean category to isolate those records which situate the place in a hierarchy entirely outside of modern Turkey. This version, comprising 818 edits, is archived as '''[https://github.com/whanley/Ottoman-Gazetteer/blob/master/data/archived-versions/ottgaz-data-7.tsv Ottgaz-data-7]'''.
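As a rough sketch of that structure (a hypothetical helper with placeholder defaults, not the project's actual tooling), the new fields could be attached to each record like this:

```python
def add_db_fields(records):
    """Add the three database-style fields described above to a list of
    records (dicts keyed by column name). Defaults are placeholders to
    be corrected by hand during review."""
    for i, row in enumerate(records, start=1):
        row["ID"] = i                       # sequential internal ID
        row.setdefault("Place_type", "")    # later: city or region
        row.setdefault("Ott_or_not", True)  # False if the hierarchy lies outside modern Turkey
    return records
```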


=== 8. Reconciliation and RDF skeleton ===