Transforming the OYA: Difference between revisions

Transforming the OYA (view source)

Revision as of 14:44, 13 July 2023

99 bytes added , 1 year ago

m

code tags

Will

Bureaucrats, Administrators

80,516

edits

@@ Line 8: / Line 8: @@
 The first step was to extract a workable table from the OCRed version of the PDF. I copied the text of the Sezen PDF and pasted it into a word processor document. Initially this was challenging, because the multi-line table entries did not conform to the column arrangement when imported into a spreadsheet. It took me a while to find a method, as I could not parse the text by all caps or by the switch between alphabets. Luckily I realized that the font size of the Arabic-alphabet text differed. I used changes in font size to split the table into three columns: new alphabet name, old alphabet name, and a single field consisting of a long string describing the administrative classification of the place as well as the larger district containing it (Unvan ve Bağlı Olduğu Nahiye, Kaza, Sancak, Eyâlet veya Vilâyet). I used find and replace with the following regular expressions to create these tab-delimited lines:
-. replace `$` with `#` (to remove all line breaks)
+. replace <code>$</code> with <code>#</code> (to remove all line breaks)
-. delete `##[0-9]*` (to remove page numbers)
+. delete <code>##[0-9]*</code> (to remove page numbers)
-. replace <code>[[:::CharHeight=11::]]</code> with `\n&` (to create a line break before each "new alphabet" administrative unit name; these are all set in 11pt font).
+. replace <code>[[:::CharHeight=11::]]</code> with <code>\n&</code> (to create a line break before each "new alphabet" administrative unit name; these are all set in 11pt font).
-. replace `[[:::CharHeight=12::]]` with `\t&\t` (to create tabs before and after every "old alphabet" administrative unit name; these are set in 12pt font).
+. replace <code>[[:::CharHeight=12::]]</code> with <code>\t&\t</code> (to create tabs before and after every "old alphabet" administrative unit name; these are set in 12pt font).
 For some reason these find and replace operations took ''forever'' in LibreOffice. When I broke the document into 8,000-word chunks, it went a lot faster. I pasted the results into a spreadsheet and cleaned up a few errors manually. The resulting table had about 6,400 records and is archived as '''[https://github.com/whanley/Ottoman-Gazetteer/blob/master/data/archived-versions/ottgaz-data-1.csv Ottgaz-data-1]'''.
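For readers who want to reproduce the find and replace steps outside a word processor, here is a minimal Python sketch of the same logic. It is not the procedure actually used. It assumes the text has already been exported as a list of <code>(text, font_size)</code> runs, and it assumes that leftover <code>#</code> markers can simply become spaces. The function name <code>runs_to_rows</code> and the sample values are hypothetical placeholders, not data from the gazetteer.

```python
import re


def runs_to_rows(runs):
    """Turn (text, font_size) runs into [new_name, old_name, description] rows.

    Assumption: 11pt runs are "new alphabet" names, 12pt runs are
    "old alphabet" names, and any other size is descriptive text.
    """
    parts = []
    for text, size in runs:
        # Replace every line break with '#', as in the first step.
        text = text.replace("\n", "#")
        # Delete page numbers, which now appear as '##' followed by digits.
        text = re.sub(r"##[0-9]*", "", text)
        # Assumption: remaining '#' markers are treated as spaces.
        text = " ".join(text.replace("#", " ").split())
        if not text:
            continue
        if size == 11:
            parts.append("\n" + text)         # line break before each new record
        elif size == 12:
            parts.append("\t" + text + "\t")  # tabs around the old-alphabet name
        else:
            parts.append(" " + text)          # description text stays in place
    lines = "".join(parts).strip("\n").split("\n")
    return [[field.strip() for field in line.split("\t")] for line in lines]


# Placeholder runs, invented for illustration only.
sample = [
    ("NEWNAME1", 11),
    ("oldname1", 12),
    ("kaza in\nExample vilayet\n\n12\n", 10),
    ("NEWNAME2", 11),
    ("oldname2", 12),
    ("sancak in Example vilayet", 10),
]
for row in runs_to_rows(sample):
    print("\t".join(row))
```

Each printed line holds the three tab-separated fields described above, so the output can be pasted into a spreadsheet in the same way.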
@@ Line 29: / Line 29: @@
 === 6. Translation of headings ===
-The sixth iteration converts the headings into English, and continues the cleaning and standardization. I reversed the numbering of the containing units, so now `Belongs_to_1-1` designates the largest unit. This version, comprising 454 edits, is archived as **[https://github.com/whanley/Ottoman-Gazetteer/blob/master/data/archived-versions/ottgaz-data-6.tsv Ottgaz-data-6]**.
+The sixth iteration converts the headings into English, and continues the cleaning and standardization. I reversed the numbering of the containing units, so now <code>Belongs_to_1-1</code> designates the largest unit. This version, comprising 454 edits, is archived as **[https://github.com/whanley/Ottoman-Gazetteer/blob/master/data/archived-versions/ottgaz-data-6.tsv Ottgaz-data-6]**.
 === 7. Introduction of database elements ===
-The seventh iteration begins to implement some internal references, by adding a field for ID numbers, and beginning to reference IDs in the hierarchy. It also adds a `Place_type` field, differentiating between cities and regions, and an `Ott_or_not` field, a boolean category to isolate those records which situate the place in a hierarchy entirely outside of modern Turkey. This version, comprising 818 edits, is archived as **[https://github.com/whanley/Ottoman-Gazetteer/blob/master/data/archived-versions/ottgaz-data-7.tsv Ottgaz-data-7]**.
+The seventh iteration begins to implement some internal references, by adding a field for ID numbers, and beginning to reference IDs in the hierarchy. It also adds a <code>Place_type</code> field, differentiating between cities and regions, and an <code>Ott_or_not</code> field, a boolean category to isolate those records which situate the place in a hierarchy entirely outside of modern Turkey. This version, comprising 818 edits, is archived as **[https://github.com/whanley/Ottoman-Gazetteer/blob/master/data/archived-versions/ottgaz-data-7.tsv Ottgaz-data-7]**.
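As an illustration of what referencing IDs in the hierarchy can look like, here is a minimal Python sketch. It is not the method used for the archived file. The field names <code>Name</code>, <code>ID</code> and <code>Parent_ID</code>, the use of <code>Belongs_to_1-1</code> as the parent field, and the function name <code>add_ids</code> are all assumptions made for the example.

```python
def add_ids(records, name_key="Name", parent_key="Belongs_to_1-1"):
    """Assign sequential IDs and link each record to a parent ID.

    A parent is linked only when exactly one record carries the
    parent's name; ambiguous or missing names are left as None.
    """
    ids_by_name = {}
    for i, rec in enumerate(records, start=1):
        rec["ID"] = i
        ids_by_name.setdefault(rec[name_key], []).append(i)
    for rec in records:
        matches = ids_by_name.get(rec.get(parent_key, ""), [])
        rec["Parent_ID"] = matches[0] if len(matches) == 1 else None
    return records


# Placeholder records, invented for illustration only.
records = [
    {"Name": "A", "Belongs_to_1-1": ""},
    {"Name": "B", "Belongs_to_1-1": "A"},
    {"Name": "C", "Belongs_to_1-1": "Z"},
]
for rec in add_ids(records):
    print(rec["ID"], rec["Name"], rec["Parent_ID"])
```

Leaving ambiguous matches unlinked means they can be resolved by hand later, which suits a table in which many places share a name.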
 === 8. Reconciliation and RDF skeleton ===
