Transforming the OYA



=== 2. Creating columns ===
The next step was to produce a single line for each place listed, with every associated unit class/district (unvan/bağlı) pair broken out. I used OpenRefine to parse the columns before and after the unvan names, which were quite standard and limited in number. Almost half of the units are associated with only one class/district pair, which is to say that they are recorded under only one date. At the other extreme, Kozan has 14 (!) pairs, which is to say that (according to Sezen) its administrative status changed 14 times over the history of the Ottoman Empire. This step took a bit more than a day. This version is archived as '''[https://github.com/whanley/Ottoman-Gazetteer/blob/master/data/archived-versions/ottgaz-data-2.tsv Ottgaz-data-2]'''.
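
The reshaping described above was done in OpenRefine. As a rough illustration only, the same idea can be sketched in Python: collect every (unvan, bağlı) pair under its place and emit one wide row per place. The record layout, the column names <code>unvan_N</code>/<code>bagli_N</code>, and the placeholder values are all assumptions made for this sketch, not the structure of the archived files.

```python
from collections import defaultdict

# Hypothetical long-format records: (place, unvan, bağlı).
# The values are placeholders, not data from the gazetteer.
records = [
    ("Place A", "kaza", "District X"),
    ("Place A", "nahiye", "District Y"),
    ("Place B", "kaza", "District X"),
]


def one_row_per_place(records):
    """Group (unvan, bağlı) pairs under their place, keeping input order."""
    grouped = defaultdict(list)
    for place, unvan, bagli in records:
        grouped[place].append((unvan, bagli))

    rows = []
    for place, pairs in grouped.items():
        row = {"place": place}
        # Number each pair so that every pair gets its own two columns.
        for i, (unvan, bagli) in enumerate(pairs, start=1):
            row[f"unvan_{i}"] = unvan
            row[f"bagli_{i}"] = bagli
        rows.append(row)
    return rows


rows = one_row_per_place(records)
# Places with exactly one pair have no second unvan column.
single_pair = sum(1 for row in rows if "unvan_2" not in row)
```

Counting the rows that lack a second pair, as in the last line, is one way to arrive at a figure like "almost half of the units have only one pair".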


=== 3. Breaking districts into multiple columns ===
Next I broke concatenated district entries (for example, İstefan→Sinop→Kastamonu→Kastamonu vilâyeti) into discrete ordered columns. This version (I think) is archived as '''[https://github.com/whanley/Ottoman-Gazetteer/blob/master/data/archived-versions/ottgaz-data-3.csv Ottgaz-data-3]'''.
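
A minimal sketch of this kind of split, written in Python rather than OpenRefine. It assumes that the arrow character shown in the example is the literal separator in the data, which may not be the case; the function name <code>split_hierarchy</code> and the <code>width</code> padding parameter are hypothetical.

```python
def split_hierarchy(chain, sep="→", width=None):
    """Split a concatenated district chain into an ordered list of units.

    If width is given, pad with empty strings so that every row ends up
    with the same number of columns.
    """
    parts = [part.strip() for part in chain.split(sep) if part.strip()]
    if width is not None:
        # A negative count yields an empty list, so long chains are untouched.
        parts += [""] * (width - len(parts))
    return parts


example = "İstefan→Sinop→Kastamonu→Kastamonu vilâyeti"
columns = split_hierarchy(example, width=5)
```

Padding to a fixed width keeps the output rectangular, which matters when the result is written back to a .tsv or .csv file.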


=== 4. Cleaning and standardizing ===
I then used [http://openrefine.org/ OpenRefine] to format and clean the data. This was a laborious process that went through several iterations. I've archived all of these iterations as .tsv files and as OpenRefine projects, though I expect that they will be of little interest or use to anyone. This fourth iteration keeps all of the bağlı hierarchical names in a single column, which is to say that it does not have a single row for each placename. It contains 38,465 rows. This arrangement let me standardize expressions of placenames and dates using OpenRefine's clustering tools. This version, comprising 885 edits, is archived as '''[https://github.com/whanley/Ottoman-Gazetteer/blob/master/data/archived-versions/ottgaz-data-4.tsv Ottgaz-data-4]'''.
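
To give a sense of what clustering does, here is a rough Python approximation of a key-collision "fingerprint" method of the kind OpenRefine offers. It is a sketch of the general technique, not OpenRefine's actual implementation, and the function names are hypothetical. Values that reduce to the same key are grouped as candidates for merging.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Reduce a string to a normalized key for key-collision clustering."""
    text = value.strip().lower()
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove punctuation, then sort and de-duplicate the tokens.
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Return groups of distinct values that share a fingerprint."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


candidates = cluster(["Kastamonu vilâyeti", "vilayeti Kastamonu.", "Sinop"])
```

Note that this normalization folds diacritics such as â, which is convenient for spotting variants but can also merge names that are genuinely different, so each proposed cluster still needs a human decision.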


=== 5. Restructuring column hierarchy ===
The fifth iteration contains only one line for every place, and begins to elaborate the hierarchical structure. It stretches to 80 columns. I standardized the bağlı hierarchical language (which contained a lot of errors, some of which no doubt remain). I have given each nesting bağlı place name its own column, so, for example, İstefan→Sinop→Kastamonu→Kastamonu vilâyeti becomes four columns. These are numbered in hierarchical sequence: "Bağlı 1 1" is the smallest district containing the place in question (İstefan in this example), and Bağlı 2 and so on are larger units. I have also broken out dates into their own columns, and standardized the titles (unvanlar), which will form part of the ontology of the gazetteer. Finally, I have amalgamated "unvan" notes and "bağlı" notes (anything other than a district type or a placename) into a single "notes" column. This version, comprising 873 edits, is archived as '''[https://github.com/whanley/Ottoman-Gazetteer/blob/master/data/archived-versions/ottgaz-data-5.tsv Ottgaz-data-5]'''.
<!-- account for multiple initial place name columns? -->


=== 6. Translation of headings ===
The sixth iteration converts the headings into English, and continues the cleaning and standardization. I reversed the numbering of the containing units, so now <code>Belongs_to_1-1</code> designates the largest unit. This version, comprising 454 edits, is archived as '''[https://github.com/whanley/Ottoman-Gazetteer/blob/master/data/archived-versions/ottgaz-data-6.tsv Ottgaz-data-6]'''.
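
The renumbering can be illustrated with a short Python sketch. It assumes the containing units for a place are available as a list ordered from smallest to largest, and it uses a simplified, hypothetical <code>Belongs_to_N</code> naming, because the meaning of the second number in <code>Belongs_to_1-1</code> is not spelled out here.

```python
def renumber_largest_first(units_smallest_first):
    """Map containing units to numbered columns, with 1 as the largest unit."""
    largest_first = list(reversed(units_smallest_first))
    return {
        f"Belongs_to_{i}": unit
        for i, unit in enumerate(largest_first, start=1)
    }


row = renumber_largest_first(
    ["İstefan", "Sinop", "Kastamonu", "Kastamonu vilâyeti"]
)
```

Numbering from the largest unit down means that column 1 always holds the same level of the hierarchy, however deep a given place is nested.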


=== 7. Introduction of database elements ===
The seventh iteration begins to implement some internal references, by adding a field for ID numbers, and beginning to reference IDs in the hierarchy. It also adds a <code>Place_type</code> field, differentiating between cities and regions, and an <code>Ott_or_not</code> field, a boolean category to isolate those records which situate the place in a hierarchy entirely outside of modern Turkey. This version, comprising 818 edits, is archived as '''[https://github.com/whanley/Ottoman-Gazetteer/blob/master/data/archived-versions/ottgaz-data-7.tsv Ottgaz-data-7]'''.
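
A minimal Python sketch of what referencing IDs in the hierarchy can look like. The field names <code>name</code>, <code>belongs_to</code>, and <code>belongs_to_ids</code>, the sequential numbering, and the "Unlisted" placeholder are assumptions for illustration and do not reflect the actual columns or ID scheme in the archived file.

```python
def assign_ids(names):
    """Give each distinct name a sequential integer ID, in first-seen order."""
    ids = {}
    for name in names:
        if name not in ids:
            ids[name] = len(ids) + 1
    return ids


def link_hierarchy(rows, ids):
    """Replace containing-unit names with IDs; unknown names become None."""
    linked = []
    for row in rows:
        linked.append({
            "ID": ids[row["name"]],
            "name": row["name"],
            "belongs_to_ids": [ids.get(parent) for parent in row["belongs_to"]],
        })
    return linked


# Hypothetical rows; "Unlisted" stands for a unit with no record of its own.
rows = [
    {"name": "Kastamonu", "belongs_to": []},
    {"name": "Sinop", "belongs_to": ["Kastamonu"]},
    {"name": "İstefan", "belongs_to": ["Kastamonu", "Sinop", "Unlisted"]},
]
ids = assign_ids(row["name"] for row in rows)
linked = link_hierarchy(rows, ids)
```

Leaving unresolved parents as <code>None</code> rather than raising an error makes it easy to list the hierarchy entries that still lack a record of their own.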


=== 8. Reconciliation and RDF skeleton ===
The eighth iteration, which I'm working on now, should be the last produced in OpenRefine. It integrates Wikidata, GeoNames, and PeriodO references and contains an RDF skeleton. This version ''will be'' archived as '''[https://github.com/whanley/Ottoman-Gazetteer/blob/master/data/archived-versions/ottgaz-data-7.tsv Ottgaz-data-8]'''.


=== Rough plans: ===
* I will combine rows that refer to the same place (many places are listed twice, under different names).
* I will give each administrative unit a number.
* I will align the dates with a periodization of sultans' epochs that I created on [perio.do PeriodO].
* I will substitute the appropriate number for each "containing district" (bağlı olduğu...) field.
* I will produce an RDF skeleton and export the data, conforming to the Pelagios [https://github.com/pelagios/pelagios-cookbook/wiki/Pelagios-Gazetteer-Interconnection-Format Gazetteer Interconnection Format].
* I will wrangle this onto the [ottgaz.org] server, using [http://aksw.org/Projects/OntoWiki OntoWiki] at least initially.
