Today’s LODLIB update reflects datatype normalization and quality control checks across all of our GMarc datasets (Hahn, Zahn, Harnack, Tsutsui, BeDuhn, Roth, Klinghardt, Nicolotti). We have released the full text of only the first three, since their print works are in the public domain, but we have made use of all of this normalized data in our new data tabulations (3.7) and data visualizations (3.8). While our own iterative critical edition is still in progress, the counts and graphs for all earlier editions should now remain static, so we are comfortable building these data tabulations and visualizations into forthcoming journal articles and book reviews.
In other related news, Jason BeDuhn and I are meeting later today to discuss the Westar SBL session on Q and the Gospel of Marcion. Given our overlapping scholarly work, I’m very much looking forward to the conversation. I also received just today the proofs of my forthcoming data paper for the Journal of Open Humanities Data. It’s always nice to see one’s work as it’s about to go to (digital) press.
This week’s version continues our work to build out data normalization rules and standards for the academic/scientific study of the Gospel of Marcion. We’ve had another fruitful round of feedback about our Harnack datasets and short data paper for the Journal of Open Humanities Data. If we can get peer-reviewed agreement on the normalization of Harnack’s GMarc data, then normalizing the data of all of the other GMarc reconstructions will be far easier by comparison. In the meantime, in this week’s LODLIB, we have proposed new data normalization rules for the reconstructions of GMarc by Tsutsui (1992), Roth (2015), Klinghardt (2015/2020/2021) and Nicolotti (2019).
One of the great things about the LODLIB format is the ability to visualize data while they are still in the process of peer review and correction. The slew of data visualizations I released last week (another sample below) can easily be revised and updated if and when there are legitimate peer-reviewed corrections, or when consensus emerges about data normalization standards and/or the underlying normalized data. Visualizing data is crucial to understanding their importance and recognizing their patterns, yet data are so often noisy, messy, and in fluctuation. Hence our modes of scholarly communication must adapt to accommodate these flexible processes, aiming for greater and greater clarity, fidelity, and scholarly consensus with each round of feedback and continuous improvement.
This week’s version initiates data normalization for the study of the Gospel of Marcion in concert with our freshly revised datasets for the fourth round of review of a short data paper and related datasets we have submitted to the Journal of Open Humanities Data, whose Editor-in-Chief is Barbara McGillivray at the Alan Turing Institute at Cambridge. The peer-review process has been wonderful and indeed transformative in my thinking and methodology.
The normalization of GMarc data (transforming past messy/noisy reconstructions into standardized data) will—mark my words—prove the tipping point in the transformation of the scholarly study of the canonical and non-canonical gospel strata into legitimate Data Science. In concert with our new normalization standards and normalized datasets of public domain reconstructions, we also release a slew of data visualizations illustrating the contents and relationships of all past GMarc reconstruction datasets. These visualizations clearly reinforce our scientific hypotheses and proofs that GMarc was in fact the third gospel stratum, based on two sources (the first gospel stratum, Qn, and an early version of Mark).
The age of hagiographical controlling bias and assumptions in Gospel Studies is over. The age of Gospel Data Science is upon us. Scholars can either get on board or get out of the way, but no matter what you do, you can’t stop this.
This week’s version puts us over 400,000 words. In concert with the peer review of our Harnack 1924 datasets for the Journal of Open Humanities Data, we have compiled datasets for other closely related, public domain reconstructions of Marcion’s Gospel. Today’s release features Zahn’s 1892 reconstruction, the second major reconstruction in the history of scholarship. Zahn’s edition totals 10,571 words, far fewer than Hahn’s 14,400, yet far more than Harnack’s 4,338 (revised from 4,207). The disparity between these reconstructions exemplifies how much the results of reconstruction are determined by a priori assumptions and methodologies. We anticipate adding granular word counts by passage and tradition type (single, double, triple) for the editions of Hahn and Zahn in the Data Dictionary (DD 1.6) of next week’s LODLIB update.
Lots of progress made in today’s upload. We’d specifically like to call attention to an expansion of our statistical proofs, especially in conversation with Daniel Smith’s 2019 chapter in BZNW 235, which focuses on a statistical analysis of GMarc. In the interest of facilitating access for readers, we present the bulk of the content from the page in our LODLIB that details our findings, building on Smith’s verse counts but nuancing them and challenging his starting goal (“On Not Dispensing with Any of Q”) and ultimate conclusions.
Table: Smith Verse Count: GMarc Attested as a Percentage of Lk2 (columns: GMarc Verses Attested; GMarc Attested / Lk2)
Even without questioning or changing any of the traditional contents considered secure for Q, according to Smith’s verse count approach, Q verses are the best attested of any tradition type. That is a highly significant finding on its own.
But what happens if we adjust our method to account separately for the 83 verses considered but doubted or rejected within CEQ? …
Today’s upload has several columns completed in the internal Data Dictionary (DD 1.6), a quantitative tabular comparison of major editions of Marcion’s Gospel. Several new concluding tabular calculations are also now included.
Several major quantitative findings deserve comment:
BeDuhn’s 2013 edition, while in English, stakes out a moderate position in its scope and reconstructions, especially when compared with the appearance of several new maximalist editions
Roth’s 2015 edition is highly similar to Harnack’s minimalist reconstruction
Klinghardt’s 2015/2020/2021 edition is by far the most extensive attempt to restore Marcion’s Gospel, owing significantly to his confidence in Codex Bezae as a consistent and reliable witness to its text
Nicolotti’s 2019 edition is certainly influenced by Klinghardt’s, but pulls back significantly from its reconstruction, both in the total number of passages restored and the extent of the word count restored within those passages
These quantitative findings will feature in two forthcoming reviews, one with Vigiliae Christianae focused on Klinghardt’s edition and a second, more encompassing review for another journal.
For this post, we highlight one table that illustrates the above conclusions. It consists of a compilation of the passages in each edition of Marcion’s Gospel that exceed the total number of words in the respective parallel passages in the canonical Gospel of Luke.
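The comparison behind such a table can be sketched in a few lines of Python. This is only an illustration of the technique, not the tabulation used for DD 1.6: the passage labels, edition names, and word counts below are invented placeholders, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Placeholder word counts, NOT real data from any edition or from Luke.
# Passage labels and edition names are invented for illustration only.
LUKE_COUNTS: Dict[str, int] = {"passage_A": 120, "passage_B": 85, "passage_C": 40}
EDITION_COUNTS: Dict[str, Dict[str, int]] = {
    "edition_X": {"passage_A": 110, "passage_B": 90, "passage_C": 40},
    "edition_Y": {"passage_A": 125, "passage_B": 60},
}


def passages_exceeding_luke(
    edition: Dict[str, int], luke: Dict[str, int]
) -> List[Tuple[str, int]]:
    """Return (passage, surplus) pairs where the edition has more words than Luke."""
    surplus = []
    for passage, words in edition.items():
        # Passages with no Lukan parallel in the table are skipped.
        if passage in luke and words > luke[passage]:
            surplus.append((passage, words - luke[passage]))
    return sorted(surplus)


def excess_table(
    editions: Dict[str, Dict[str, int]], luke: Dict[str, int]
) -> Dict[str, List[Tuple[str, int]]]:
    """Build one row of surplus passages per edition."""
    return {name: passages_exceeding_luke(counts, luke) for name, counts in editions.items()}


if __name__ == "__main__":
    for name, rows in excess_table(EDITION_COUNTS, LUKE_COUNTS).items():
        print(name, rows)
```

A passage whose count merely equals Luke's is not flagged; only a strict surplus counts as exceeding the canonical parallel.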
Today’s upload has several columns completed in our new section of the internal Data Dictionary (DD 1.6), a tabular comparison of major editions of Marcion’s Gospel. Some concluding calculations are also now included.
Major finding: the same internal patterns of word count distribution for Single, Double, and Triple traditions that I previously found in my reconstruction also hold true for the reconstructions of Harnack, BeDuhn and Roth. We are making good progress on compiling datasets of the editions by Klinghardt and Nicolotti, but those columns aren’t yet complete. So far, though, no matter who is doing the editing/reconstructing, the data are clear. GMarc has a systematic lack of uniquely Lukan traditions and a systematic surplus of Double and especially Triple traditions when compared to Lk2. This is one of many compelling proofs that GMarc was in fact an earlier version of Luke.
Lk2 vs GMarc Internals
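For readers who want to try this kind of comparison themselves, here is a minimal Python sketch of how word-count shares by tradition type could be compared between a reconstruction and a baseline text. The counts are invented placeholders rather than figures from any edition, and the function names are hypothetical.

```python
from typing import Dict

TRADITIONS = ("single", "double", "triple")


def tradition_shares(word_counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw word counts per tradition type into shares of the total."""
    total = sum(word_counts[t] for t in TRADITIONS)
    if total == 0:
        raise ValueError("no words counted")
    return {t: word_counts[t] / total for t in TRADITIONS}


def share_differences(
    reconstruction: Dict[str, int], baseline: Dict[str, int]
) -> Dict[str, float]:
    """Share in the reconstruction minus share in the baseline, per tradition type.

    A negative value means the tradition type is under-represented in the
    reconstruction relative to the baseline; a positive value means a surplus.
    """
    recon = tradition_shares(reconstruction)
    base = tradition_shares(baseline)
    return {t: recon[t] - base[t] for t in TRADITIONS}


# Placeholder counts, NOT real data: chosen only to make the arithmetic easy to follow.
BASELINE = {"single": 500, "double": 250, "triple": 250}
RECONSTRUCTION = {"single": 100, "double": 150, "triple": 250}

if __name__ == "__main__":
    for tradition, diff in share_differences(RECONSTRUCTION, BASELINE).items():
        print(f"{tradition}: {diff:+.2f}")
```

Comparing shares rather than raw counts keeps the comparison meaningful when the reconstructions differ greatly in total length.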
On a somewhat related note, we’ve recently joined the Association for Computational Linguistics (ACL), the ACM SIGKDD, and the Data Visualization Society. We look forward to bringing our scholarly work on the Gospels into conversation with members of these groups in conferences and publications soon and for years to come.
Today’s upload adds a significant new section to the internal Data Dictionary. DD 1.6 provides a tabular comparison of major editions of Marcion’s Gospel by Harnack, Roth, Klinghardt, Nicolotti, and myself. Thus far we have added verses, word counts, and attestation rates for the first few chapters. In future weeks, we plan to complete this table and add another section, 1.7, noting how specific linguistic features are rendered differently across these editions.
Even with the tabulations and calculations compiled thus far, the various methodological assumptions of the respective editors are already coming into focus. Klinghardt and Nicolotti consistently render more verses and more words within verses than do BeDuhn, Roth, or I. Harnack’s work is most closely followed by Roth, and both are minimalist renditions. Nicolotti follows Klinghardt most closely, and both are (overly) maximalist renditions (in my view). BeDuhn and I are moderate in our methods, attempting to render verses and words that were likely in GMarc even if not clearly attested by patristic witnesses, but not unnecessarily adding verses simply because they are present in Codex Bezae or have variant readings in the Luke manuscript tradition.
The other major addition to this version is a couple of sample pages of TEI XML for Harnack’s version of Marcion’s Gospel. This sample is intended to give readers a preliminary sense of the XML structural and tagging conventions we plan to follow for our datasets.
This week’s edition puts us at nearly 720 pages and 300,000 words. This is the week where our research really started to integrate with RStudio. We spent quite a bit of time troubleshooting Greek Unicode and UTF-8 encoding issues in RStudio on our main Windows machine and getting the Windows Subsystem for Linux up and running so we can move back and forth between RStudio in both environments. Rather than write out Unicode code points throughout our scripts, we decided to front-load this work.
Thus our Code Repository debuts with two major scripts: one that transliterates all Greek unicode characters into ASCII English letter equivalents; and another that loads both Greek and English UTF-8 txt files, then quickly and cleanly parses six vectors for use in deep Computational Linguistics analysis (whole, lemma, and morphology for both languages). With the in-book datasets and code, experts and novices in Gospel Computational Linguistics can start to evaluate and build on our research. Our Data Visualizations section (freshly reformatted to tabloid layout) also features a new section that builds on this: Top Ten Words tables and graphs for the Harnack, Roth, and CENP datasets.
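To give a sense of what transliteration and a Top Ten Words count involve, here is a minimal Python sketch. It is not the R script from the Code Repository: the mapping table is one common romanization scheme chosen as an assumption, the function names are hypothetical, and breathings, accents, and iota subscripts are simply dropped (so a rough breathing is not rendered as "h").

```python
import unicodedata
from collections import Counter
from typing import List, Tuple

# Assumed romanization scheme; the repository's own mapping may differ.
GREEK_TO_ASCII = {
    "α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e", "ζ": "z",
    "η": "e", "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m",
    "ν": "n", "ξ": "x", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "ς": "s", "τ": "t", "υ": "u", "φ": "ph", "χ": "ch", "ψ": "ps",
    "ω": "o",
}


def transliterate(text: str) -> str:
    """Map Greek letters to ASCII equivalents, dropping combining diacritics."""
    # NFD splits precomposed polytonic letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # accents, breathings, iota subscript
        lower = ch.lower()
        if lower in GREEK_TO_ASCII:
            latin = GREEK_TO_ASCII[lower]
            # Preserve capitalization of the original letter.
            out.append(latin.capitalize() if ch != lower else latin)
        else:
            out.append(ch)  # punctuation, spaces, and non-Greek characters pass through
    return "".join(out)


def top_words(text: str, n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequent whitespace-separated tokens with their counts."""
    return Counter(text.split()).most_common(n)


if __name__ == "__main__":
    print(transliterate("ἐν ἀρχῇ ἦν ὁ λόγος"))
    print(top_words("kai ho logos kai ho kai", 2))
```

Normalizing to NFD before mapping means the table only needs the 25 base letters instead of every precomposed polytonic form.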
Identification of an additional 20 signature features showing statistically significant variance between Lk1/GMarc and Lk2 that will be used in future proofs of the Schwegler hypothesis and our five hypotheses. These now include several features with disproportionately high frequencies in Lk1/GMarc compared to Lk2, not just vice versa. Many of these newly listed features are morphologically nuanced bigrams, trigrams, and quadrigrams we’ve been identifying over the past several editions of our LODLIB in DD 1.2.
Split three topics (Computational Linguistics and the Synoptic [Signals] Problem; Data Visualizations; Excursus on Related Topics) out of other areas into their own sections.
Hundreds more “clear” vocal signal tags are now assigned across any and all strata throughout the entire reconstruction in anticipation of the future compilation of NLP training datasets for each vocal stratum.
Dozens of new entries to the Data Dictionary, adding further clarification and disambiguation of the Qn, Lk1, and Lk2 vocal strata.
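The n-gram features mentioned above can be illustrated with a short Python sketch that extracts bigrams, trigrams, or longer n-grams from token lists and ranks them by the gap in relative frequency between two texts. This is an assumption-laden illustration only: the function names are hypothetical, the tokens could equally be words, lemmas, or morphology tags, and a raw frequency gap is not a significance test, so it does not reproduce the statistical proofs described in the LODLIB.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Ngram = Tuple[str, ...]


def ngrams(tokens: Sequence[str], n: int) -> List[Ngram]:
    """Return all contiguous n-token windows; empty if the text is shorter than n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_freq(tokens: Sequence[str], n: int) -> Dict[Ngram, float]:
    """Relative frequency of each n-gram among all n-grams in the text."""
    grams = ngrams(tokens, n)
    if not grams:
        return {}
    counts = Counter(grams)
    return {gram: count / len(grams) for gram, count in counts.items()}


def frequency_gaps(
    text_a: Sequence[str], text_b: Sequence[str], n: int
) -> List[Tuple[Ngram, float]]:
    """Rank n-grams by |freq in A - freq in B|, largest gap first.

    Positive gaps are n-grams over-represented in A; negative gaps are
    n-grams over-represented in B.
    """
    freq_a = relative_freq(text_a, n)
    freq_b = relative_freq(text_b, n)
    gaps = {
        gram: freq_a.get(gram, 0.0) - freq_b.get(gram, 0.0)
        for gram in set(freq_a) | set(freq_b)
    }
    # Sort by gap size, then alphabetically so ties are deterministic.
    return sorted(gaps.items(), key=lambda item: (-abs(item[1]), item[0]))


if __name__ == "__main__":
    # Placeholder token lists, not drawn from any dataset.
    a = ["a", "b", "a", "b"]
    b = ["a", "b", "c"]
    for gram, gap in frequency_gaps(a, b, 2):
        print(gram, round(gap, 3))
```

Because the ranking reports signed gaps, it surfaces features that are disproportionately frequent in either direction, which matches the point above about features running both ways.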