 | 1 | +:SCEP: 106  | 
 | 2 | +:Title: Document formats suitable for "source" documents  | 
 | 3 | +:Author: Raphael ‘kena’ Poss  | 
 | 4 | +:Status: Draft  | 
 | 5 | +:Type: Informational  | 
 | 6 | +:Created: 2014-06-23  | 
 | 7 | + | 
 | 8 | +Introduction  | 
 | 9 | +============  | 
 | 10 | + | 
 | 11 | +The Structured Commons model [#SCEP-100]_ is highly dependent on a  | 
 | 12 | +consensus by authors and readers about what constitutes the "source"  | 
 | 13 | +of a published document: the object fingerprint [#SCEP-101]_ used for  | 
 | 14 | +inter-document citations should identify the "essence" of a scientific  | 
 | 15 | +work, as independent as possible from its representation in various  | 
 | 16 | +formats.  | 
 | 17 | + | 
 | 18 | +This SCEP provides **guidelines and rationales** for users of written  | 
 | 19 | +documents, in particular scholarly authors, to **choose source formats  | 
 | 20 | +according to their compatibility with the Structured Commons vision**  | 
 | 21 | +and other requirements.  | 
 | 22 | + | 
 | 23 | +Summary  | 
 | 24 | +-------  | 
 | 25 | + | 
 | 26 | +The content of the following sections can be summarized as follows:  | 
 | 27 | + | 
 | 28 | +- **prefer document sources when computing fingerprints** and citing  | 
 | 29 | +  new or existing works;  | 
 | 30 | +- **publish document sources** (eg. TeX) alongside their presentation  | 
 | 31 | +  formats (eg. HTML, PDF, EPUB) and **indicate clearly which source  | 
 | 32 | +  format is used**, eg. via file name extensions or user instructions;  | 
 | 33 | +- do not underestimate the importance of **long-term durability**, a  | 
 | 34 | +  requirement not commonly honored by popular word processing  | 
 | 35 | +  software;  | 
 | 36 | +- acknowledge and do not underestimate the recent (2005-2015) user demand  | 
 | 37 | +  for **markup languages that enable fast adoption, fast editing and  | 
 | 38 | +  fast reading in source form**, eg. rST_ or Markdown_.  | 
 | 39 | + | 
 | 40 | +This SCEP is only applicable to Structured Commons objects that  | 
 | 41 | +primarily consist of written text, ie. NOT data sets, images, program  | 
 | 42 | +source code, program executables, virtual machine images, etc.  | 
 | 43 | + | 
 | 44 | +Source formats and citation network  | 
 | 45 | +===================================  | 
 | 46 | + | 
 | 47 | +It is possible to integrate printable PDFs in the Structured  | 
 | 48 | +Commons network directly; ie., compute fingerprints of PDF files  | 
 | 49 | +directly and/or cite works via their PDF fingerprints.  | 
 | 50 | +However, the Structured Commons model strongly encourages authors to  | 
 | 51 | +*publish their document sources as well*.  | 
 | 52 | + | 
 | 53 | +This requirement is already prevalent in online document libraries,  | 
 | 54 | +either from established academic publishers or in open repositories  | 
 | 55 | +like arXiv [#ARXIV]_.  Moreover, once authors make a habit of publishing  | 
 | 56 | +document sources alongside other presentation formats, it becomes  | 
 | 57 | +possible to **make fingerprints independent from document  | 
 | 58 | +representation**.  | 
 | 59 | + | 
 | 60 | +This in turn enables authors to (re-)generate alternate representations of  | 
 | 61 | +a document after it has been published, without breaking the existing  | 
 | 62 | +fingerprint-based citations from other works.  | 
 | 63 | + | 
 | 64 | +Support for multiple source formats  | 
 | 65 | +===================================  | 
 | 66 | + | 
 | 67 | +There currently exist multiple workflows and tools used by scientific  | 
 | 68 | +authors to prepare documents prior to publication.  Anecdotally,  | 
 | 69 | +this diversity is maintained and usually polarised by conflicting  | 
 | 70 | +requirements between the authors' desire for a WYSIWYG editing  | 
 | 71 | +interface and the field's requirement for high-quality print  | 
 | 72 | +typesetting and long-term portability of document formats; the  | 
 | 73 | +conflict is epitomized by this common question from graduate students  | 
 | 74 | +worldwide: *"should I use Word or LaTeX to write my thesis?"*  | 
 | 75 | + | 
 | 76 | +For various reasons, some of which are detailed below, this controversy  | 
 | 77 | +may soon be resolved *for scientific works* by a common shift away  | 
 | 78 | +from word processors, towards standard-based and document-centric  | 
 | 79 | +workflows using multiple editing tools simultaneously--including but  | 
 | 80 | +not limited to LaTeX, and also newer "lightweight" markup formats like  | 
 | 81 | +rST_ or Markdown_.  | 
 | 82 | + | 
 | 83 | +Nevertheless, this SCEP acknowledges that both technology and user  | 
 | 84 | +preferences will continue to evolve over time, and thus that *the  | 
 | 85 | +Structured Commons model should not restrict users to a single source  | 
 | 86 | +format or technology*.  | 
 | 87 | + | 
 | 88 | + | 
 | 89 | +History of source document formats  | 
 | 90 | +==================================  | 
 | 91 | + | 
 | 92 | +Historically, the following requirements have **motivated major  | 
 | 93 | +technology shifts** by authors, ie. situations where authors deliberately  | 
 | 94 | +decided to adapt their workflow and working style and accept/adopt new  | 
 | 95 | +tools and technology for source documents, even sometimes at the cost  | 
 | 96 | +of a partial feature loss from their existing habits and expectations:  | 
 | 97 | + | 
 | 98 | +.. class:: table  | 
 | 99 | + | 
 | 100 | +.. list-table::  | 
 | 101 | +   :header-rows: 1  | 
 | 102 | +   :widths: 30 10 10 40 10  | 
 | 103 | + | 
 | 104 | +   * - Requirement  | 
 | 105 | +     - Advent period  | 
 | 106 | +     - Origin  | 
 | 107 | +     - Historical motivation and shift  | 
 | 108 | +     - Casualties / compromises  | 
 | 109 | + | 
 | 110 | +   * - **sep**: Ability to specify content and layout separately,  | 
 | 111 | +       to facilitate collaboration and reuse  | 
 | 112 | +     - 1960-1990  | 
 | 113 | +     - Authors  | 
 | 114 | +     - As authors started using personal computers and collaborating  | 
 | 115 | +       with peers using digital formats, implementers were forced to  | 
 | 116 | +       provide more features to enable separation of form and  | 
 | 117 | +       content, which in turn stimulated more and more new authors to  | 
 | 118 | +       learn and use these features from the get-go.  | 
 | 119 | +     - Reduced expectation/use of fine-grained, per-character control over typography and print.  | 
 | 120 | + | 
 | 121 | +   * - **multi**: High-quality and high-fidelity support for multiple reading  | 
 | 122 | +       environments, in particular web and print  | 
 | 123 | +     - 1995-2005  | 
 | 124 | +     - Readers  | 
 | 125 | +     - This requirement from the advent of the World Wide Web forced  | 
 | 126 | +       authors to adopt tools with extensive support for *multiple  | 
 | 127 | +       output formats*, with output quality becoming a higher priority  | 
 | 128 | +       than the user interface when selecting editor programs.  | 
 | 129 | +     - Reduced expectation/use of WYSIWYG editing.  | 
 | 130 | + | 
 | 131 | +   * - **long**: Long-term durability, ability to continue working  | 
 | 132 | +       with a document long after it was created, even after the  | 
 | 133 | +       original editor program has been obsoleted, updated, etc.  | 
 | 134 | +     - 2000-2010  | 
 | 135 | +     - Authors  | 
 | 136 | +     - This requirement emerged in the early 2000s as the majority of word processor users  | 
 | 137 | +       realized that new software eventually drops compatibility with old  | 
 | 138 | +       documents. It stimulated the development and general adoption of  | 
 | 139 | +       *standard-based document languages* independent from the particular  | 
 | 140 | +       programs used to edit them.  | 
 | 141 | +     - Longer time between the definition of new editing features  | 
 | 142 | +       and general availability in authoring and reader software.  | 
 | 143 | + | 
 | 144 | +   * - **reflow**: Ability for readers/viewers to recompute a  | 
 | 145 | +       presentation layout without access to the author's editing  | 
 | 146 | +       environment  | 
 | 147 | +     - 2000-2010  | 
 | 148 | +     - Readers  | 
 | 149 | +     - This requirement from users of portable document readers and  | 
 | 150 | +       smart phones stimulated acceptance of *source delivery*,  | 
 | 151 | +       ie. of publication channels where readers/viewers have access  | 
 | 152 | +       to part or whole of the "source" document format and can  | 
 | 153 | +       recompute renderings, at will, using standards-based  | 
 | 154 | +       technology.  | 
 | 155 | +     - Reduced expectation/use of workflows  | 
 | 156 | +       where authors decide the final appearance of documents.  | 
 | 157 | + | 
 | 158 | +   * - **trans**: Transparent/human-friendly source language that enables fast adoption, and fast reading  | 
 | 159 | +       and interpretation by humans without prior processing  | 
 | 160 | +     - 2005-2015  | 
 | 161 | +     - Authors and Readers  | 
 | 162 | +     - This requirement from users who mostly communicate online with peers using  | 
 | 163 | +       lightweight client interfaces (chat, web forms, mobile apps)  | 
 | 164 | +       stimulated the creation and adoption of markup languages where *the  | 
 | 165 | +       source definition of a document is also an adequate text-only  | 
 | 166 | +       rendering*, comfortable to read and reuse in "simple" interfaces  | 
 | 167 | +       with limited or no support for formatting.  | 
 | 168 | +     - Steeper learning curve when authors start seeking more  | 
 | 169 | +       control over rendering than provided by the markup language.  | 
 | 170 | + | 
 | 171 | +Source formats vs. requirements  | 
 | 172 | +===============================  | 
 | 173 | + | 
 | 174 | +The following table illustrates how technology has evolved to respond  | 
 | 175 | +to the requirements stated above over time:  | 
 | 176 | + | 
 | 177 | +.. class:: table  | 
 | 178 | + | 
 | 179 | ++-----------------------------------------------------------------+----------------------------------------------------+  | 
 | 180 | +| Editing environments / source formats                           | Features vs. Requirements                          |  | 
 | 181 | ++-------------------+---------------------+-----------------------+------------+----------+------+----------+----------+  | 
 | 182 | +| Group             | Flavor              | Examples              |    sep     |  multi   | long |  reflow  |  trans   |  | 
 | 183 | ++===================+=====================+=======================+============+==========+======+==========+==========+  | 
 | 184 | +| Word processors   | Print-oriented      | Word_, LibreOffice_   | yes [#a]_  | no       | no   | no       | no       |  | 
 | 185 | +|                   +---------------------+-----------------------+------------+----------+------+----------+----------+  | 
 | 186 | +|                   | Online-oriented     | Dreamweaver_,         | yes [#a]_  | no       | no   | yes      | no       |  | 
 | 187 | +|                   |                     | Wordpress_, `Google   |            |          |      |          |          |  | 
 | 188 | +|                   |                     | docs`_                |            |          |      |          |          |  | 
 | 189 | ++-------------------+---------------------+-----------------------+------------+----------+------+----------+----------+  | 
 | 190 | +| Markup languages  | Print-oriented      | Troff_, TeX_, LaTeX_  | yes        | yes      | yes  | no [#b]_ | no       |  | 
 | 191 | +|                   +---------------------+-----------------------+------------+----------+------+----------+----------+  | 
 | 192 | +|                   | Online-oriented     | HTML_                 | yes        | no [#c]_ | yes  | yes      | no       |  | 
 | 193 | +|                   +---------------------+-----------------------+------------+----------+------+----------+----------+  | 
 | 194 | +|                   | Hybrid, tag-based   | Texinfo_, SGML_,      | yes        | yes      | yes  | yes      | no       |  | 
 | 195 | +|                   | markup              | `Docbook XML`_        |            |          |      |          |          |  | 
 | 196 | +|                   +---------------------+-----------------------+------------+----------+------+----------+----------+  | 
 | 197 | +|                   | Hybrid,             | rST_, Markdown_,      | yes        | yes      | yes  | yes      | yes      |  | 
 | 198 | +|                   | punctuation and     | `Wiki markup`_,       |            |          |      |          |          |  | 
 | 199 | +|                   | layout-based markup | `Org-mode`_           |            |          |      |          |          |  | 
 | 200 | ++-------------------+---------------------+-----------------------+------------+----------+------+----------+----------+  | 
 | 201 | + | 
 | 202 | +.. _Word: http://en.wikipedia.org/wiki/Microsoft_Word  | 
 | 203 | +.. _LibreOffice: http://en.wikipedia.org/wiki/LibreOffice  | 
 | 204 | +.. _Dreamweaver: http://en.wikipedia.org/wiki/Adobe_Dreamweaver  | 
 | 205 | +.. _Wordpress: http://en.wikipedia.org/wiki/WordPress  | 
 | 206 | +.. _Google Docs: http://en.wikipedia.org/wiki/Google_Docs  | 
 | 207 | +.. _Troff: http://en.wikipedia.org/wiki/Troff  | 
 | 208 | +.. _TeX: http://en.wikipedia.org/wiki/TeX  | 
 | 209 | +.. _LaTeX: http://en.wikipedia.org/wiki/LaTeX  | 
 | 210 | +.. _HTML: http://en.wikipedia.org/wiki/HTML  | 
 | 211 | +.. _Texinfo: http://en.wikipedia.org/wiki/Texinfo  | 
 | 212 | +.. _SGML: http://en.wikipedia.org/wiki/SGML  | 
 | 213 | +.. _Docbook XML: http://en.wikipedia.org/wiki/DocBook  | 
 | 214 | +.. _rST: http://en.wikipedia.org/wiki/ReStructuredText  | 
 | 215 | +.. _Markdown: http://en.wikipedia.org/wiki/Markdown  | 
 | 216 | +.. _Wiki markup: http://en.wikipedia.org/wiki/Wiki_markup  | 
 | 217 | +.. _Org-mode: http://en.wikipedia.org/wiki/Org-mode  | 
 | 218 | + | 
 | 219 | +At the time of this writing, word processors are falling out of fashion for scientific works  | 
 | 220 | +in favor of markup languages, with LaTeX historically prevalent in  | 
 | 221 | +mathematics, logic and computer science.  | 
 | 222 | + | 
 | 223 | +LaTeX vs. other markup languages  | 
 | 224 | +================================  | 
 | 225 | + | 
 | 226 | +LaTeX is commonly advertised to new scientific scholars as the go-to  | 
 | 227 | +markup language for academic publishing. LaTeX contrasts with most  | 
 | 228 | +word processing software through its long history of technical  | 
 | 229 | +stability, reliability and typeset output quality, and these  | 
 | 230 | +differences are commonly used as "selling points".  | 
 | 231 | + | 
 | 232 | +However, all users, including new authors, teachers of LaTeX and  | 
 | 233 | +existing LaTeX users, should consider how LaTeX may not fully cater for  | 
 | 234 | +recent requirements from both authors and readers:  | 
 | 235 | + | 
 | 236 | +- **client-side interpretation**: LaTeX still has only limited support  | 
 | 237 | +  for web and e-book publishing; in particular, its underlying TeX  | 
 | 238 | +  engine is designed to position words on a page, not organize text in  | 
 | 239 | +  semantic groups suitable for re-formatting in different ways by  | 
 | 240 | +  different readers.  | 
 | 241 | +- **learning curve**: LaTeX presents an extremely steep learning curve  | 
 | 242 | +  to new authors, which poses a significant barrier to adoption.  | 
 | 243 | +- **lightweight implementation**: LaTeX requires access to a working  | 
 | 244 | +  LaTeX typesetting infrastructure, including a relatively large  | 
 | 245 | +  software and data base (hundreds of megabytes), to "interpret"  | 
 | 246 | +  source documents into a format understandable by humans.  | 
 | 247 | + | 
 | 248 | +In contrast, the new generation of "lightweight markup formats" pioneered  | 
 | 249 | +by Wikipedia (`Wiki markup`_), Web fora (Markdown_) and inline  | 
 | 250 | +source code documentation (rST_) is tailored to these new requirements  | 
 | 251 | +without sacrificing the other advantages of LaTeX compared to word processors.  | 
 | 252 | + | 
 | 253 | +In short, this SCEP recommends that scientific authors **consider  | 
 | 254 | +alternate source markup languages** for new works, tailored to contemporary  | 
 | 255 | +user expectations, without sacrificing the Structured Commons vision:  | 
 | 256 | +**long-term document durability**.  | 
 | 257 | + | 
 | 258 | +Choice of markup languages  | 
 | 259 | +==========================  | 
 | 260 | + | 
 | 261 | +This SCEP recommends the following **prioritization of criteria** when  | 
 | 262 | +considering multiple candidate markup languages for new Structured  | 
 | 263 | +Commons documents, in decreasing priority order:  | 
 | 264 | + | 
 | 265 | +1. **standardisation**: how well-specified is the markup language,  | 
 | 266 | +   how many different implementations exist that have  | 
 | 267 | +   a common interpretation of the markup language, and  | 
 | 268 | +   how likely it is that tools can be re-implemented  | 
 | 269 | +   from format specifications long after current  | 
 | 270 | +   implementations have been lost.  | 
 | 271 | +2. **semantic transparency**: how much does the markup syntax  | 
 | 272 | +   suggest the semantic role of annotated content elements.  | 
 | 273 | +3. **readability in source form**: how much can still be learned and  | 
 | 274 | +   understood from a document source if all knowledge about the format  | 
 | 275 | +   and document processing machinery has been lost.  | 
 | 276 | + | 
 | 277 | +Criterion #1 promotes all standard-based workflows and formats  | 
 | 278 | +(eg. LaTeX_, rST_, Markdown_, HTML_, etc) over implementation-based workflows  | 
 | 279 | +and formats (eg. OOXML, OXF, etc.), because program-centric  | 
 | 280 | +environments have only poorly/partially standardized interchange  | 
 | 281 | +formats, and it is thus unlikely that documents can be recovered from  | 
 | 282 | +sources after current implementations fall out of use.  | 
 | 283 | + | 
 | 284 | +Criterion #2 promotes pre-structured markup languages like LaTeX_, rST_,  | 
 | 285 | +Markdown_ or HTML_ compared to general markup languages like XML, where  | 
 | 286 | +markup tags can be inscrutable without access to an externally provided  | 
 | 287 | +schema, or print-oriented typesetting languages like Troff_,  | 
 | 288 | +where markup tags specify layout and typography instead of semantics.  | 
 | 289 | + | 
 | 290 | +Criterion #3 promotes "transparent" markup languages like rST_,  | 
 | 291 | +Markdown_ or `Org-mode`_, where the source form of a document is usually also  | 
 | 292 | +conveniently readable, compared to command-based or tag-based  | 
 | 293 | +languages like LaTeX_, Texinfo_ or HTML_ which require  | 
 | 294 | +preprocessing/interpretation to become conveniently readable.  | 
 | 295 | + | 
 | 296 | +Other criteria to further discriminate between alternatives are  | 
 | 297 | +intentionally not covered by this SCEP, in order to:  | 
 | 298 | + | 
 | 299 | +1. acknowledge possibly diverging preferences by research field or  | 
 | 300 | +   user community.  | 
 | 301 | +2. acknowledge the evolution of markup languages and technology over  | 
 | 302 | +   time, in particular regarding their support for "specialized  | 
 | 303 | +   features" (eg. inline mathematical formulas), integration as inline  | 
 | 304 | +   comments in programming languages, support by popular editor  | 
 | 305 | +   programs, etc.  | 
 | 306 | + | 
 | 307 | +References  | 
 | 308 | +==========  | 
 | 309 | + | 
 | 310 | +.. [#SCEP-100] SCEP 100. "Structured Commons Model Overview".  | 
 | 311 | +   (http://www.structured-commons.org/scep0100.html)  | 
 | 312 | + | 
 | 313 | +.. [#SCEP-101] SCEP 101. "Structured Commons Object Model and Fingerprints".  | 
 | 314 | +   (http://www.structured-commons.org/scep0101.html)  | 
 | 315 | + | 
 | 316 | +.. [#ARXIV] ArXiv.org: "Why Submit the TeX/LaTeX Source?"  | 
 | 317 | +   (http://arxiv.org/help/faq/whytex)  | 
 | 318 | + | 
 | 319 | +.. [#a] Support for separation of content and presentation is present  | 
 | 320 | +   but is usually opt-in by authors.  | 
 | 321 | + | 
 | 322 | +.. [#b] Support for client-side reflowing is partially available via  | 
 | 323 | +   conversion to another markup language, typically HTML, but the  | 
 | 324 | +   conversion tools may not support all the markup used by  | 
 | 325 | +   authors.  | 
 | 326 | + | 
 | 327 | +.. [#c] Implementations focus on rendering by web browsers; alternate  | 
 | 328 | +   styling/presentation for print or e-book readers is possible  | 
 | 329 | +   but rarely or only partially supported by tools.  | 
 | 330 | + | 
 | 331 | + | 
 | 332 | +Copyright  | 
 | 333 | +=========  | 
 | 334 | + | 
 | 335 | +This document has been placed in the public domain.  | 
 | 336 | + | 
 | 337 | + | 
 | 338 | +..  | 
 | 339 | +   Local Variables:  | 
 | 340 | +   mode: rst  | 
 | 341 | +   indent-tabs-mode: nil  | 
 | 342 | +   sentence-end-double-space: t  | 
 | 343 | +   fill-column: 70  | 
 | 344 | +   coding: utf-8  | 
 | 345 | +   End:  | 