Commit 8da1d99

New SCEP 106.
1 parent 5ca6bc7 commit 8da1d99

2 files changed: 351 additions, 2 deletions

src/content/pages/sceps/scep0000.rst

Lines changed: 6 additions & 2 deletions
@@ -25,6 +25,8 @@ history of the document is maintained in git [2]_.
 Final and Active Process SCEPs
 ------------------------------
 
+.. class:: table
+
 ====== ===================
 Num    Title
 ====== ===================
@@ -39,6 +41,8 @@ Draft SCEPs
 
 The following SCEPs are under consideration for standardization.
 
+.. class:: table
+
 ====== ===================
 Num    Title
 ====== ===================
@@ -48,11 +52,10 @@ Num Title
 |103|  Standard filesystem representation method
 |104|  Document handles and citation formats
 |105|  Standard JSON representation method
+|106|  Document formats suitable for "source" documents
 ====== ===================
 
 
-
-
 References
 ----------
 
@@ -68,3 +71,4 @@ References
 .. |103| replace:: :raw-html:`<a href="scep0103.html">103</a>`
 .. |104| replace:: :raw-html:`<a href="scep0104.html">104</a>`
 .. |105| replace:: :raw-html:`<a href="scep0105.html">105</a>`
+.. |106| replace:: :raw-html:`<a href="scep0106.html">106</a>`

Lines changed: 345 additions & 0 deletions
@@ -0,0 +1,345 @@
:SCEP: 106
:Title: Document formats suitable for "source" documents
:Author: Raphael ‘kena’ Poss
:Status: Draft
:Type: Informational
:Created: 2014-06-23

Introduction
============

The Structured Commons model [#SCEP-100]_ is highly dependent on a
consensus by authors and readers about what constitutes the "source"
of a published document: the object fingerprint [#SCEP-101]_ used for
inter-document citations should identify the "essence" of a scientific
work, as independent as possible from its representation in various
formats.

This SCEP provides **guidelines and rationales** for users of written
documents, in particular scholarly authors, to **choose source formats
according to their compatibility with the Structured Commons vision**
and other requirements.

Summary
-------

The content of the following sections can be summarized as follows:

- **prefer document sources when computing fingerprints** and citing
  new or existing works;
- **publish document sources** (eg. TeX) alongside their presentation
  formats (eg. HTML, PDF, EPUB) and **indicate clearly which source
  format is used**, eg. via file name extensions or user instructions;
- do not under-estimate the importance of **long-term durability**, a
  requirement not commonly honored by popular word processing
  software;
- acknowledge and do not under-estimate the recent (2005-2015) user demand
  for **markup languages that enable fast adoption, fast editing and
  fast reading in source form**, eg. rST_ or Markdown_.

This SCEP is only applicable to Structured Commons objects that
primarily consist of written text, ie. NOT data sets, images, program
source code, program executables, virtual machine images, etc.

Source formats and citation network
===================================

It is possible to integrate printable PDFs in the Structured
Commons network directly, ie. compute fingerprints of PDF files
directly and/or cite works via their PDF fingerprints.
However, the Structured Commons model strongly encourages authors to
*publish their document sources as well*.

This requirement is already prevalent in online document libraries,
either from established academic publishers or in open repositories
like arXiv [#ARXIV]_. Moreover, once authors make a habit of publishing
document sources alongside other presentation formats, it becomes
possible to **make fingerprints independent from document
representation**.

This in turn enables authors to (re-)generate alternate representations of
a document after it has been published, without breaking the existing
fingerprint-based citations from other works.

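As an illustration of the point above, the following Python sketch
fingerprints a document source once and shows that regenerating a
presentation does not affect the value that other works would cite.
SHA-256 over the raw source bytes is only a stand-in here: the actual
fingerprint algorithm is the one defined in SCEP 101, and all function
and variable names are hypothetical.

```python
import hashlib


def source_fingerprint(data: bytes) -> str:
    """Stand-in fingerprint: SHA-256 hex digest of the given bytes.

    Assumption: the real algorithm is specified in SCEP 101; a plain
    hash is used here only to illustrate the principle.
    """
    return hashlib.sha256(data).hexdigest()


# Hypothetical document source, written in rST.
source = b"My paper\n========\n\nSome *important* result.\n"

# Two hypothetical HTML renderings generated from that same source,
# eg. before and after a change of style sheet.
rendering_v1 = b"<h1>My paper</h1><p>Some <em>important</em> result.</p>"
rendering_v2 = b"<h1 class='t'>My paper</h1><p>Some <i>important</i> result.</p>"

# The value that citing works would record.
cited = source_fingerprint(source)

# The renderings have different fingerprints from each other...
assert source_fingerprint(rendering_v1) != source_fingerprint(rendering_v2)
# ...but the source is untouched, so existing citations stay valid.
assert source_fingerprint(source) == cited
```

Had the citation been computed from ``rendering_v1`` instead, it would
no longer match anything once only ``rendering_v2`` is distributed.
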
Support for multiple source formats
===================================

There currently exist multiple workflows and tools used by scientific
authors to prepare documents prior to publication. Anecdotally,
this diversity is maintained and usually polarised by conflicting
requirements between the authors' desire for a WYSIWYG editing
interface and the field's requirement for high-quality print
typesetting and long-term portability of document formats; the
conflict is epitomized by this common question from graduate students
worldwide: *"should I use Word or LaTeX to write my thesis?"*

For various reasons, some of which are detailed below, this
controversy may soon be resolved *for scientific works* by a common
shift away from word processors, towards standards-based and
document-centric workflows using multiple editing tools
simultaneously, including but not limited to LaTeX, and also newer
"lightweight" markup formats like rST_ or Markdown_.

Nevertheless, this SCEP acknowledges that both technology and user
preferences will continue to evolve over time, and thus that *the
Structured Commons model should not restrict users to a single source
format or technology*.


History of source document formats
==================================

Historically, the following requirements have **motivated major
technology shifts** by authors, ie. situations where authors willfully
decided to adapt their workflow and working style and accept/adopt new
tools and technology for source documents, even sometimes at the cost
of a partial feature loss from their existing habits and expectations:

.. class:: table

.. list-table::
   :header-rows: 1
   :widths: 30 10 10 40 10

   * - Requirement
     - Advent period
     - Origin
     - Historical motivation and shift
     - Casualties / compromises

   * - **sep**: Ability to specify content and layout separately,
       to facilitate collaboration and reuse
     - 1960-1990
     - Authors
     - As authors started using personal computers and collaborating
       with peers using digital formats, implementers were forced to
       provide more features to enable separation of form and
       content, which in turn stimulated more and more new authors to
       learn and use these features from the get-go.
     - Reduced expectation/use of fine-grained, per-character control over typography and print.

   * - **multi**: High-quality and high-fidelity support for multiple reading
       environments, in particular web and print
     - 1995-2005
     - Readers
     - This requirement from the advent of the World Wide Web forced
       authors to adopt tools with extensive support for *multiple
       output formats*, with output quality becoming a higher priority
       requirement when selecting editor programs than user interfaces.
     - Reduced expectation/use of WYSIWYG editing.

   * - **long**: Long-term durability, ability to continue working
       with a document long after it was created, even after the
       original editor program has been obsoleted, updated, etc.
     - 2000-2010
     - Authors
     - This requirement emerged in the early 2000s as the majority of word processor users
       faced the realization that new software eventually drops compatibility with old
       documents over time. It stimulated the development and general adoption of
       *standards-based document languages* independent from the particular
       programs used to edit them.
     - Longer time between the definition of new editing features
       and general availability in authoring and reader software.

   * - **reflow**: Ability for readers/viewers to recompute a
       presentation layout without access to the author's editing
       environment
     - 2000-2010
     - Readers
     - This requirement from users of portable document readers and
       smart phones stimulated acceptance of *source delivery*,
       ie. of publication channels where readers/viewers have access
       to part or whole of the "source" document format and can
       recompute renderings, at will, using standards-based
       technology.
     - Reduced expectation/use of workflows
       where authors decide the final appearance of documents.

   * - **trans**: Transparent/human-friendly source language that enables fast adoption and fast reading
       and interpretation by humans without prior processing
     - 2005-2015
     - Authors and Readers
     - This requirement from users who mostly communicate online with peers using
       lightweight client interfaces (chat, web forms, mobile apps)
       stimulated the creation and adoption of markup languages where *the
       source definition of a document is also an adequate text-only
       rendering*, comfortable to read and reuse in "simple" interfaces
       with limited or no support for formatting.
     - Steeper learning curve when authors start seeking more
       control over rendering than provided by the markup language.

Source formats vs. requirements
===============================

The following table illustrates how technology has evolved to respond
to the requirements stated above over time:

.. class:: table

+-----------------------------------------------------------------+----------------------------------------------------+
| Editing environments / source formats                           | Features vs. Requirements                          |
+-------------------+---------------------+-----------------------+------------+----------+------+----------+----------+
| Group             | Flavor              | Examples              | sep        | multi    | long | reflow   | trans    |
+===================+=====================+=======================+============+==========+======+==========+==========+
| Word processors   | Print-oriented      | Word_, LibreOffice_   | yes [#a]_  | no       | no   | no       | no       |
|                   +---------------------+-----------------------+------------+----------+------+----------+----------+
|                   | Online-oriented     | Dreamweaver_,         | yes [#a]_  | no       | no   | yes      | no       |
|                   |                     | Wordpress_, `Google   |            |          |      |          |          |
|                   |                     | docs`_                |            |          |      |          |          |
+-------------------+---------------------+-----------------------+------------+----------+------+----------+----------+
| Markup languages  | Print-oriented      | Troff_, TeX_, LaTeX_  | yes        | yes      | yes  | no [#b]_ | no       |
|                   +---------------------+-----------------------+------------+----------+------+----------+----------+
|                   | Online-oriented     | HTML_                 | yes        | no [#c]_ | yes  | yes      | no       |
|                   +---------------------+-----------------------+------------+----------+------+----------+----------+
|                   | Hybrid, tag-based   | Texinfo_, SGML_,      | yes        | yes      | yes  | yes      | no       |
|                   | markup              | `Docbook XML`_        |            |          |      |          |          |
|                   +---------------------+-----------------------+------------+----------+------+----------+----------+
|                   | Hybrid,             | rST_, Markdown_,      | yes        | yes      | yes  | yes      | yes      |
|                   | punctuation and     | `Wiki markup`_,       |            |          |      |          |          |
|                   | layout-based markup | `Org-mode`_           |            |          |      |          |          |
+-------------------+---------------------+-----------------------+------------+----------+------+----------+----------+

.. _Word: http://en.wikipedia.org/wiki/Microsoft_Word
.. _LibreOffice: http://en.wikipedia.org/wiki/LibreOffice
.. _Dreamweaver: http://en.wikipedia.org/wiki/Adobe_Dreamweaver
.. _Wordpress: http://en.wikipedia.org/wiki/WordPress
.. _Google Docs: http://en.wikipedia.org/wiki/Google_Docs
.. _Troff: http://en.wikipedia.org/wiki/Troff
.. _TeX: http://en.wikipedia.org/wiki/TeX
.. _LaTeX: http://en.wikipedia.org/wiki/LaTeX
.. _HTML: http://en.wikipedia.org/wiki/HTML
.. _Texinfo: http://en.wikipedia.org/wiki/Texinfo
.. _SGML: http://en.wikipedia.org/wiki/SGML
.. _Docbook XML: http://en.wikipedia.org/wiki/DocBook
.. _rST: http://en.wikipedia.org/wiki/ReStructuredText
.. _Markdown: http://en.wikipedia.org/wiki/Markdown
.. _Wiki markup: http://en.wikipedia.org/wiki/Wiki_markup
.. _Org-mode: http://en.wikipedia.org/wiki/Org-mode

At the time of this writing, word processors are falling out of fashion for scientific works
in favor of markup languages, with LaTeX historically prevalent in
mathematics, logic and computer science.

LaTeX vs. other markup languages
================================

LaTeX is commonly advertised to new scientific scholars as the go-to
markup language suitable for academic publishing. LaTeX particularly
contrasts with most word processing software with its long history of
technical stability, reliability and typeset output quality, and these
differences are commonly used as a "selling point".

However, all users, including new authors, teachers of LaTeX and
existing LaTeX users, should consider how LaTeX may not fully cater for
recent requirements from both authors and readers:

- **client-side interpretation**: LaTeX still has only limited support
  for web and e-book publishing; in particular, its underlying TeX
  engine is designed to position words on a page, not organize text in
  semantic groups suitable for re-formatting in different ways by
  different readers.
- **learning curve**: LaTeX presents an extremely steep learning curve
  to new authors, which poses a significant barrier to adoption.
- **lightweight implementation**: LaTeX requires access to a working
  LaTeX typesetting infrastructure, including a relatively large
  software and data base (hundreds of megabytes), to "interpret"
  source documents to a format understandable by humans.

In contrast, the new generation of "lightweight markup formats" pioneered
by Wikipedia (`Wiki markup`_), Web fora (Markdown_) and inline
source code documentation (rST_) is tailored to these new requirements
without sacrificing the other advantages of LaTeX compared to word processors.

In short, this SCEP recommends that scientific authors **consider
alternate source markup languages** for new works, tailored to contemporary
user expectations, without sacrificing the Structured Commons vision:
**long-term document durability**.

Choice of markup languages
==========================

This SCEP recommends the following **prioritization of criteria** when
considering multiple candidate markup languages for a new Structured
Commons document, in decreasing priority order:

1. **standardisation**: how well-specified is the markup language,
   how many different implementations exist that have
   a common interpretation of the markup language, and
   how likely it is that tools can be re-implemented
   from format specifications long after current
   implementations have been lost.
2. **semantic transparency**: how much does the markup syntax
   suggest the semantic role of annotated content elements.
3. **readability in source form**: how much can still be learned and
   understood from a document source if all knowledge about the format
   and document processing machinery has been lost.

Criterion #1 promotes all standards-based workflows and formats
(eg. LaTeX_, rST_, Markdown_, HTML_, etc.) over implementation-based workflows
and formats (eg. OOXML, OXF, etc.), because program-centric
environments have only poorly/partially standardized interchange
formats, and it is thus unlikely that documents can be recovered from
sources after current implementations fall out of use.

Criterion #2 promotes pre-structured markup languages like LaTeX_, rST_,
Markdown_ or HTML_ compared to general markup languages like XML, where
markup tags can be inscrutable without access to an externally provided
schema, or print-oriented typesetting languages like Troff_,
where markup tags specify layout and typography instead of semantics.

Criterion #3 promotes "transparent" markup languages like rST_,
Markdown_ or `Org-mode`_, where the source form of a document is usually also
conveniently readable, compared to command-based or tag-based
languages like LaTeX_, Texinfo_ or HTML_ which require
preprocessing/interpretation to become conveniently readable.

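To make criterion #3 more concrete, the following Python sketch holds
the same small fragment in rST and in LaTeX and counts how many
characters of each source are markup rather than text. The fragment,
the names and the use of a character count as a proxy for readability
are illustrative assumptions only; they are not part of this SCEP.

```python
# The same hypothetical fragment as plain text, rST source and LaTeX source.
plain_text = "Results\n\nThe method is fast and simple.\n"
rst_source = "Results\n=======\n\nThe method is *fast* and simple.\n"
latex_source = "\\section{Results}\n\nThe method is \\emph{fast} and simple.\n"


def markup_overhead(source: str, text: str) -> int:
    """Crude proxy: number of characters in the source beyond the text."""
    return len(source) - len(text)


rst_overhead = markup_overhead(rst_source, plain_text)
latex_overhead = markup_overhead(latex_source, plain_text)

# A reader without any tool sees the source as-is, so show it as-is.
print(rst_source)
print(latex_source)
print("rST overhead:", rst_overhead, "LaTeX overhead:", latex_overhead)
```

On this fragment the rST source carries less markup than the LaTeX
source, and its markup (an underline, a pair of asterisks) resembles
conventions already used in plain-text writing.
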
Other criteria to further discriminate between alternatives are
intentionally not covered by this SCEP, in order to:

1. acknowledge possibly diverging preferences by research field or
   user community.
2. acknowledge the evolution of markup languages and technology over
   time, in particular regarding their support for "specialized
   features" (eg. inline mathematical formulas), integration as inline
   comments in programming languages, support by popular editor
   programs, etc.

References
==========

.. [#SCEP-100] SCEP 100. "Structured Commons Model Overview".
   (http://www.structured-commons.org/scep0100.html)

.. [#SCEP-101] SCEP 101. "Structured Commons Object Model and Fingerprints".
   (http://www.structured-commons.org/scep0101.html)

.. [#ARXIV] ArXiv.org: "Why Submit the TeX/LaTeX Source?"
   (http://arxiv.org/help/faq/whytex)

.. [#a] Support for separation of content and presentation is present
   but is usually opt-in by authors.

.. [#b] Support for client-side reflowing is partially available via
   conversion to another markup language, typically HTML, but the
   conversion tools may not support all the markup used by
   authors.

.. [#c] Implementations focus on rendering by web browsers; alternate
   styling/presentation for print or e-book readers is possible
   but rarely or only partially supported by tools.


Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: rst
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: