Slide1
Creating JATS XML from Japanese language articles and automatic typesetting using XSLT
Hidehiko Nakanishi (1), Toshiyuki Naganawa (2), Soichi Tokizane (3), Tsuyoshi Yamamoto (1)
1 Nakanishi Printing Co., Ltd., Kyoto, Japan
2 Antenna House Inc., Tokyo, Japan
3 The University of Tokyo
Slide2
Contents
Introduction
Creating Japanese XML articles in JATS
Creating PDF using AH Formatter
Challenges of Applying JATS to Japanese language texts
Future
Conclusion
Slide3
Introduction
Slide4
Many countries use non-Latin scripts
Slide5
Not all research articles are written in English.
Many articles do not even use the Latin alphabet.
Slide6
What languages are used in articles written in Japan?
Articles published in J-STAGE, the e-journal platform operated by the Japan Science and Technology Agency (JST).
University journal articles indexed in NDL-OPAC, all areas.
Slide7
We wanted a schema applicable to Japanese
Even for Japanese-language articles, e-articles are essential.
We were looking for a schema for Japanese-language articles.
Such a schema had to accept English as well.
Slide8
JATS multi-language support
In 2011, JATS 0.4 made it possible to express Japanese-language articles in XML.
J-STAGE supported JATS 0.4 immediately.
We started creating JATS XML for Japanese-language articles.
Before that
Slide9
I am from Kyoto, Japan
Bethesda
Kyoto
East Asia Kanji cultural zone
Slide10
Kyoto is a former capital
It is where my company, Nakanishi Printing, is located.
Slide11
Our Tradition
Founded in 1865 by our ancestor.
A 150-year-old family business.
One of the oldest printers.
Former building of Nakanishi Printing in the Taisho era (1912-1926)
Current building of Nakanishi Printing
Slide12
Our history
A brazier made from a woodcut printing plate (19th century)
Type picker, 1960s
Today
Slide13
This is a Japanese e-journal
The Japanese Journal of Gastroenterological Surgery
Slide14
Same page expressed in English
Slide15
Slide16
Expressing Multiple Languages
Alternate expressions for a single object are necessary.
Simple repetition of a tag can be confusing:
Two name expressions of the same person?
Or two different persons?
JATS introduced “alternatives” tags for such cases.
Slide17
“Alternatives” Tags
Two name expressions of a single person
<name-alternatives>
  <name name-style="eastern" xml:lang="ja-Jpan">
    <surname>中西</surname>
    <given-name>秀彦</given-name>
  </name>
  <name name-style="western" xml:lang="en">
    <surname>Nakanishi</surname>
    <given-name>Hidehiko</given-name>
  </name>
</name-alternatives>
Slide18
“Alternatives” tags
Slide19
How multiple languages can be expressed in JATS
Element name | Multi-language tag | Note
Article title | <trans-title> |
Article subtitle | <trans-subtitle> |
Names | <name-alternatives> |
Affiliations | <aff-alternatives> |
Collaborators | <collab-alternatives> |
Abstract | <abstract> | <abstract> is repeatable with different "xml:lang". <trans-abstract> is for articles later translated.
Keyword group | <kwd-group> | <kwd-group> is repeatable with different "xml:lang".
Generic | <alternatives> | Any component which needs multi-language data
Slide20
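As a hedged illustration of the table on the preceding slide, the fragment below sketches how a title, an abstract, and a keyword group might carry both languages. All text content is placeholder material and the surrounding metadata is abbreviated; the element names come from the table, plus <trans-title-group>, the JATS wrapper in which <trans-title> appears.

```xml
<!-- Sketch only: placeholder text, abbreviated <article-meta>.
     The article's main language is assumed to be declared on the root,
     e.g. <article xml:lang="ja">. -->
<article-meta>
  <title-group>
    <article-title>日本語の論文タイトル</article-title>
    <trans-title-group xml:lang="en">
      <trans-title>English title of the article</trans-title>
    </trans-title-group>
  </title-group>
  <!-- <abstract> repeated once per language -->
  <abstract xml:lang="ja"><p>日本語の抄録</p></abstract>
  <abstract xml:lang="en"><p>English abstract</p></abstract>
  <!-- <kwd-group> repeated once per language -->
  <kwd-group xml:lang="ja"><kwd>組版</kwd></kwd-group>
  <kwd-group xml:lang="en"><kwd>typesetting</kwd></kwd-group>
</article-meta>
```

Repeating <abstract> and <kwd-group> with different xml:lang values keeps both versions on an equal footing, whereas <trans-abstract> marks one version as a later translation.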
Creating Japanese XML articles in JATS
Slide21
Creating XML articles in JATS
We don’t have tools readily available for creating Japanese XML files.
Our method
(1) Convert Microsoft Word to Microsoft Office Open XML
(2) Convert Microsoft Office Open XML to JATS XML
(3) Validate XML
Slide22
(1) Converting Microsoft Word to Microsoft Office Open XML
MS Open XML tags
Slide23
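To give a rough idea of the MS Open XML tags mentioned on the preceding slide, here is a minimal sketch of the main document part (word/document.xml) found inside a .docx package. The paragraph content and the style name are illustrative assumptions, not taken from the authors' files.

```xml
<!-- Sketch of word/document.xml inside a .docx package; content is illustrative -->
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>                                        <!-- one paragraph -->
      <w:pPr><w:pStyle w:val="Title"/></w:pPr>   <!-- paragraph style (assumed name) -->
      <w:r>                                      <!-- one run of text -->
        <w:rPr><w:b/></w:rPr>                    <!-- run formatting: bold -->
        <w:t>中西秀彦</w:t>                       <!-- the text itself -->
      </w:r>
    </w:p>
  </w:body>
</w:document>
```

Most of this markup describes formatting rather than structure, which is why the next step has to remove or remap so many tags.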
(2) Converting Microsoft Office Open XML to JATS XML
Through XSLT, removing unnecessary tags.
Perl program processing.
We faced the difficulty of agglutinative languages:
Each word connects to the next without a space.
A computer cannot tell where one word ends and the next begins,
not even where a surname ends and a given name begins.
Slide24
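The conversion described on the preceding slide might look roughly like the XSLT 1.0 sketch below. It is not the authors' stylesheet, which the slides do not show. It assumes a simple document in which every paragraph maps to a JATS <p>, and it ignores runs nested in hyperlinks, tables, and other constructs.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    exclude-result-prefixes="w">
  <xsl:output method="xml" encoding="UTF-8"/>

  <!-- Wrap all Word paragraphs in a JATS <body> -->
  <xsl:template match="/w:document">
    <body>
      <xsl:apply-templates select="w:body/w:p"/>
    </body>
  </xsl:template>

  <!-- Each Word paragraph becomes a JATS <p>; paragraph properties are dropped -->
  <xsl:template match="w:p">
    <p><xsl:apply-templates select="w:r"/></p>
  </xsl:template>

  <!-- A run with a <w:b/> property becomes <bold>
       (simplification: an explicit w:val that switches bold off is not checked) -->
  <xsl:template match="w:r[w:rPr/w:b]">
    <bold><xsl:apply-templates select="w:t"/></bold>
  </xsl:template>

  <!-- Any other run keeps only its text; all other run formatting is discarded -->
  <xsl:template match="w:r">
    <xsl:apply-templates select="w:t"/>
  </xsl:template>
</xsl:stylesheet>
```

Selecting only the wanted children (w:p, w:r, w:t) is what removes the unnecessary tags: anything not selected never reaches the output.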
Agglutinative languages
Typical in East Asian languages
No separating spaces between words
Slide25
One sentence, one character string
Japanese: Agglutinative language using ideographs
日本語: 表意文字を用いた膠着語
Slide26
Agglutinative languages
In the old days, not even punctuation was used, i.e. multiple sentences in one character string!
Slide27
Inserting word separators
We insert separators manually.
The surname "中西" and the given name "秀彦" are written together as "中西秀彦" in an article.
It is separated as "中西@秀彦".
Possible alternatives are "中@西秀彦" and "中西秀@彦", but only a human can eliminate them.
There is no algorithm to determine it correctly.
Slide28
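Once the "@" separator from the preceding slide is in place, splitting the name becomes mechanical. The XSLT 1.0 sketch below assumes a hypothetical intermediate element, <author-name>, holding a string such as "中西@秀彦". That element name is an illustration only, and the slides do not say whether the authors do this step in XSLT or in Perl.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8"/>

  <!-- <author-name> is a hypothetical intermediate element, e.g.
       <author-name>中西@秀彦</author-name> -->
  <xsl:template match="author-name[contains(., '@')]">
    <name name-style="eastern" xml:lang="ja-Jpan">
      <!-- Text before the manually inserted separator is the surname -->
      <surname><xsl:value-of select="substring-before(., '@')"/></surname>
      <!-- Text after the separator is the given name -->
      <given-name><xsl:value-of select="substring-after(., '@')"/></given-name>
    </name>
  </xsl:template>
</xsl:stylesheet>
```

The human decision is only where to put the "@"; everything after that can be automated.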
(3) Validating XML
Use the Oxygen XML editor.
The final JATS XML is obtained, ready to be uploaded to J-STAGE.
Slide29
PDF is still necessary
For paper publishing.
For readability.
Slide30
Slide31
Creating PDF using AH Formatter
Slide32
Antenna House
AH Formatter
Slide33
XSLT
The XSLT converts a JATS file into XSL-FO, which describes the page model and formatting for the PDF.
Slide34
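A minimal sketch of the JATS-to-XSL-FO step described on the preceding slide is given below. The page size, margins, and font settings are assumptions chosen for illustration, and a production stylesheet would handle far more JATS elements than a title and paragraphs.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <!-- The JATS root element becomes the XSL-FO page model plus one page sequence -->
  <xsl:template match="/article">
    <fo:root>
      <fo:layout-master-set>
        <!-- Assumed page geometry: A4 with 20mm margins -->
        <fo:simple-page-master master-name="A4"
            page-width="210mm" page-height="297mm"
            margin-top="20mm" margin-bottom="20mm"
            margin-left="20mm" margin-right="20mm">
          <fo:region-body/>
        </fo:simple-page-master>
      </fo:layout-master-set>
      <fo:page-sequence master-reference="A4">
        <fo:flow flow-name="xsl-region-body">
          <xsl:apply-templates
              select="front/article-meta/title-group/article-title"/>
          <xsl:apply-templates select="body//p"/>
        </fo:flow>
      </fo:page-sequence>
    </fo:root>
  </xsl:template>

  <!-- Article title as a large bold block -->
  <xsl:template match="article-title">
    <fo:block font-size="16pt" font-weight="bold" space-after="12pt">
      <xsl:value-of select="."/>
    </fo:block>
  </xsl:template>

  <!-- Body paragraphs; inline markup is flattened to plain text in this sketch -->
  <xsl:template match="p">
    <fo:block text-align="justify" text-indent="1em">
      <xsl:value-of select="."/>
    </fo:block>
  </xsl:template>
</xsl:stylesheet>
```

The resulting XSL-FO file is what a formatter such as AH Formatter reads to produce the PDF.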
For Japanese rendering
AH Formatter extension
Slide35
Using Formatter for STM articles
There are no major problems.
The basic style of writing STM papers does not differ greatly between Western countries and Japan.
Word separators should be inserted in the XML in advance.
Slide36
Challenges of Applying JATS to Japanese language texts
But in Japan, exquisite typesetting is expected.
Automatic typesetting by AH Formatter may not be sufficient.
Slide37
Avoiding Line-Top Punctuation
Punctuation marks shall not come at the top of a line ⇒ also in English
「っ」 or 「ッ」 (which marks a geminate consonant) shall not come at the head of a line ⇒ Japanese rule
AH Formatter can handle these rules.
Slide38
Avoiding Word Breakup
Some words, such as personal names, shall not be broken up between lines.
We use the "Zero Width Joiner" code (&#x200D;)
e.g. 中&#x200D;西
Slide39
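Inserting the joiner by hand, as on the preceding slide, could in principle be automated. The XSLT 1.0 sketch below copies a JATS file unchanged except that it places U+200D between every pair of characters in eastern-style surnames and given names. Joining every pair is an assumption made here for illustration (the slide only shows 中&#x200D;西), and join-chars is a hypothetical helper template, not part of any library.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything that is not handled below -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>

  <!-- Text of eastern-style names gets a joiner between each pair of characters -->
  <xsl:template match="name[@name-style='eastern']/surname/text()
                     | name[@name-style='eastern']/given-name/text()">
    <xsl:call-template name="join-chars">
      <xsl:with-param name="text" select="normalize-space(.)"/>
    </xsl:call-template>
  </xsl:template>

  <!-- Hypothetical helper: emits the first character, then a Zero Width Joiner
       and the rest of the string, recursively -->
  <xsl:template name="join-chars">
    <xsl:param name="text"/>
    <xsl:value-of select="substring($text, 1, 1)"/>
    <xsl:if test="string-length($text) &gt; 1">
      <xsl:text>&#x200D;</xsl:text>
      <xsl:call-template name="join-chars">
        <xsl:with-param name="text" select="substring($text, 2)"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

Restricting the rule to name-style="eastern" leaves Western names, which already contain breakable spaces, untouched.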
Positioning Figures/Tables
Figures and tables should be positioned on the SAME page as the corresponding text.
This requires customized XSLT, sometimes for each figure and table.
This increases cost.
Slide40
Positioning Figures/Tables
Every article needs these XSLTs.
Slide41
Future
What is to be done next
Vertical writing
Emphasis or “Kenten”
Warichu
Slide42
Vertical writing
Traditionally, Japanese (like Chinese and Korean) is written from top to bottom.
Slide43
Vertical Writing
Vertical writing causes some interesting problems, such as the orientation of Arabic numerals and Latin letters.
A new element for direction is necessary, such as <writing-direction="vertical">.
Slide44
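On the output side, XSL-FO already has a writing-mode property, so the markup proposed on the preceding slide could be mapped roughly as in the sketch below. Note that writing-direction is only the slide's proposal and does not exist in JATS; attaching it to <sec> as an attribute is a further assumption made here for illustration.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <!-- Drop whitespace-only text nodes between the children of <sec> -->
  <xsl:strip-space elements="sec"/>

  <!-- Hypothetical attribute: writing-direction is not part of JATS today -->
  <xsl:template match="sec[@writing-direction='vertical']">
    <!-- tb-rl: lines run top to bottom, columns progress right to left -->
    <fo:block-container writing-mode="tb-rl">
      <xsl:apply-templates/>
    </fo:block-container>
  </xsl:template>

  <!-- Section titles and paragraphs become blocks inside the container -->
  <xsl:template match="sec/title">
    <fo:block font-weight="bold"><xsl:apply-templates/></fo:block>
  </xsl:template>

  <xsl:template match="p">
    <fo:block><xsl:apply-templates/></fo:block>
  </xsl:template>
</xsl:stylesheet>
```

This covers only the direction of the text flow; the orientation of Arabic numerals and Latin letters within vertical lines would still need separate handling.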
Emphasis
Emphasis or “Kenten”
It is like boldface and italics in English.
We use <styled-content> and an AH Formatter extension to express this today.
We need a generic tag, <emphasis>.
Slide45
Warichu
Vertical writing texts sometimes contain notes called “Warichu”.
Warichu uses 2 lines within a parent line.
Slide46
Warichu
Historical document example
Slide47
Suggestion
Additional tags for
Vertical writing
Emphasis or “Kenten”
Warichu
Slide48
Conclusion
JATS opened a new horizon in processing Japanese-language articles.
No major difficulties.
UTF-8, an encoding for XML, also makes it possible to express most Japanese characters correctly.
Slide49
Conclusion
There are still remaining issues in processing non-Latin, agglutinative languages such as Japanese.
Challenges
Word separators have to be inserted manually
Line break issues
Positioning figures and tables correctly
Slide50
Heaven/Earth/Man
http://artnews.blog.so-net.ne.jp/2011-04-22
Slide51
Structure vs. Expression
In pictographic/ideographic writing systems, authors and publishers care more about appearance and layout than those in the Western world.
Calligraphy
We sometimes need to describe such looks/layouts in XML.
This may or may not be solved by extending JATS.
Slide52
Is JATS applicable?
“Kaitai shinsho”, the first translation of a Western medical book, 1774.
Slide53
Is JATS applicable?
“Amma tebiki”
Eastern medical textbook (1835)
Slide54
Thank you