GEDCOM 1:1 to XML  


ged1212xml: GEDCOM one-to-one to XML

Abstract

Scripts (awk, or JavaScript for the Windows Script Host, WSH) that convert GEDCOM data (GEnealogical Data COMmunication) one-to-one into well-formatted XML; the output is only as well-formed and valid as the GEDCOM source. Even very large GED files should be convertible with only modest system resources. The target format, defined elsewhere as “GEDCOM 5.5 XML”, enables validation against the GEDCOM 5.5 standard using XML mechanisms and schemata (RelaxNG + Schematron).


About / Motivation

The scripts (“ged1212xml”) provided on this page attempt a close (~100%) one-to-one conversion of GEDCOM data into a very simple XML, following a project by Chad Albers called “GEDCOM 5.5 XML” (“gedcom55XML”) …

“… all GEDCOM tags are translated into XML elements; open and closed elements delimit the data; and the elements are nested in the same way prescribed by the GEDCOM specification. (…)
GEDCOM 5.5 XML attempts to be a 100 percent one-to-one translation of GEDCOM 5.5 into XML; it even includes the superfluous (and empty) <TRLR/> element.” [¹]
“GEDCOM 5.5 XML differentiates itself from GedML and GeniML because it attempts to replicate the LDS's GEDCOM 5.5 standard using XML markup. Without exception, all GEDCOM 5.5 tags should correspond to XML elements with the same name; all tags should be preserved; the parent-child relationships between the tags and elements should parallel one another; and all data delimited by the elements should fall within the strict guidelines of the standard.” [²]
[¹] Chad Albers at neomantic.com/gedcom55XML
[²] cf. the README notes on the competing GedML & GeniML for his gedcom55XML approach.

Searching the web for GEDCOM & XML is rather disappointing nowadays, apart from a very few sites that are surprisingly up to date and still active on this topic. Most projects from the past “GEDCOM to XML” hype made large promises about exploiting all the XML capabilities:

  • more semantics
  • more data-structures
  • more cross-references
  • more encodings
  • more extendable
  • more readable (source)
  • more free tools
  • more …, &c

All true, all possible. But most of these projects remained drafts and seem to have been abandoned, while even the task of a 1:1 translation isn’t really completed.

Installing and running “big” genealogical software (mostly shareware) simply to import GEDCOM (with hidden loss of data?) and export some XML (more data loss?) is a fussy and risky way to make all the genealogical data already collected available for further XML processing. Filtering (import to export) is not the task such software is made for. The background history of Tim Forsythe’s GEDCOM Validator tells more of the whole story.

Technologies go by, data stay. Data + format (as generic markup, attaching computable semantics to data) represent the real hard effort humans put into research, structure and validity. They must be preserved through all changes of technology. Loss of data, de-structuring (which makes data less readable and usable for humans and machines), and opaque formats are the worst faults. They result in data cemeteries that have to be touched and checked by humans again and again. Content without open-standards markup is dead, uncomputable, until reanimated by human review.

So why this fallback to a seemingly simplistic, half-buried approach? To a project that Michael H. Kay, the pioneer he is in many respects, began with GedML, and that everyone everywhere refers to? What are the remaining benefits?

First of all, it’s a simple format, easy to create, easy to control, and it’s not new. Earlier efforts (XSLT stylesheets) may be reused with only minor changes. The roots in the established GEDCOM 5.5 standard, which XML never succeeded in replacing or inheriting, aren’t cut off. Getting your hands on the genealogical data is straightforward, and the next processing step can already be done with XML/XSL tools, e.g. transforming the data into more ambitious XML dialects. On this, cf. Bill Kinnersley’s GEDC documentation, his XML-based standard and application, which is well worth reading although it unfortunately uses an intermediate XML format of its own.

Last but not least (see Chad Albers’ approach with RelaxNG/Schematron), the structure and data types of a GED file can be validated against the GEDCOM 5.5 standard using XML mechanisms, provided they remain nearly unchanged. The scripts aim to be a possible replacement for the first step in his workflow (which continues to XSL-FO formatting objects and PDF).

Script-, GEDCOM-, and XML-Gurus: Interested in testing or even using this?
Please let me know about the good, the bad, and the ugly things. You are welcome.

Download

History

  1. [2008-09-18] – pre-release (testers)
  2. [2008-10-01] – initial release (public)
  3. [2008-10-11]
    • option added to differ slashed from tagged surname-parts by node-naming
    • XML-output additionally formatted with blank lines for easier “visual parsing”
    • GEDnoopp.awk added to archive to (un-)format GED-files likewise, as shown below
  4. [2008-11-20]
    • ged1212xml.rev.xsl XSLT-stylesheet added for rudimentary reverse transformation

Archive Contents

ged1212xml.awk (preview: ged1212xml.awk.htm)
awk-script to translate (hopefully) any GEDCOM file one-to-one to XML. It is “stand alone”, i.e. the ANSEL-to-Entity routines are already included.
ANSELentify.awk
ANSEL-to-Entity as an awk-script of its own. Do not run it before ged1212xml! All the entities it inserts would be deactivated by the translation of ampersands to &amp;, which is not what is intended.
ANSELentify.sed
ANSEL-to-Entity as a sed-script. Just another offspring, for the “Stream EDitor”.
ged1212xml.wsf (preview: ged1212xml.wsf.htm)
JavaScript to be run in the “Windows Script Host” (WSH) engine. Unlike the awk-script above, it is not “stand alone” but imports (i.e. requires) ANSELentify.js at runtime.
ANSELentify.js
ANSEL-to-Entity code imported and executed by ged1212xml.wsf.
GEDnoopp.awk
“GEDCOM normalize or pretty print”: formats a GED-file with indents and blank lines (to group records visually) in a first run; “normalizes” (removes, undoes) the formatting according to the standard in a second run to restore validity.
ged1212xml.rev.xsl (preview: ged1212xml.rev.xsl.htm)
Reverse Transformation Stylesheet (XML back to GEDCOM). Not quite 1:1, but usable “cum grano salis”. For limitations see the remarks below.
The ANSELentify.* files depend heavily on (conversions of) “ans2uni.con” (ZIP), and I owe many thanks to “Heiner Eichmann’s GEDCOM 5.5 Sample Page: ANSEL to Unicode conversion” and his “ANSEL to Unicode Conversion Tool”.

Preview GEDCOM-, Script-, XML-Sources

The source/code previews are simple HTML exports from the Scintilla-based text editor SciTE.

Siebold’s GEDCOM is indented and foldable just for readability. Indented lines (any leading whitespace!) and empty lines do not conform to the GEDCOM specification, but ged1212xml tolerates and ignores them.

  1. ged1212xml.awk.htm – awk-code
  2. ged1212xml.wsf.htm – JavaScript/WSH-code
  3. ged1212xml.rev.xsl.htm – XSLT-code for reverse transformation
  4. siebold.GED.htm – Philipp Franz von Siebold’s GEDCOM formatted by GEDnoopp
  5. siebold.GED.xml.htm – Philipp Franz von Siebold’s gedcom55XML made by ged1212xml
  6. Ged2HTML Web-Presentation – Philipp Franz von Siebold’s lineage made by Ged2HTML

Get precompiled Win32 awk binaries/variants

  • gawk – GnuWin32.sourceforge.net
  • mawk – GnuWin32.sourceforge.net
  • nawk – GnuWin32.sourceforge.net

I am no real fan of utilities and runtimes with a high system impact (like Java or DotNET), nor of programming-languages requiring big downloads (like Perl, Python, Ruby, etc.). They may however be better suited to elegant solutions.

So I tried “awk” and the “Windows Script Host” (WSH + JavaScript). The latter is available on every MS Windows; “awk” is standard on all Linuxes/Unixes, and for Win32 users it is a download of negligible size that only requires unpacking a single binary/executable file (no install, no setup).

Get more GEDCOM files/infos


Usage of ged1212xml.awk

ged1212xml.awk
USAGE: [g|m|n]awk [-v var=value [-v …]] -f ged1212xml.awk 
       [<]infile.GED [>outfile.XML] [2>error.LOG]
NOTES: v-Options are required to be set before f-Options
OPTIONS:
  -v ANSEL=0|1
      start "ANSEL to Entity"-mode before 1st occurrence of +n CHAR ANSEL
  -v nsPFX=""|<nmtoken>
      xml namespace prefix, requires setting of nsURI too, default=none
  -v nsURI=""|<uri>
      xml namespace URI for xmlns[:nsPFX]="…", default=none
  -v xmlEnc="iso-8859-1"|<encoding>
      replace xml declaration's default <?xml … encoding="iso-8859-1"?>
  -v xmlStyle=""|<file.css|file.xsl>
      insert processing-instruction <?xml-stylesheet href="…"?>, default=none
  -v xmlRoot="GED"|<nmtoken>
      replace root-element's default tag-name "GED"
  -v xmlID="ID"|<nmtoken>
      replace attribute-name's default "ID" for GEDCOM's @<XREF>@s
  -v xmlIDREF="REF"|<nmtoken>
      replace attribute-name's default "REF" for GEDCOM's @<XREF>@s
  -v xmlDTD=""|<file.dtd>
      insert doctype-definition <!DOCTYPE … SYSTEM "…">, default=none
  -v xsiXSD=""|<file.xsd>
      insert root's xsi:XMLSchema-instance-location-definition, default=none
  -v idPFX=""|"id."|"ged:"|<nmtoken>
      ID-prefix to create valid xmlID/REF-values, default=none
      ID-prefix == string-additive, don't confuse it with namespace-prefixes!
  -v escDATE=""|"ESC"|<nmtoken>
      given name ("ESC" preferred, default=none=noop) 
      moves @#<DATE_CALENDAR_ESCAPE>@s into attributes
  -v surNAME="SURN"|"S"|<nmtoken>|<!nmtoken>
      alter node-name ("S" preferred, default="SURN") for slashed surname-part
      to avoid double SURN-subnodes in an extended NAME-node/structure
      a non-nametoken char/string prevents slash-replacement at all
-v RS="\r"
Two GED files in the “GEDCOM 5.5 Torture Test” package end their lines with a single carriage return. This option sets awk’s input record separator RS (a built-in variable) to that line-break variant.
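
For a JavaScript-based reader, the same tolerance for CR-only line ends could be sketched as follows. This is a minimal illustration only; `splitGedLines` is a hypothetical helper name and not part of ged1212xml.wsf.

```javascript
// Hypothetical helper (not taken from ged1212xml.wsf): split GEDCOM text
// into lines, accepting CRLF, LF, or a lone CR as the line terminator.
function splitGedLines(text) {
  return text.split(/\r\n|\r|\n/).filter(function (line) {
    return line.length > 0;            // empty lines are dropped
  });
}
```

Listing `\r\n` first in the pattern keeps a Windows line end from being counted as two separate breaks.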
example: cromwell.cfg.awk
BEGIN {
    ANSEL   = 1 ;
    nsPFX   = "g" ;
    nsURI   = "urn:xmlns:gedcom55XML" ;
    xmlID   = "xml:id" ;
    idPFX   = "gid:" ;
    surNAME = "S" ;
    escDATE = "ESC" ;
}
“awk” allows multiple f-options, so users can collect all v-options specific to a project in a configuration file:
awk -f cromwell.cfg.awk -f ged1212xml.awk cromwell.ged > cromwell.ged.xml
The configuration above overuses the options, but for the sake of demonstration, here is a possible result for an INDI node/record (as exported by “Heredis”):
0 HEAD
  1 SOUR HEREDIS 7 PC
...
0 @221I@ INDI
  1 NAME Sir Oliver/CROMWELL/
    2 GIVN Sir Oliver
    2 SURN CROMWELL
  1 SEX M
  1 BIRT
    2 DATE @#DJULIAN@ 1563
  1 DEAT
    2 DATE 1655
  1 FAMS @317U@
  1 FAMS @227U@
  1 FAMC @204U@
...

<g:GED xmlns:g="urn:xmlns:gedcom55XML">
...
<g:INDI xml:id="gid:221I">
  <g:NAME>Sir Oliver<g:S>CROMWELL</g:S>
    <g:GIVN>Sir Oliver</g:GIVN>
    <g:SURN>CROMWELL</g:SURN>
  </g:NAME>
  <g:SEX>M</g:SEX>
  <g:BIRT>
    <g:DATE ESC="DJULIAN">1563</g:DATE><?DATE 1563-00-00?>
  </g:BIRT>
  <g:DEAT>
    <g:DATE>1655</g:DATE><?DATE 1655-00-00?>
  </g:DEAT>
  <g:FAMS REF="gid:317U"/>
  <g:FAMS REF="gid:227U"/>
  <g:FAMC REF="gid:204U"/>
</g:INDI>
...
The XML structure is namespaced and prefixed. XREFs are turned into valid ID/IDREF values in a similar way (the gid: prefix mimics a namespace prefix just so that the value starts with a letter). The attribute name xml:id tells a capable parser that its value is of type “ID”, even without a DTD or schema. The date-calendar-escape is moved into an ESC attribute; as a side effect this enables a transformation of the date into ISO standard format, appended as a processing instruction. The slashed surname part (now a <g:S> node) can be told apart from the tagged surname part (the <g:SURN> node).
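
The basic one-to-one nesting shown in this example can be sketched in a few lines of plain JavaScript. This is not the code of ged1212xml.wsf, only a minimal illustration under stated assumptions: input lines are reasonably well-formed, levels never jump by more than one, and the names `gedToXml` and `escapeXml` are hypothetical. Namespaces, ID prefixes, escapes and surname handling are left out.

```javascript
// Minimal sketch, NOT the actual ged1212xml code: nest GEDCOM lines by level.
function escapeXml(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;")
          .replace(/>/g, "&gt;").replace(/"/g, "&quot;");
}

function gedToXml(lines, root) {
  var out = ["<" + root + ">"];
  var open = [];                       // stack of currently open tag names
  var linePattern = /^\s*(\d+)\s+(?:@([^@]+)@\s+)?(\w+)(?:\s+(.*))?$/;
  lines.forEach(function (line) {
    var m = linePattern.exec(line);
    if (!m) return;                    // lines that do not fit are skipped here
    var level = Number(m[1]), xref = m[2], tag = m[3], value = m[4] || "";
    while (open.length > level) {      // close deeper nodes and the previous sibling
      out.push("</" + open.pop() + ">");
    }
    var attrs = xref ? ' ID="' + escapeXml(xref) + '"' : "";
    var pointer = /^@([^@#]+)@$/.exec(value);
    if (pointer) {                     // a pure pointer value becomes a REF attribute
      attrs += ' REF="' + escapeXml(pointer[1]) + '"';
      value = "";
    }
    out.push("<" + tag + attrs + ">" + escapeXml(value));
    open.push(tag);
  });
  while (open.length) out.push("</" + open.pop() + ">");
  out.push("</" + root + ">");
  return out.join("");
}
```

The stack depth always equals the level of the next expected child, which is why a single comparison against the incoming level is enough to decide how many elements to close.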

Usage of ged1212xml.wsf (JavaScript with WSH)

ged1212xml.wsf (imports ANSELentify.js)
USAGE: cscript //nologo ged1212xml.wsf [/name:value […]] 
       [/ged:infile.ged] [/xml:outfile.xml] [/log:error.log]
       [<stdin.ged] [>stdout.xml] [2>stderr.log]
NOTES: double slashes for cscript-arguments, e.g. //nologo, 
       single slashes for wsf-arguments, as below
OPTIONS:
  FILES
    /ged:<file.ged>
      GEDCOM input-filename, default=STDIN
    /xml:<file.xml>
      XML output-filename, default=STDOUT
    /log:<file.log>
      Logging output-filename, default=STDERR
  GED-INPUT-ENCODING-MODE
    /ans:true
      start "ANSEL to Entity"-mode before 1st occurrence of +n CHAR ANSEL
  XML-OUTPUT
    /pfx:<nmtoken>
      xml namespace prefix, requires setting of /uri:<URI> too, default=none
    /uri:<URI>
      xml namespace URI for xmlns[:nsPFX]="…", default=none
    /enc:"iso-8859-1"|<encoding>
      replace xml declaration's default <?xml … encoding="iso-8859-1" ?>
    /sty:<file.css|file.xsl>
      insert processing-instruction <?xml-stylesheet href="…"?>, default=none
    /root:"GED"|<nmtoken>
      replace root-element's default tag-name "GED"
    /id:"ID"|<nmtoken>
      replace attribute-name's default "ID" for GEDCOM's @<XREF>@s
    /ref:"REF"|<nmtoken>
      replace attribute-name's default "REF" for GEDCOM's @<XREF>@s
    /dtd:""|<file.dtd>
      insert doctype-definition <!DOCTYPE … SYSTEM "…">, default=none
    /xsd:""|<file.xsd>
      insert root's xsi:XMLSchema-instance-location-definition, default=none
    /ifx:""|"id."|"ged:"|<nmtoken>
      ID-prefix to create valid xmlID/REF-values, default=none
      ID-prefix == string-additive, don't confuse it with namespace-prefixes!
    /esc:""|"ESC"|<nmtoken>
      given name ("ESC" preferred, default=none=noop) 
      moves @#<DATE_CALENDAR_ESCAPE>@s into attributes
    /sur:"SURN"|"S"|<nmtoken>|<!nmtoken>
      alter node-name ("S" preferred, default="SURN") for slashed surname-part
      to avoid double SURN-subnodes in an extended NAME-node/structure
      a non-nametoken char/string prevents slash-replacement at all

Special behaviour and features

ANSEL-to-Entity

About ANSEL processing: by default it is switched on only at the first occurrence of a GEDCOM header line "+n CHAR ANSEL", and possibly off again if a similar line announces another encoding. For files that use ANSEL characters in header text even before that line, the scripts provide an option to handle ANSEL from the very beginning.
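
The switching behaviour described above might look like this in plain JavaScript. It is a sketch under assumptions, not the scripts’ real code: `makeAnselTracker` is a hypothetical name, and the `startInAnsel` argument only mirrors the idea behind the ANSEL=1 and /ans:true options.

```javascript
// Sketch only: decide per line whether ANSEL-to-Entity conversion is active.
// startInAnsel mimics "handle ANSEL from the very beginning".
function makeAnselTracker(startInAnsel) {
  var active = Boolean(startInAnsel);
  return function (line) {
    var m = /^\s*\d+\s+CHAR\s+(\S+)/.exec(line);
    if (m) {
      // a CHAR line switches the mode on (ANSEL) or off (anything else)
      active = (m[1].toUpperCase() === "ANSEL");
    }
    return active;
  };
}
```

The tracker is called once per input line, in order, and reports the mode that applies from that line onwards.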

Processing-Instructions

This is experimental. Think of gedcom55XML as a minimal prescription that can be enhanced with additional nodes (elements and attributes in other namespaces) or with side instructions.

Analysing or transforming text (text-node values) with XSLT is difficult. Where the scripts can do better, a named processing instruction (“PI”) carrying additional results may be inserted immediately after the closing tag of an element. PIs are not declared in a DTD or schema and do not break validation, but they can easily be accessed with XSLT.

Currently this is done when a valid English date form is available in "+n DATE <DATE_EXACT>" lines. In this case the scripts append an ISO form of the date as …</DATE><?DATE yyyy-mm-dd?>.

In other words, as a general rule: a PI should contain just another representation – e.g. prepared according to standards and usability – of the preceding element’s value.
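
To illustrate the rule, here is a hedged sketch of how such a date PI could be derived. It assumes only the simple forms "dd MON yyyy", "MON yyyy" and "yyyy" with English month abbreviations, and fills missing parts with 00 as in the Cromwell example above; `isoDatePI` and `pad` are hypothetical names, and the real scripts may accept more or fewer forms.

```javascript
// Sketch: build a <?DATE yyyy-mm-dd?> PI from a simple English GEDCOM date.
var MONTHS = { JAN: 1, FEB: 2, MAR: 3, APR: 4, MAY: 5, JUN: 6,
               JUL: 7, AUG: 8, SEP: 9, OCT: 10, NOV: 11, DEC: 12 };

function pad(n, width) {
  var s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

function isoDatePI(value) {
  // optional "day month" or "month" part, followed by the year
  var m = /^(?:(?:(\d{1,2})\s+)?([A-Z]{3})\s+)?(\d{1,4})$/
            .exec(value.trim().toUpperCase());
  if (!m) return null;                 // not a plain date: no PI
  var month = 0;
  if (m[2]) {
    if (!MONTHS.hasOwnProperty(m[2])) return null;   // e.g. ABT, BEF
    month = MONTHS[m[2]];
  }
  var day = m[1] ? Number(m[1]) : 0;
  return "<?DATE " + pad(m[3], 4) + "-" + pad(month, 2) + "-" + pad(day, 2) + "?>";
}
```

Returning null for anything that is not a plain date keeps the PI optional: the element’s original value is never altered, only supplemented.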

Date calendar escapes

The DATE line again, but no decision yet: what to do with the “date-calendar-escape” sequences? Do they need special treatment, or are they just like any other value? By analogy with IDs and REFs, a sequence that is enclosed in “@” (regular expression: /@#D(GREGORIAN|JULIAN|HEBREW|FRENCH R|ROMAN|UNKNOWN)@/) and has a comparable meta aspect should be moved into an attribute, e.g. <DATE ESC="DTOKEN">. The scripts provide an option for testing this.

BTW: the possible whitespace inside the "FRENCH R" token/pattern (French Revolutionary Calendar) is annoying. Under certain circumstances the sequence is split into separate fields and requires yet another exception to be handled.
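
A minimal JavaScript sketch of the idea, using the regular expression quoted above and matching on the whole value so that the blank in "FRENCH R" causes no field-splitting trouble. The names `splitCalendarEscape` and `dateElement` are hypothetical, and XML escaping of the remaining text is omitted for brevity.

```javascript
// Sketch: move a leading date-calendar-escape into an attribute.
var CALENDAR_ESCAPE =
  /^@#(D(?:GREGORIAN|JULIAN|HEBREW|FRENCH R|ROMAN|UNKNOWN))@\s*/;

function splitCalendarEscape(value) {
  var m = CALENDAR_ESCAPE.exec(value);
  return m ? { esc: m[1], rest: value.slice(m[0].length) }
           : { esc: null, rest: value };
}

// escName is the attribute name (e.g. "ESC"); an empty name means "no-op".
function dateElement(value, escName) {
  var parts = splitCalendarEscape(value);
  if (!escName || !parts.esc) {
    return "<DATE>" + value + "</DATE>";
  }
  return "<DATE " + escName + '="' + parts.esc + '">' + parts.rest + "</DATE>";
}
```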

Slashes to “S” vs “SURN” etc

Slashes delimit and mark the surname part (like /surname/) of a NAME structure. By default they are converted to a SURN subnode, although there is no hint or convention for naming this node. A problem may occur if a SURN tag is present as well (according to the standard it is optional and, note, additional), which duplicates the SURN node. To avoid this, or to prevent slash replacement altogether, the node name can be altered by an option, preferably to the “S” of Kay’s GedML. A non-nametoken char/string (not of type “NMTOKEN”) switches off any replacement and leaves the slashed surname part unchanged.
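
The replacement rule can be sketched as follows. This is an illustration under assumptions rather than the scripts’ implementation: `nameToXml` is a hypothetical name, and the name-token test is deliberately simplified to ASCII letters, digits, underscore, dot and hyphen.

```javascript
// Sketch: turn the slashed surname part of a NAME value into a subnode.
function escapeXml(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function nameToXml(value, surName) {
  var text = escapeXml(value);         // escaping never introduces slashes
  // simplified name-token check; anything else disables the replacement
  if (!/^[A-Za-z_][\w.\-]*$/.test(surName)) return text;
  return text.replace(/\/([^\/]*)\//, function (whole, surname) {
    return "<" + surName + ">" + surname + "</" + surName + ">";
  });
}
```

With "S" as node name the Cromwell value becomes `Sir Oliver<S>CROMWELL</S>`; with a non-name string such as "/" the value is passed through unchanged.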

Another usage might be to copy the element-naming of similar GEDCOM/XML-approaches. A configuration like this …

GeniML.cfg.awk
BEGIN {
    xmlRoot  = "GENIML" ;
    xmlStyle = "pedigree.xsl" ;
    surNAME  = "SURNAME" ;
}
… replicates J. Fitzpatrick’s “GeniML”, so that his pedigree stylesheet can be applied in a second step:
siebold.GeniML.htm – Siebold’s data transformed by pedigree.xsl.

Valid XML ID/IDREF-values

There are two problems to solve. (1) Some genealogical programs create @<XREF>@s with leading digits, which conforms to GEDCOM but not to XML attribute values of type “ID/IDREF”. (2) IDs must remain unique, even if an application or transformation populates the XML file with non-GED IDs for other purposes.

Neither problem is critical. Attributes named “ID” or “REF” need not be of type “ID/IDREF”; it is just that the ID mechanisms provided by XML are then not usable as usual. It is up to you whether a fallback to key and string comparison is a flaw or not. Equally, you can use your own algorithm to keep additional IDs unique.

The scripts offer an option for another way around: define a string (e.g. “id.” or a namespace-prefix lookalike “ged:”) that precedes all XREF values. Done right, this makes IDs valid (letter character first!) and forms a unique group of IDs/REFs originating from the GEDCOM source. A valid reserved/special character as separator (dot or colon) makes getting rid of the prefix an easy task. Some pseudo-code to rebuild the XREF:

  • XREF = ID.split(":").pop()
  • XREF = ID.substr(idPFX.length)
  • XREF = substr(ID, match(ID,":") + 1)
  • XREF = substr(ID, length(idPFX) + 1)
  • <xsl:variable name="XREF" select="substring-after(@ID,':')"/>
  • ...
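
Put together as runnable JavaScript, the round trip might be sketched like this. The function names are hypothetical, and the validity check only covers the “letter character first” condition discussed above, not the full XML name rules.

```javascript
// Sketch: add and remove an ID prefix such as "id." or "gid:".
function addIdPrefix(xref, idPFX) {
  return idPFX + xref;                 // plain string concatenation
}

function stripIdPrefix(id, idPFX) {
  // only strip when the value really starts with the prefix
  return id.indexOf(idPFX) === 0 ? id.slice(idPFX.length) : id;
}

// Rough check: XML ID values must not start with a digit.
function startsLikeXmlName(id) {
  return /^[A-Za-z_]/.test(id);
}
```

One caveat on the separator: a colon is permitted in ID values declared through a DTD, but not in NCName-based ones (XML Schema xs:ID, and values of xml:id), so the dot is the safer choice where those apply.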

Whitespace delimiters

As already mentioned above: GEDCOM lines with leading whitespace (due to indenting) and delimiters consisting of more than exactly one whitespace character are tolerated (condensed). Empty lines are ignored. But keep in mind that such files do not conform to the GEDCOM 5.5 specification.
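
What “tolerated (condensed)” could mean in practice is sketched below: a single line is reduced to its canonical form with exactly one blank between the fields. `normalizeGedLine` is a hypothetical helper, loosely comparable to the “normalize” run of GEDnoopp.awk but not derived from it.

```javascript
// Sketch: condense a tolerantly formatted GEDCOM line to canonical spacing.
function normalizeGedLine(line) {
  var m = /^\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\w+)(?:\s+(.*))?$/.exec(line);
  if (!m) return null;                 // empty or non-matching line
  var parts = [m[1]];                  // level
  if (m[2]) parts.push(m[2]);          // optional @XREF@
  parts.push(m[3]);                    // tag
  if (m[4]) parts.push(m[4]);          // optional value, inner spacing kept
  return parts.join(" ");
}
```

Whitespace inside the value itself is left untouched, since it may be significant (for example in NOTE or CONC lines).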

Broken code

In general: there are no spec checks! Lines that don’t fit the (not so strict) patterns are ignored and reported. If a line passes as a “false positive” or fails as a “false negative”, the result may be non-well-formed XML; it is the XML parser’s task to detect this. But of course it is up to me to improve the patterns and scripts. Please contact me if the conversion results are not satisfactory!



Reverse transformation

The ged1212xml.rev.xsl reverse transformation stylesheet included in the archive isn’t a complete 1:1 return to source. Limitations are:

  1. The output encoding is Unicode (UTF-8), and the corresponding HEAD/CHAR/VERS tags are changed or omitted.
  2. Loss of whitespace is possible due to normalization or default modes of the XSLT-processor.
  3. “@”s are not transformed to double “@@”s.
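
Regarding the third limitation: GEDCOM writes a literal “@” in a line value as “@@”. A post-processing step for plain text values could be sketched like this; the function names are hypothetical, and pointers (@XREF@) and calendar escapes (@#D…@) would have to be excluded before applying it.

```javascript
// Sketch: double literal at-signs when writing GEDCOM values,
// and undo the doubling when reading them.
function escapeAtSigns(text) {
  return text.replace(/@/g, "@@");
}

function unescapeAtSigns(text) {
  return text.replace(/@@/g, "@");
}
```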

The stylesheet is parameterized to change the optional element- and attribute-names according to ged1212xml. Some standard issues and names are already covered:

  1. root and namespace (independent)
  2. elements S|SURN|SURNAME revert to slashed /surname/-part
  3. attributes ID|REF|ESC and xml:id revert to @<XREF>@ or @#<DTOKEN>@
  4. removal of id-prefixes in a colon-ized namespace style, e.g. “nn:” in ID="nn:XREF"


Yet to do?

  1. GEDCOM-standard:
    Objects within a logical record can be associated. If this need exists, the pointer record composition contains an exclamation point (!) that separates the parent record’s cross-reference ID from the specific substructure’s cross-reference ID, which is at some subordinate level to the logical record at level zero. The cross-reference ID of the substructure subordinate to a zero level record, for inter-record associations is always composed of the Record ID number and the Substructure ID number, such as @I132!1@.

    Simply copying the XREF incl. separator-mark unchanged would again make it invalid as XML attribute-values of type “ID/IDREF”.

  2. GEDCOM-standard:
    All user-defined tags, tags used that have not been defined in the GEDCOM standard, must begin with an underscore character.

    Unknown tags (elements) cannot be validated (true?). A workaround could be a special element defined to hold the user-defined GEDCOM tag as an attribute value, e.g. <NN TAG="_USER">…</NN>. Ugly side effect: user-defined tags may occur in any valid combination of GEDCOM line elements, meaning the whole code would have to be duplicated to catch the “tag-to-attribute” exception (as opposed to the “tag-to-element” default)?



Typeset in/for Verdana & Courier
2010 ff. © Stefan Unterstein