GEDCOM 1:1 to XML  
     


ged1212xml: GEDCOM one-to-one to XML

Abstract

Scripts (awk, or JavaScript for Windows Script Host) that convert GEDCOM data (GEnealogical Data COMmunication) one-to-one into well-formatted XML, which is exactly as well-formed and valid as the GEDCOM source. Even very large GED files should be convertible with only modest system resources. The target format, defined elsewhere as “GEDCOM 5.5 XML”, enables validation against the GEDCOM 5.5 standard using XML mechanisms and schemata (RelaxNG + Schematron).

Additional features include conversion of dates to ISO 8601 standard, creation of UUIDs v4 according to RFC 4122, and a check of _UID-tag-values for compliance with the PAF quasi-standard, presenting valid replacements to preserve the relevant 128-bit-value of malformed formats.


About / Motivation

The scripts (“ged1212xml”) provided on this page attempt a close (~100%) one-to-one conversion of GEDCOM-data into a very simple XML, according to a project by Chad Albers called “GEDCOM 5.5 XML” (“gedcom55XML”) …

“… all GEDCOM tags are translated into XML elements; open and closed elements delimit the data; and the elements are nested in the same way prescribed by the GEDCOM specification.” (…)
“GEDCOM 5.5 XML attempts to be a 100 percent one-to-one translation of GEDCOM 5.5 into XML; it even includes the superfluous (and empty) <TRLR/> element.” [¹]
“GEDCOM 5.5 XML differentiates itself from GedML and GeniML because it attempts to replicate the LDS's GEDCOM 5.5 standard using XML markup. Without exception, all GEDCOM 5.5 tags should correspond to XML elements with the same name; all tags should be preserved; the parent-child relationships between the tags and elements should parallel one another; and all data delimited by the elements should fall within the strict guidelines of the standard.” [²]
[¹] Chad Albers at neomantic.com/gedcom55XML
[²] cf. the README notes on the competing GedML & GeniML for his gedcom55XML approach.

Searching the web for GEDCOM & XML is rather disappointing nowadays, except for a very few sites that are surprisingly up to date and still active on this topic. Most projects of the past “GEDCOM to XML” hype made large promises about what the full range of XML capabilities would bring:

  • more semantics
  • more data-structures
  • more cross-references
  • more encodings
  • more extendable
  • more readable (source)
  • more free tools
  • more …, &c

All true, all possible. But most of these projects remained drafts and seem to have been abandoned, while even the task of a 1:1 translation isn’t really completed.

Installing and running “big” genealogical software (mostly shareware) simply to import GEDCOM (with hidden loss of data?) and export to some XML (more data loss?) is a fussy and risky way to make already collected genealogical data available for further XML processing. Acting as a filter (import to export) is not the task such software is made for. The background history of Tim Forsythe’s VGed (formerly “GEDCOM Validator”, now VGedX) tells more of the whole story; see the blog entry “Pandora’s Box”.

Technologies go by, data stay. Data + format (a generic markup that attaches computable semantics to data) represent the really hard effort humans have put into research, structure and validity. They must be preserved under all circumstances of changing technology. Loss of data, de-structuring (which makes data less readable and usable for humans and machines), and opaque formats are the worst faults. They result in data cemeteries that have to be touched, checked and rekeyed by humans again and again. Content without open-standards markup is dead and uncomputable until reanimated by human review.

So: why this fallback to a seemingly simplistic, half buried approach? To a project that Michael H. Kay began with GedML as the pioneer he is in many respects, and that everyone everywhere refers to? What are the remaining benefits?

First of all it’s a simple format, easy to create, easy to control, and it’s not new. Efforts made before (XSLT stylesheets) may be reused with only minor changes. The roots in the established GEDCOM 5.5 standard, which XML never succeeded in replacing or inheriting, aren’t cut off. Getting your hands on the genealogical data is straightforward, and the next step of processing can already be done with XML/XSL-tools, e.g. transforming it into more ambitious XML dialects. For this, see Bill Kinnersley’s GEDC documentation, his XML-based standard and application, which are worth reading although they unfortunately use an intermediate XML-format of their own.

Last but not least (see Chad Albers’ approach with RelaxNG/Schematron), the structure and data-types of a GED-file can be validated against the GEDCOM 5.5 standard using XML-mechanisms, provided they remain nearly unchanged. The scripts aim to be a possible replacement for the first step in his workflow, which continues to XSL-FO formatting-objects and PDF.

Interested in testing or even using this? Feel free …
… to let me know about the good, the bad, and the ugly things too.

Download

History

  1. [2008-09-18] – pre-release (testers)
  2. [2008-10-01] – initial release (public)
  3. [2008-10-11]
    • option added to distinguish slashed from tagged surname-parts by node-naming
    • XML-output additionally formatted with blank lines for easier “visual parsing”
    • GEDnoopp.awk added to archive to (un-)format GED-files likewise, as shown below
  4. [2008-11-20]
    • ged1212xml.rev.xsl XSLT-stylesheet added for rudimentary reverse transformation
  5. [2011-01-20]
    • fixed a logical ambiguity in evaluating the surname-script-option
    • added “Universally Unique IDentifiers” v4 (UUID) conforming to RFC 4122.
    • added (pseudo-xmlns) prefixes for PIs too, though they are functionally meaningless.
  6. [2011-01-22]
    • fixed XML/namespace-spec violations: NMTOKEN vs NCName
    • (NCName represents XML “non-colonized” Names – a no-colon constraint)
    • changed PI-target-prefixes (now avoid colons!) from default to an option
    • renamed some options concerning namespaces/prefixes
  7. [2011-02-02]
    • complete renewal of processing-instruction handling:
    • predefined attribute- and function-PI-styles; “prefixes” replaced by name-config; etc.
    • ged1212xml.PI.xsl added to archive: a stylesheet-skeleton to tune PIs (or more)
  8. [2013-02-03]
    • added processing-instruction for _UID-tag-lines:
    • computed vs given UUID-value/format, default check for PAF-compatibility
    • GED_UID.fix.awk added to archive: stand-alone check or fix (transform/replace) value and format of _UID-tag-lines in GEDCOM-files
  9. [2013-06-03]
    • added UURN, modified XURN output format
    • (of UUIDs for _UID-tag or processing-instructions)

Archive Contents

ged1212xml.awk (preview: ged1212xml.awk.htm)
awk-script to translate (hopefully) any GEDCOM file one-to-one to XML. It is “stand alone”, i.e. ANSEL-to-Entity routines are already included.
ANSELentify.awk
ANSEL-to-Entity as an awk-script of its own. Do not run it before ged1212xml! All the entities it inserts would be deactivated by the ampersand-translation to &amp;, which is not what is intended.
ANSELentify.sed
ANSEL-to-Entity as a sed-script. Just another offspring, for the “Stream EDitor”.
ged1212xml.wsf (preview: ged1212xml.wsf.htm)
Javascript to be run in the “Windows Script Host” (WSH) engine. Unlike the awk-script above, this is not “stand alone”, but imports (i.e. requires) ANSELentify.js at runtime.
ANSELentify.js
ANSEL-to-Entity code imported + executed by ged1212xml.wsf.
GEDnoopp.awk
“GEDCOM normalize or pretty print”: formats a GED-file with indents and blank lines (to group records visually) in a first run; a second run “normalizes” the file (removes the formatting, as the standard requires) to restore its validity.
GED_UID.fix.awk (preview: GED_UID.fix.awk.htm; → UUID Excurses)
Reads GEDCOM files, checks the _UID-tags, presents replacements in a target-format and, at the user’s choice, replaces them, so that more imports become possible without UUIDs being rejected. It aims to transform UUID-standards and variants (e.g. checksum, mixed-case, grouping, URN or GUID) into each other, and to check & fix some deviating or flawed forms into well-formed ones (cf. further readings), as long as a 128-bit-value can be saved.
ged1212xml.rev.xsl (preview: ged1212xml.rev.xsl.htm)
Reverse Transformation Stylesheet (XML back to GEDCOM). Not quite 1:1, but usable “cum grano salis”. For limitations see remarks below.
ged1212xml.PI.xsl
A stylesheet-skeleton to adjust the style of PIs in a generated XML-file. All other nodes are passed through. It’s meant as a starting point for users, when the scripts’ PI-configuration is not satisfactory. Maybe handy for other kinds of transformation too.
ANSELentify.* files heavily depend on (conversions of) “ans2uni.con” (ZIP), and I owe many thanks to “Heiner Eichmann’s GEDCOM 5.5 Sample Page: ANSEL to Unicode conversion” and his “ANSEL to Unicode Conversion Tool”.
The UUID JS-code is derived and varied from Robert Kieffer’s “Math.uuid.js” (JS) at his Broofa blog page. Thanks! It inspired my awk-code too.

Preview GEDCOM-, Script-, XML-Sources

The source/code-previews are simple HTML-exports from the Scintilla Text-Editor SciTE.

Siebold’s GEDCOM is indented and foldable just for readability. Indented lines – any leading whitespace! – and empty lines do not conform to the GEDCOM-specification, but ged1212xml tolerates and ignores it.

  1. ged1212xml.awk.htm – awk-code
  2. ged1212xml.wsf.htm – JavaScript/WSH-code
  3. ged1212xml.rev.xsl.htm – XSLT-code for reverse transformation
  4. GED_UID.fix.awk.htm – awk-code for _UID checks and replacements
  5. siebold.GED.htm – Philipp Franz von Siebold’s GEDCOM formatted by GEDnoopp
  6. siebold.GED.xml.htm – Philipp Franz von Siebold’s gedcom55XML made by ged1212xml
  7. Ged2HTML Web-Presentation – Philipp Franz von Siebold’s lineage made by Ged2HTML

Get precompiled Win32 awk binaries/variants

  • gawk – GnuWin32.sourceforge.net
  • mawk – GnuWin32.sourceforge.net
  • nawk – GnuWin32.sourceforge.net

I am no real fan of utilities and runtimes with a high system impact (like Java or DotNET), nor of programming-languages requiring big downloads (like Perl, Python, Ruby, etc.). They may however be better suited to elegant solutions.

So I tried with “awk” and with “Windows Script Host” (WSH + Javascript). The latter is available on all MS-Windows, “awk” is standard on all Linuxes/Unixes and for Win32-users a download of negligible size and just unpacking a single binary/executable file (no install, no setup, no impact).

Get more GEDCOM files/infos


Usage of ged1212xml.awk

ged1212xml.awk
USAGE: [g|m|n]awk [-v var=value [-v …]] -f ged1212xml.awk 
       [<]infile.GED [>outfile.XML] [2>error.LOG]
NOTES: v-Options are required to be set before f-Options
       NCName represents XML "non-colonized" Names (no-colon constraint)
OPTIONS:
  -v ANSEL=0|1
      start "ANSEL to Entity"-mode before 1st occurrence of +n CHAR ANSEL
  -v nsPFX=""|<ncname>
      xml namespace prefix, requires setting of nsURI too, default=none
  -v nsURI=""|<uri>
      xml namespace URI for xmlns[:nsPFX]="…", default=none
  -v xmlEnc="iso-8859-1"|<encoding>
      replace xml declaration's default <?xml … encoding="iso-8859-1"?>
  -v xmlStyle=""|<file.css|file.xsl>
      insert processing-instruction <?xml-stylesheet href="…"?>, default=none
  -v xmlRoot="GED"|<ncname>
      replace root-element's default tag-name "GED"
  -v xmlID="ID"|"xml:id"|<ncname>
      replace attribute-name's default "ID" for GEDCOM's @<XREF>@s
  -v xmlIDREF="REF"|<ncname>
      replace attribute-name's default "REF" for GEDCOM's @<XREF>@s
  -v xmlDTD=""|<file.dtd>
      insert doctype-definition <!DOCTYPE … SYSTEM "…">, default=none
  -v xsiXSD=""|<file.xsd>
      insert root's xsi:XMLSchema-instance-location-definition, default=none
  -v idPFX=""|"id."|"ged-"|<ncname>
      ID-prefix for valid xmlID/REF-values (NCNames), default=none
      ID-prefix == string-additive, don't confuse it with namespace-prefixes!
  -v escDATE=""|"ESC"|<ncname>
      given name ("ESC" preferred, default=none=noop) 
      moves @#<DATE_CALENDAR_ESCAPE>@s into attributes
  -v surNAME="SURN"|"S"|<ncname>|<!ncname>
      alter node-name ("S" preferred, default="SURN") for slashed surname-part
      to avoid double SURN-subnodes in an extended NAME-node/structure
      a non-ncname char/string prevents slash-replacement at all
  -v piSTY=""|"attr"|"func"|"void"|"nopi"
      predefined attribute- or function-style for processing-instructions
      default="void" ~ empty for user-defined styles, otherwise plain style
      a non-defined value (like "nopi") prevents PI-generation at all
  -v piNCN=""|<ncname>
      PI-ncname for processing-instruction-targetnames, default=none
  -v datePI="DATE"|<ncname>|<!ncname>
      processing-instruction-targetname, default="DATE" becomes <?DATE ...?>
      date-format converted (if possible) to YYYY-MM-DD according to ISO 8601
      a non-ncname char/string prevents DATE PI-generation
  -v uuidPI="_UID"|"GUID"|"UUID"|"XUID"|"UURN"|"XURN"|<ncname>|<!ncname>
      processing-instruction-targetname, default="UUID" becomes <?UUID ...?>
      Universally Unique IDentifiers v4 (pseudo-random) according to RFC 4122
      a standard-name is default-format for uuidSTY-option
      a non-ncname char/string prevents UUID PI-generation
  -v _uidPI="_UID"|"GUID"|"UUID"|"XUID"|"UURN"|"XURN"|<ncname>|<!ncname>
      processing-instruction-targetname, default="_UID" becomes <?_UID_n ...?>
      checks _UID-tag, default according to PAF-style UUID+Checksum (n=0|1|X)
      a standard-name is default-format for _uidSTY-option
      a non-ncname char/string prevents _UID PI-generation
  -v uuidSTY=<uuidPI-standard-targetname-format>|"UUID"|<targetformat>
  -v _uidSTY=<_uidPI-standard-targetname-format>|"_UID"|<targetformat>
      "_UID" XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXCCCC  (_uidSTY-default)
             PAF-GEDCOM-_UID 16+2 bytes, 36 chars uppercase hexdigit with checksum
      "UUID" xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  (uuidSTY-default)
             RFC-4122-UUIDv4 16 bytes, 32+4 chars lowercase hexdigit hyphen-grouped
      "GUID" {XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}
             embraced {UUIDv4} 16 bytes, 32+6 chars uppercase hexdigit hyphen-grouped
      "XUID" {XxXXxXXx-xxxX-xxxX-xXXX-xxxxXXxxXxXx}cccc
             extended mixedcase and -style {GUIDv4}, 4-hexdigit checksum appended
      "UURN" urn:uuid:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
             prefixed lowercase "urn:uuid:UUIDv4" (RFC-4122, UUID as URN)
      "XURN" urn:uuid:xXxxXxXX-xxxX-xXxX-xXxX-xXXXXxxXXxXx+cccc
             extended mixedcase "urn:uuid:UUIDv4+checksum" (RFCs 2141+3986+4122)
       else: XXxXXXxX-Xxxx-xXXX-XXxX-xxxXxXxxXXXx cccc
             combined mixedcase UUIDv4 with 4-hexdigit checksum (set apart)
  -v uuidSEED=<integer>
      default=srand()
-v RS="\r"
Two GED-files in the “GEDCOM 5.5 Torture Test” package end lines in a single carriage return. The option sets awk’s “input Record Separator” (a builtin-variable) to this variant of linebreaks.
example: cromwell.cfg.awk
BEGIN {
    ANSEL   = 1 ;
    nsPFX   = "g" ;    # w/o colon!
    nsURI   = "urn:xmlns:gedcom55XML" ;
    xmlID   = "xml:id" ;
    idPFX   = "gid." ; # w/o colon!
    surNAME = "S" ;
    escDATE = "ESC" ;
}
“awk” allows multiple f-options, so users can collect all v-options specific to a project in a configuration-file. The following is a deliberately overloaded example for the sake of demonstration; it shows a possible result for an INDI-node/record (as exported by “Heredis”):
awk -f cromwell.cfg.awk -f ged1212xml.awk cromwell.ged > cromwell.ged.xml
0 HEAD
  1 SOUR HEREDIS 7 PC
...
0 @221I@ INDI
  1 NAME Sir Oliver/CROMWELL/
    2 GIVN Sir Oliver
    2 SURN CROMWELL
  1 SEX M
  1 BIRT
    2 DATE @#DJULIAN@ 1563
  1 DEAT
    2 DATE 1655
  1 FAMS @317U@
  1 FAMS @227U@
  1 FAMC @204U@
...

<g:GED xmlns:g="urn:xmlns:gedcom55XML">
...
<g:INDI xml:id="gid.221I">
  <g:NAME>Sir Oliver<g:S>CROMWELL</g:S>
    <g:GIVN>Sir Oliver</g:GIVN>
    <g:SURN>CROMWELL</g:SURN>
  </g:NAME>
  <g:SEX>M</g:SEX>
  <g:BIRT>
    <g:DATE ESC="DJULIAN">1563</g:DATE><?DATE 1563-00-00?>
  </g:BIRT>
  <g:DEAT>
    <g:DATE>1655</g:DATE><?DATE 1655-00-00?>
  </g:DEAT>
  <g:FAMS REF="gid.317U"/>
  <g:FAMS REF="gid.227U"/>
  <g:FAMC REF="gid.204U"/>
</g:INDI><?UUID 82ce4049-26f0-4e9a-af0e-ca94356ff680?>
...
The XML-structure is namespaced and prefixed. XREFs are turned into valid ID/IDREF-values in a similar way (the gid.-prefix fakes a namespace just so that the value starts with a letter). The attribute-name xml:id tells a capable parser that its value is of type “ID”, even without a DTD/Schema. The date-calendar-escape is moved into an ESC-attribute. As a side-effect this enables a transformation of the date to ISO standard format, appended as a processing-instruction. The slashed surname-part (now a <g:S>-node) differs from the tagged surname-part (the <g:SURN>-node). A <?UUID…?>-PI appended after the closing tag of the <g:INDI>-node waits to be put to use.

Usage of ged1212xml.wsf (JavaScript with WSH)

ged1212xml.wsf (imports ANSELentify.js)
USAGE: cscript //nologo ged1212xml.wsf [/name:value […]] 
       [/ged:infile.ged] [/xml:outfile.xml] [/log:error.log]
       [<stdin.ged] [>stdout.xml] [2>stderr.log]
NOTES: double slashes for cscript-arguments, e.g. //nologo, 
       single slashes for wsf-arguments, as below
       NCName represents XML "non-colonized" Names  (no-colon constraint)
OPTIONS:
  FILES
    /ged:<file.ged>
      GEDCOM input-filename, default=STDIN
    /xml:<file.xml>
      XML output-filename, default=STDOUT
    /log:<file.log>
      Logging output-filename, default=STDERR
  GED-INPUT-ENCODING-MODE
    /ans:true
      start "ANSEL to Entity"-mode before 1st occurrence of +n CHAR ANSEL
  XML-OUTPUT
    /nspfx:<ncname>
      xml namespace prefix, requires setting of /nsuri:<URI> too, default=none
    /nsuri:<URI>
      xml namespace URI for xmlns[:nsPFX]="…", default=none
    /enc:"iso-8859-1"|<encoding>
      replace xml declaration's default <?xml … encoding="iso-8859-1" ?>
    /sty:<file.css|file.xsl>
      insert processing-instruction <?xml-stylesheet href="…"?>, default=none
    /root:"GED"|<ncname>
      replace root-element's default tag-name "GED"
    /id:"ID"|"xml:id"|<ncname>
      replace attribute-name's default "ID" for GEDCOM's @<XREF>@s
    /ref:"REF"|<ncname>
      replace attribute-name's default "REF" for GEDCOM's @<XREF>@s
    /dtd:""|<file.dtd>
      insert doctype-definition <!DOCTYPE … SYSTEM "…">, default=none
    /xsd:""|<file.xsd>
      insert root's xsi:XMLSchema-instance-location-definition, default=none
    /idpfx:""|"id."|"ged-"|<ncname>
      ID-prefix for valid xmlID/REF-values (NCNames), default=none
      ID-prefix == string-additive, don't confuse it with namespace-prefixes!
    /esc:""|"ESC"|<ncname>
      given name ("ESC" preferred, default=none=noop) 
      moves @#<DATE_CALENDAR_ESCAPE>@s into attributes
    /sur:"SURN"|"S"|<ncname>|<!ncname>
      alter node-name ("S" preferred, default="SURN") for slashed surname-part
      to avoid double SURN-subnodes in an extended NAME-node/structure
      a non-ncname char/string prevents slash-replacement at all
    /pi:""|"attr"|"func"|"void"|"nopi"
      predefined attribute- or function-style for processing-instructions
      default="void" ~ empty for user-defined styles, otherwise plain style
      a non-defined value (like "nopi") prevents PI-generation at all
    /pincn:""|<ncname>
      PI-ncname for processing-instruction-targetnames, default=none
    /datepi:"DATE"|<ncname>|<!ncname>
      processing-instruction-targetname, default="DATE" becomes <?DATE ...?>
      date-format converted (if possible) to YYYY-MM-DD according to ISO 8601
      a non-ncname char/string prevents DATE PI-generation
    /uuidpi:"_UID"|"GUID"|"UUID"|"XUID"|"UURN"|"XURN"|<ncname>|<!ncname>
      processing-instruction-targetname, default="UUID" becomes <?UUID ...?>
      Universally Unique IDentifiers v4 (pseudo-random) according to RFC 4122
      a standard-name is default-format for uuidsty-option
      a non-ncname char/string prevents UUID PI-generation
    /_uidpi:"_UID"|"GUID"|"UUID"|"XUID"|"UURN"|"XURN"|<ncname>|<!ncname>
      processing-instruction-targetname, default="_UID" becomes <?_UID_n ...?>
      checks _UID-tag, default according to PAF-style UUID+Checksum (n=0|1|X)
      a standard-name is default-format for _uidsty-option
      a non-ncname char/string prevents _UID PI-generation
    /uuidsty:<uuidPI-standard-targetname-format>|"UUID"|<targetformat>
    /_uidsty:<_uidPI-standard-targetname-format>|"_UID"|<targetformat>
      "_UID" XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXCCCC (_uidsty-default)
             PAF-GEDCOM-_UID 16+2 bytes, 36 chars uppercase hexdigit with checksum
      "UUID" xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx (uuidsty-default)
             RFC-4122-UUIDv4 16 bytes, 32+4 chars lowercase hexdigit hyphen-grouped
      "GUID" {XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}
             embraced {UUIDv4} 16 bytes, 32+6 chars uppercase hexdigit hyphen-grouped
      "XUID" {xxXXxxXx-xXXx-xXXX-XXXx-XXXxxxxxxxxX}cccc
             extended mixedcase and -style {GUIDv4}, 4-hexdigit checksum appended
      "UURN" urn:uuid:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
             prefixed lowercase "urn:uuid:UUIDv4" (RFC-4122, UUID as URN)
      "XURN" urn:uuid:xXXXxxXX-xxxx-XxXX-xxXX-xXxxXXXxxxxx+cccc
             extended mixedcase "urn:uuid:UUIDv4+checksum" (RFCs 2141+3986+4122)
       else: XXXxxxXX-XXXx-xXXX-xxxX-XXxXXxxxXXXX cccc
             combined mixedcase UUIDv4 with 4-hexdigit checksum (set apart)
    /seed:noop
      built-in random() not seedable in ECMAScript

Special behaviour and features

ANSEL-to-Entity

About ANSEL-processing: by default it switches on only with the first occurrence of a GEDCOM header-line "+n CHAR ANSEL", and possibly off again if a similar line announces another encoding. For files where ANSEL characters already occur earlier in the header text, the scripts provide an option to handle ANSEL from the very beginning.
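
The switching rule can be sketched in a few lines of JavaScript. This is only an illustration of the behaviour described above, not code from ged1212xml.wsf; the function name anselModePerLine and its fromStart parameter are made up for the example (fromStart plays the role of the ANSEL=1 or /ans:true option).

```javascript
// Hypothetical sketch: for every input line, report whether the
// "ANSEL to Entity" mode would be active while that line is processed.
function anselModePerLine(lines, fromStart) {
    var on = Boolean(fromStart);   // option: care for ANSEL from the very beginning
    var flags = [];
    for (var i = 0; i < lines.length; i++) {
        // a "+n CHAR <encoding>" line switches the mode on (ANSEL) or off (anything else)
        var m = /^\s*\d+\s+CHAR\s+(\S+)/.exec(lines[i]);
        if (m) {
            on = (m[1].toUpperCase() === "ANSEL");
        }
        flags.push(on);
    }
    return flags;
}
```

With the option off, header lines before "1 CHAR ANSEL" stay untouched; with it on, they are treated as ANSEL too.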

Processing-Instructions

This is experimental. Think of gedcom55XML as a minimal prescription that can be enhanced with additional nodes (elements and attributes in other namespaces) or by-side-instructions.

Analysing or transforming text (text-node-values) with XSLT presents difficulties. In places where the scripts can do better, a processing-instruction (“PI”) carrying additional results may be named and inserted immediately after the closing tag of an element. PIs are not defined in a DTD or Schema and do not violate any validation, but they can easily be accessed with XSLT.

PI: DATE

Currently this is done if a valid English date form is available in "+n DATE <DATE_EXACT>" lines. In this case the scripts append an ISO-form of the date as …</DATE><?DATE yyyy-mm-dd?>.

In other words, as a general rule: a PI should contain just another representation – e.g. prepared according to standards and usability – of the preceding element’s value.
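
As an illustration of such a conversion, here is a minimal JavaScript sketch. It is not the code used by the scripts; the helper names gedDateToIso and pad are hypothetical, and padding missing parts with "00" is an assumption taken from the Cromwell example in the usage section (1563 becomes 1563-00-00). Anything that is not a plain English day/month/year form yields null, i.e. no PI.

```javascript
// Hypothetical helper, not taken from ged1212xml.
var MONTHS = { JAN: "01", FEB: "02", MAR: "03", APR: "04", MAY: "05", JUN: "06",
               JUL: "07", AUG: "08", SEP: "09", OCT: "10", NOV: "11", DEC: "12" };

function pad(s, n) {
    s = String(s);
    while (s.length < n) { s = "0" + s; }
    return s;
}

// "12 JAN 1563" -> "1563-01-12"; partial dates are padded with "00" (assumption).
function gedDateToIso(value) {
    // optional day, optional three-letter month, mandatory year
    var m = /^\s*(?:(\d{1,2})\s+)?(?:([A-Za-z]{3})\s+)?(\d{1,4})\s*$/.exec(value);
    if (!m) { return null; }
    var day = m[1], mon = m[2], year = m[3];
    if (day && !mon) { return null; }                       // a day needs a month
    if (day && (Number(day) < 1 || Number(day) > 31)) { return null; }
    var mm = "00";
    if (mon) {
        mon = mon.toUpperCase();
        if (!Object.prototype.hasOwnProperty.call(MONTHS, mon)) { return null; }
        mm = MONTHS[mon];
    }
    return pad(year, 4) + "-" + mm + "-" + (day ? pad(day, 2) : "00");
}
```

Qualified forms such as "ABT 1563" or date ranges are deliberately rejected here; the preceding element keeps its original value in any case.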

PI: UUID

For whatever it may be useful for (“A UUID is 128 bits long, can guarantee uniqueness across space and time, and requires no central registration process.”): all level zero nodes/records "0 @<XREF>@ TOKEN" get a <?UUID xxxxxxxx-xxxx-4xxx-Yxxx-xxxxxxxxxxxx?> appended by default after the closing tag. Some genealogical programs already generate and use non-standard but lookalike UUIDs in non-standard user-defined GEDCOM-tags like "_UID".

Diverging from the RFC 4122 UUID-spec – “The hexadecimal values "a" through "f" are output as lower case characters and are case insensitive on input.” – the output of randomly mixed-case letters enhances the syntactic scope and thus raises uniqueness where sensitive syntax matters (e.g. xml:id). Users can normalize UUIDs to lower/upper case or keep the difference.
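
For readers who want to see what a version-4 UUID looks like at the character level, a minimal sketch follows. It is written in the spirit of the Math.uuid.js approach credited above but is not copied from it or from the scripts; the function name uuidV4 and the injectable rand parameter are assumptions for the example. It emits lowercase digits as RFC 4122 prescribes (not the mixed case described above) and relies on Math.random, which is not a cryptographic source.

```javascript
// Hypothetical sketch of a pseudo-random RFC 4122 version-4 UUID.
function uuidV4(rand) {
    rand = rand || Math.random;           // injectable for testing (assumption)
    var hex = "0123456789abcdef";
    var out = "";
    for (var i = 0; i < 36; i++) {
        if (i === 8 || i === 13 || i === 18 || i === 23) {
            out += "-";                   // group separators
        } else if (i === 14) {
            out += "4";                   // version nibble
        } else {
            var r = Math.floor(rand() * 16);
            if (i === 19) {
                r = (r & 0x3) | 0x8;      // variant bits 10xx -> 8, 9, a or b
            }
            out += hex.charAt(r);
        }
    }
    return out;
}
```

The fixed "4" and the restricted digit at position 19 correspond to the 4 and Y in the pattern xxxxxxxx-xxxx-4xxx-Yxxx-xxxxxxxxxxxx.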

Style of PI
<?target value?>
General format of processing-instructions; target must be a “non-colonized” name (NCName)
<?DATE yyyy-mm-dd?>
<?UUID xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx?>
BEGIN {
    piSTY   = "void" ;  # all settings are default too
    datePI  = "DATE" ;
    uuidPI  = "UUID" ;
    uuidSTY = "UUID" ;  # RFC-4122-UUIDv4
    _uidPI  = "_UID" ;
    _uidSTY = "_UID" ;  # PAF-GEDCOM-_UID
}
The default settings (declared here only for demonstration) make all PIs “plain styled”, i.e. the PI-target names the PI and the PI-value is just the “flat” generated value. Any undefined piSTY-token disables all PIs; a non-NCName in datePI, uuidPI or _uidPI disables only that PI. The variables uuidSTY and _uidSTY mirror their corresponding uuidPI and _uidPI as long as these contain valid/standard output format-tokens ("_UID" | "GUID" | "UUID" | "XUID" | "UURN" | "XURN"); otherwise they default to "UUID" and "_UID", unless explicitly set by the user.
awk -v piSTY=nopi ...
or  -f config.awk ...
BEGIN {
    piSTY  = 0 ; # disables all PIs
    # or some/one of … selected PIs
    # datePI = 0 ;
    # uuidPI = 0 ;
    # _uidPI = 0 ;
}
<?GED DATE="…value…"?>
<?GED UUID="…value…"?>
BEGIN {
    piSTY  = "attr" ;
}
PI-target defaults to XML-Root-Element (name’s local part); PI-value uses predefined attribute-style/names
<?GedCom DATE("…value…");?>
<?GedCom UUID("…value…");?>
BEGIN {
    piSTY    = "func" ;
    xmlRoot  = "GedCom" ;
}
PI-target mirrors xmlRoot; PI-value uses predefined function-style/names
<?app isoDate("…value…");?>
<?app uuid_v4("…value…");?>
BEGIN {
    piNCN  = "app" ;  # target ~ NCName w/o colons required!
    piSTY  = "func" ;
    datePI = "isoDate" ;
    uuidPI = "uuid_v4" ;
}
PI-target set to an application; PI-value uses predefined function-style and assumed function-names
<?app ged_date("I83336","DEAT","…value…");?>
<?app ged_uuid("I83336","…value…");?>

sabcmd ged1212xml.PI.xsl the.GED.xml

XSLT can “Pimp My PI”, e.g. include related data (preceding/parent ID, parent node name) as additional function arguments, and of course add many more PIs, nodes and data, or transform everything into something completely different.
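
The predefined styles above can be summarised in a small sketch. It only mirrors the sample output shown in this section and is not the scripts’ implementation; formatPI and the shape of its opts object are hypothetical, and ignoring piNCN in the plain (“void”) style is an assumption.

```javascript
// Hypothetical sketch: build a PI string for a given style.
// opts mirrors the option names piSTY, piNCN and xmlRoot.
function formatPI(opts, name, value) {
    var sty = (opts.piSTY === undefined || opts.piSTY === "") ? "void" : String(opts.piSTY);
    var target = opts.piNCN || opts.xmlRoot || "GED";   // PI-target for attr/func styles
    switch (sty) {
        case "void":                                     // plain style: <?DATE value?>
            return "<?" + name + " " + value + "?>";
        case "attr":                                     // <?GED DATE="value"?>
            return "<?" + target + " " + name + '="' + value + '"?>';
        case "func":                                     // <?GED DATE("value");?>
            return "<?" + target + " " + name + '("' + value + '");?>';
        default:                                         // any undefined token: no PI at all
            return "";
    }
}
```

For example, formatPI({ piSTY: "func", piNCN: "app" }, "isoDate", "1563-00-00") produces the <?app isoDate("…");?> form shown above.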
Style of UUID
<?_UID XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXCCCC?>
<?GUID {XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}?>
<?UUID xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx?>
<?XUID {xxXXXxXX-XXxx-xxxX-xXxX-XXxXXXXXXXXX}cccc?>
<?UURN urn:uuid:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx?>
<?XURN urn:uuid:xXXXxXXx-xXXx-xxXx-xXXx-xXXxXXXxxXxX+cccc?>
fallback-format: XXxxXXxX-XXXx-XXXX-XXXx-XXxxxxXxxxXx cccc
New (self-) generated UUIDs are always of RFC-4122 random type v4, independent of a grouped or straight format. Divergent from standard, the generator outputs randomly mixed-case letters. The non-standard XUID-, XURN- and fallback-targetformats (if user’s choice of “format” is an undefined token) are case-preserving, but easy to convert.
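
Since all of these formats carry the same 128-bit value, converting between the checksum-free ones is pure string work. The following sketch covers only the UUID, GUID and UURN styles; the function name formatUuid and its hex32 input are assumptions for illustration, not part of the scripts.

```javascript
// Hypothetical sketch: render 32 hex digits in one of the checksum-free styles.
function formatUuid(hex32, style) {
    if (!/^[0-9A-Fa-f]{32}$/.test(hex32)) { return null; }
    // 8-4-4-4-12 grouping
    var grouped = [hex32.slice(0, 8), hex32.slice(8, 12), hex32.slice(12, 16),
                   hex32.slice(16, 20), hex32.slice(20)].join("-");
    switch (style) {
        case "UUID": return grouped.toLowerCase();
        case "GUID": return "{" + grouped.toUpperCase() + "}";
        case "UURN": return "urn:uuid:" + grouped.toLowerCase();
        default:     return null;   // checksum styles are not covered by this sketch
    }
}
```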
PI: _UID

The _UID-check appends a (kind of) boolean suffix to the PI-targetname, so that different types of result (full, partial or copy/no replacement of value and format) can be distinguished. The suffix refers to the rating of the given _UID-value: [1=true] is flawless and identical to the targeted format, [X=eXchange, i.e. true in a sense] is malformed but recoverable (different but transformable), [0=false] is completely broken or strange.

<?_UID_0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXCCCC?>
<?_UID_X XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXCCCC?>
<?_UID_1 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXCCCC?>
<targetname> + suffix
  • _0 ("false") given _UID seems broken (no 128-bit value to extract), targeted _UID has new value and format. The PI-value is a full replacement.
  • _X ("eXchange") given _UID’s 128-bit value is preserved and transformed into the target-format (maybe just a change of lettercase or a new/valid checksum). The PI-value is a partial replacement.
  • _1 ("true") given and targeted _UID value and format are identical. The PI-value is a copy (no replacement).

Given valid source-values (and their notation-fragments) take precedence over generated values. Combined with the case-preserving XUID- and fallback-targetformats, a mixed-case output may result from the source (copy of case) or from the generator (randomly mixed case). But as long as vendors do not publish their algorithm of creation, mixed-case source-UUIDs are not really comparable at string-level. The patterns are most likely always different, so changing them is recommended. Beyond that, the lettercase is not recoverable after a normalization or change of format.

All PI-values resulting from the _UID-check are complete (“full”) values in themselves. Simply replacing all _UID-values with PI-values during an XSLT transformation should do no harm. The “GED_UID.fix.awk” script (see archive and “UUID Excurses” below) can do this check and replacement at the source level of the GEDCOM-file.
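
To make the three ratings tangible, here is a rough JavaScript sketch of such a check for the default PAF-style target. It is not the logic of ged1212xml or GED_UID.fix.awk. The checksum follows the description commonly circulated for PAF-style _UID values (first byte: sum of the 16 value bytes; second byte: sum of the running sums; both modulo 256), which is an assumption here, and the names pafChecksum, hex2 and checkUid are hypothetical. Unlike the scripts, the sketch does not generate a new value in the _0 case.

```javascript
// Hypothetical sketch of a PAF-style _UID check.
function hex2(n) {
    var s = n.toString(16).toUpperCase();
    return s.length < 2 ? "0" + s : s;
}

// Assumed checksum: a = byte sum, b = sum of the running values of a (mod 256).
function pafChecksum(hex32) {
    var a = 0, b = 0;
    for (var i = 0; i < 32; i += 2) {
        a = (a + parseInt(hex32.slice(i, i + 2), 16)) & 0xff;
        b = (b + a) & 0xff;
    }
    return hex2(a) + hex2(b);
}

// Returns the suffix (_0, _X, _1) and the value in the "_UID" target-format.
function checkUid(given) {
    // strip URN prefix, braces, hyphens, plus signs and whitespace
    var s = String(given).replace(/^urn:uuid:/i, "").replace(/[{}\-\s+]/g, "");
    if (!/^[0-9A-Fa-f]{32}([0-9A-Fa-f]{4})?$/.test(s)) {
        return { suffix: "_0", value: null };      // no 128-bit value to extract
    }
    var hex = s.slice(0, 32).toUpperCase();
    var target = hex + pafChecksum(hex);           // 32 + 4 uppercase hex digits
    return { suffix: (given === target ? "_1" : "_X"), value: target };
}
```

A GUID-style input such as {01010101-0101-0101-0101-010101010101} is rated _X: the 128-bit value survives, while format and checksum are rebuilt.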

Date calendar escapes

DATE-lines again, but no decision yet: what to do with the “date-calendar-escape” sequences? Do they need special treatment, or are they just like any other value? By analogy with IDs and REFs, a sequence enclosed in “@”

/@#D(GREGORIAN|JULIAN|HEBREW|FRENCH R|ROMAN|UNKNOWN)@/
regular expression

… and having a comparable meta-aspect should be moved into an attribute, e.g. <DATE ESC="DTOKEN">. The scripts provide an option for testing.

BTW: the possible whitespace inside the "FRENCH R"-token/pattern (French Revolutionary Calendar) is annoying. Under certain circumstances the sequence is split into separate fields and requires yet another exception to be handled.
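
A sketch of the escape handling, using the regular expression given above, might look as follows. It is an illustration only: dateElement is a hypothetical name, the value is assumed to be XML-escaped already, and the attribute value keeps the leading "D" as in the Cromwell example (ESC="DJULIAN"). Matching on the whole value rather than on whitespace-separated fields sidesteps the "FRENCH R" problem.

```javascript
// Hypothetical sketch: move a date-calendar-escape into an attribute.
var ESC_RE = /^@#D(GREGORIAN|JULIAN|HEBREW|FRENCH R|ROMAN|UNKNOWN)@\s*/;

function dateElement(value, escName) {
    var m = ESC_RE.exec(value);
    if (!m || !escName) {
        return "<DATE>" + value + "</DATE>";     // no escape, or option not set (noop)
    }
    return "<DATE " + escName + '="D' + m[1] + '">' +
           value.slice(m[0].length) + "</DATE>";
}
```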

Slashes to “S” vs “SURN” etc

Slashes delimit and mark the surname-part (like /surname/) of a NAME-structure. By default they are converted to a SURN-subnode, although there is no hint or convention for the node-naming. A problem may occur if a SURN-tag is present too (according to the standard it is optional and, note, additional) and therefore doubles the SURN-node. To avoid this, or to prevent slash-replacement altogether, the node-name can be altered by an option, preferably to the “S” of Kay’s GedML. A char/string that is not a non-colonized name (not of type “NCName”) switches off any replacement and leaves the slashed surname-part unchanged.
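
The replacement itself is a one-line substitution, sketched below. This is not the scripts’ code: nameElement is a hypothetical name, the NCName test is an ASCII-only approximation, the value is assumed to be XML-escaped already, and the subordinate GIVN/SURN nodes that would follow inside the NAME-node are left out.

```javascript
// Hypothetical sketch: wrap the slashed surname-part of a NAME value.
function nameElement(value, surName) {
    // a string that is not an (ASCII) NCName switches the replacement off
    if (!/^[A-Za-z_][A-Za-z0-9._-]*$/.test(surName)) {
        return "<NAME>" + value + "</NAME>";
    }
    var body = value.replace(/\/([^\/]*)\//,
                             "<" + surName + ">$1</" + surName + ">");
    return "<NAME>" + body + "</NAME>";
}
```

With surName set to "S" the slashed part stays distinguishable from a tagged SURN-node that follows.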

GeniML replication

Another usage might be to copy the element-naming of similar GEDCOM/XML-approaches. A configuration like this …

GeniML.cfg.awk   [included in ged1212xml.zip]
BEGIN {
    xmlRoot  = "GENIML" ;
    xmlStyle = "pedigree.xsl" ;
    surNAME  = "SURNAME" ;
    piSTY    = 0 ; # need no PIs
}

… replicates J.Fitzpatrick’s “GeniML” to apply his pedigree stylesheet in a second step:
siebold.GeniML.htm – Siebold’s data transformed by pedigree.xsl.

Valid XML ID/IDREF-values

Two problems to solve. (1) Some genealogical programs create @<XREF>@s with leading digits, conforming to GEDCOM, but not to XML attribute-values of type “ID/IDREF”. (2) IDs must remain unique, even if an application or transformation populates the XML-file with non-GED IDs for other purposes.

Neither problem is critical. Attributes named “ID” or “REF” need not be of type “ID/IDREF”; it is just that the ID-mechanisms provided by XML are then not usable as usual. It is up to you whether a fallback to key- and string-comparison is a flaw or not. Likewise, use your own algorithm to keep additional IDs unique.

The scripts introduce an option to get around another way: define a string (e.g. “id.” or a namespace-prefix lookalike “ged-”) that precedes all XREF-values. Doing this right makes IDs valid (letter-character first!) and forms a unique group of IDs/REFs originating from the GEDCOM-source. A valid reserved/special character as separator (dot or dash, no colon!) makes getting rid of any prefix an easy task. Some pseudo-codes to rebuild the XREF:

  • XREF = ID.split(".").pop()              (JavaScript)
  • XREF = ID.substr(idPFX.length)          (JavaScript)
  • XREF = substr(ID, index(ID,".")+1)      (awk)
  • XREF = substr(ID, length(idPFX)+1)      (awk)
  • <xsl:variable name="XREF" select="substring-after(@ID,'.')"/>
  • ...

Whitespace delimiters

As already mentioned above: GEDCOM lines with leading whitespace (due to indenting) and delimiters consisting of more than exactly one whitespace are tolerated (condensed). Empty lines are ignored. But keep in mind, that such files do not conform to the GEDCOM 5.5 specification.
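
How such tolerance can be achieved is sketched below with a single regular expression. It is a simplified illustration, not the pattern used by the scripts; parseGedLine and the shape of the returned object are assumptions.

```javascript
// Hypothetical sketch: split a GEDCOM line into level, xref, tag and value,
// tolerating leading whitespace and multiple whitespace delimiters.
var LINE_RE = /^\s*(\d+)\s+(?:(@[^@\s]+@)\s+)?([A-Za-z0-9_]+)(?:\s+(.*))?$/;

function parseGedLine(line) {
    var m = LINE_RE.exec(String(line).replace(/[\r\n]+$/, ""));
    if (!m) { return null; }              // empty or non-matching lines are skipped
    return {
        level: parseInt(m[1], 10),
        xref:  m[2] || null,              // e.g. "@221I@" on record lines
        tag:   m[3],
        value: m[4] || ""
    };
}
```

An indented line like "    2   DATE   @#DJULIAN@ 1563" parses exactly like its strictly formatted counterpart.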

Broken code

In general: there are no spec-checks! Lines that don’t fit the (not so strict) patterns are ignored and reported. If they pass as a “false positive” or fail as a “false negative”, this may result in non-well-formed XML. It’s the XML-parser’s task to check this. But of course it’s up to me to improve the patterns and scripts. Please contact me if the results of conversion are not satisfactory!



XML Excurses

Colon in PI targets ?

It seems worth repeating, and thus spreading, this little-known (to me too, previously) constraint on the Name-production of PIs/IDs: beware of the colon …

> The question is, is [colon in PI targets] allowed, and if not, why?

Not allowed, because colons are allowed only in element and attribute names. There is a note near the end of XML-Namespaces [2nd window] explaining this. (Of course they are allowed in vanilla XML 1.0 [2nd window] without namespaces, but they are discouraged, unless you are experimenting with a namespace mechanism other than XML-Namespaces.)

The mechanism for mapping PI targets to URIs is a notation declaration, as explained in XML-Rec. [¹]
Colons are not allowed in PI names *for namespace processing*; they certainly are allowed in XML 1.0.

If it is meant to be XML 1.0 conformant, it should allow colons when you are not performing namespace processing (though it's probably a good idea not to use them anyway).

Others may correct me, but I don't think that a conformant processor is allowed to reject well-formed XML (expect, perhaps, if the processing is performing validation and the document is not valid). Namespaces cannot change this, since XML 1.0 is an independent standard. [²]
[¹] XML-DEV (2nd window) of May 19, 1999
[²] XML-DEV (2nd window) ditto

Colon in ID values ?

Once again: beware of the colon …

It follows that in a namespace-well-formed document:
  • All element and attribute names contain either zero or one colon;
  • No entity names, processing instruction targets, or notation names contain any colons.
In addition, a namespace-well-formed document may also be namespace-valid.
Definition: A namespace-well-formed document is namespace-valid if it is valid according to the XML 1.0 specification, and all tokens other than element and attribute names which are REQUIRED, for XML 1.0 validity, to match the XML production for Name match this specification’s production for NCName.
It follows that in a namespace-valid document:
  • No attributes with a declared type of ID, IDREF(S), ENTITY(IES), or NOTATION contain any colons.

[¹] Namespaces in XML 1.0: Conformance of Documents
[²] cf. XML Schema Part 2 Datatypes: ID
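
A quick way to see which values survive these rules is a small check like the following JavaScript sketch. It is deliberately simplified: the pattern covers only the ASCII part of the NCName production (the real production also admits many non-ASCII letters), and the name isUsableAsXmlId is made up for illustration.

```javascript
// ASCII-only approximation of NCName (an assumption for brevity):
// first a letter or underscore, then letters, digits, dot, hyphen
// or underscore; a colon is never allowed.
var NCNAME_ASCII = /^[A-Za-z_][A-Za-z0-9._-]*$/;

function isUsableAsXmlId(value) {
    return NCNAME_ASCII.test(value);
}
```

With this check, “I1”, “id.123” and “ged-123” pass, while “123” (leading digit), “nn:I1” (colon) and “I132!1” (exclamation point) fail.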


UUID Excurses

From WP UUID: A Universally Unique ID (UUID) is a 16-octet/byte (128-bit) number. In its canonical form, a UUID is represented by 32 hexadecimal digits, displayed in five groups separated by hyphens, in the form 8-4-4-4-12 for a total of 36 characters (32 alphanumeric characters and four hyphens), like this …

xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
with x=[0-9a-f]

Anybody can create one: version 4 (UUIDv4) of the variants defined in RFC-4122 is generated from random numbers and identifies itself (its version) by constraints on two half-byte/string-positions (“nibbles”), thus having 6 fixed bits that mark it as random and 122 random bits, like this …

xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx
with y=[89ab]
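
How the two constrained nibbles come about can be sketched in a few lines of JavaScript. This is not the code of the scripts, just an illustration of the bit-fiddling; the function name uuidV4 is made up, and Math.random() is used only to keep the sketch portable (it is not a cryptographic random source).

```javascript
function uuidV4() {
    var bytes = [], hex = "", i;
    for (i = 0; i < 16; i++) {
        // ASSUMPTION: Math.random() is good enough for a demo only.
        bytes.push(Math.floor(Math.random() * 256));
    }
    bytes[6] = (bytes[6] & 0x0f) | 0x40;   // 4 fixed bits: version nibble = 4
    bytes[8] = (bytes[8] & 0x3f) | 0x80;   // 2 fixed bits: variant "10" -> y in [89ab]
    for (i = 0; i < 16; i++) {
        hex += (bytes[i] < 16 ? "0" : "") + bytes[i].toString(16);
    }
    // group the 32 hex digits as 8-4-4-4-12
    return hex.slice(0, 8) + "-" + hex.slice(8, 12) + "-" + hex.slice(12, 16) +
           "-" + hex.slice(16, 20) + "-" + hex.slice(20);
}
```

The two masks account for the 6 fixed bits; the remaining 122 bits stay random.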

It’s the bits !

As far as I can tell: the uniqueness of a UUID refers to (“is”) its 128-bit number, a binary value only! Not a string-pattern (straight or grouped) representing this value, not the lettercase of hexadecimal digits, not some separators grouping the digits, not a checksum (binary or string) anywhere, not any delimiters (curly braces enclosing GUIDs), nor anything else.

The notation (representation) of a 128-bit value is a matter of standards (RFC-4122 UUIDs) or quasi-standards (MS GUIDs, PAF GEDCOM _UIDs). The string-pattern may be of importance in other contexts, for other purposes. For example just to identify the function of a 128-bit value as a unique identifier, or furthermore (now truly as string) to serve as datatype “ID” in the purely textual context of XML.

A transformation of a given (identifiable) 128-bit value into different notations – and vice versa – without loss of identity and uniqueness is (or should be) possible and (more or less) necessary. Putting together the following RFC statements, this seems reasonable to me.

From the viewpoint of URN syntax (RFC-2141), differences in the lettercase of the “Namespace Specific String” part (= NSS = UUID) mean distinct URNs. But:

Some namespaces may define additional lexical equivalences, such as case-insensitivity of the NSS (or parts thereof). […]

Functional equivalence is determined by practice within a given namespace and managed by resolvers for that namespeace. Thus, it is beyond the scope of this document. Namespace registration must include guidance on how to determine functional equivalence for that namespace, i.e. when two URNs are the identical within a namespace. [¹]
The internal representation of a UUID is a specific sequence of bits in memory, […]. To accurately represent a UUID as a URN, it is necessary to convert the bit sequence to a string representation.

Each field is treated as an integer and has its value printed as a zero-filled hexadecimal digit string with the most significant digit first. The hexadecimal values “a” through “f” are output as lower case characters and are case insensitive on input. [²]
[¹] RFC-2141, URN Syntax
[²] RFC-4122, A Universally Unique IDentifier (UUID) URN Namespace
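
Read this way, comparing two UUIDs means comparing their 128-bit values, not their strings. A minimal JavaScript sketch, with the made-up names uuidBits and sameUuid, that reduces the common notations (plain, hyphen-grouped, {GUID}, urn:uuid:) to the same 32 hex digits; checksum-extended forms such as the PAF _UID are intentionally not handled here:

```javascript
// Reduce a textual UUID to its 32 lowercase hex digits (= 128 bits),
// or return null if no such value can be identified.
function uuidBits(text) {
    var hex = String(text)
        .replace(/^\s*urn:uuid:/i, "")   // optional URN prefix
        .replace(/[{}\s-]/g, "")         // braces, whitespace, hyphens
        .toLowerCase();                  // lettercase is not part of the value
    return /^[0-9a-f]{32}$/.test(hex) ? hex : null;
}

// Two notations are "the same UUID" if they carry the same 128 bits.
function sameUuid(a, b) {
    var x = uuidBits(a), y = uuidBits(b);
    return x !== null && x === y;
}
```

For example, sameUuid("{6BA7B810-9DAD-11D1-80B4-00C04FD430C8}", "urn:uuid:6ba7b810-9dad-11d1-80b4-00c04fd430c8") is true, although the two strings differ in almost every respect.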

Fix ’em !

Should UUIDs of the GEDCOM _UID-tag be fixed if they are somehow “broken” in such a way that they are likely to be rejected by vendor-applications used to the PAF quasi-standard, maybe for trivial reasons such as flawed, non- or other-standard UUID-formats? Rejection means a loss of the existing and relevant 128-bit value. Its internal representation is important for organizing and identifying data; the string representation is of minor importance.

Further readings:

  • _UID-Tag – by Tamura Jones
  • _UID-Tag – wiki of “genealogy.net” (German)

The “GED_UID.fix.awk” script follows Tamura Jones’ advice and even extends it by accepting more malformed variants as long as a relevant 128-bit value is identifiable. It assumes that UUIDs with wrong checksums (especially Ages’ 0000 nonsense suffix) were created with wrong algorithms rather than suffering from transmission errors. It reads GEDCOM files, checks the _UID-tags, presents replacements in a target-format and, at the user’s choice, replaces them, so that more imports become possible without UUIDs being rejected.

awk [-v FIX=0|1] [-v STY=<targetformat>] -f GED_UID.fix.awk [<]in.ged [>out.ged]
<targetformat>="_UID"|"GUID"|"UUID"|"XUID"|"UURN"|"XURN"
   "_UID" XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXCCCC
          PAF-GEDCOM-_UID 16+2 bytes, 36 chars uppercase hexdigit with checksum
   "UUID" xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
          RFC-4122-UUIDv4 16 bytes, 32+4 chars lowercase hexdigit hyphen-grouped
   "GUID" {XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}
          embraced {UUIDv4} 16 bytes, 32+6 chars uppercase hexdigit hyphen-grouped
   "XUID" {xXXxxXxx-xXXx-xXXX-Xxxx-xxxxXXXxXxXx}cccc
          extended mixedcase and -style {GUIDv4}, 4-hexdigit checksum appended
   "UURN" urn:uuid:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
          prefixed lowercase "urn:uuid:UUIDv4" (RFC-4122, UUID as URN)
   "XURN" urn:uuid:xxXxxXXX-xXxx-XXxX-XxXx-xXxxxxxXxXxX+cccc
          extended mixedcase "urn:uuid:UUIDv4+checksum" (RFCs 2141+3986+4122)
    else: XXXxXXXx-XXxx-XXxx-XXxX-xxxxXXXXxXxX cccc
          combined mixedcase UUIDv4 with 4-hexdigit checksum (set apart)
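
For illustration, here is a JavaScript sketch of how one identified 128-bit value might be written out in four of the target formats. It is not the code of GED_UID.fix.awk; the names uidChecksum, byteHex and formatUid are made up. The checksum routine is an assumption based on the commonly described PAF scheme (two running one-byte sums over the 16 bytes); please verify it against the readings listed above before relying on it.

```javascript
function byteHex(n) {
    return (n < 16 ? "0" : "") + n.toString(16);
}

// ASSUMPTION: a = running sum of the bytes, b = running sum of the
// intermediate values of a, both modulo 256, appended as 2+2 hex digits.
function uidChecksum(hex32) {
    var a = 0, b = 0, i;
    for (i = 0; i < 32; i += 2) {
        a = (a + parseInt(hex32.slice(i, i + 2), 16)) & 0xff;
        b = (b + a) & 0xff;
    }
    return byteHex(a) + byteHex(b);
}

// hex32: exactly 32 hex digits; style: one of the <targetformat> names.
function formatUid(hex32, style) {
    var h = hex32.toLowerCase();
    var grouped = h.slice(0, 8) + "-" + h.slice(8, 12) + "-" + h.slice(12, 16) +
                  "-" + h.slice(16, 20) + "-" + h.slice(20);
    switch (style) {
        case "_UID": return (h + uidChecksum(h)).toUpperCase();
        case "UUID": return grouped;
        case "GUID": return "{" + grouped.toUpperCase() + "}";
        case "UURN": return "urn:uuid:" + grouped;
        default:     return null;   // XUID, XURN and "else" are not sketched here
    }
}
```

Whatever the target format, the 32 hex digits (the 128-bit value) stay untouched; only the wrapping changes.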


Reverse transformation

The ged1212xml.rev.xsl reverse transformation stylesheet included in the archive isn’t a complete 1:1 return to source. Limitations are:

  1. The output-encoding is Unicode (UTF-8) and the corresponding HEAD/CHAR/VERS tags are changed or omitted.
  2. Loss of whitespace is possible due to normalization or default modes of the XSLT-processor.
  3. “@”s are not transformed to double “@@”s.

The stylesheet is parameterized to change the optional element- and attribute-names according to ged1212xml. Some standard issues and names are already covered:

  1. root and namespace (independent)
  2. elements S|SURN|SURNAME revert to slashed /surname/-part
  3. attributes ID|REF|ESC and xml:id revert to @<XREF>@ or @#<DTOKEN>@
  4. removal of id-prefixes in a colon-ized namespace style, e.g. “nn:” in ID="nn:XREF"


Yet to do ?

  1. GEDCOM-standard:
    Objects within a logical record can be associated. If this need exists, the pointer record composition contains an exclamation point (!) that separates the parent record’s cross-reference ID from the specific substructure’s cross-reference ID, which is at some subordinate level to the logical record at level zero. The cross-reference ID of the substructure subordinate to a zero level record, for inter-record associations is always composed of the Record ID number and the Substructure ID number, such as @I132!1@.

    Simply copying the XREF incl. separator-mark unchanged would again make it invalid as XML attribute-values of type “ID/IDREF”.

  2. GEDCOM-standard:
    All user-defined tags, tags used that have not been defined in the GEDCOM standard, must begin with an underscore character.

    Unknown tags~elements cannot be validated (true?). A workaround could be a special element defined to hold the user-GEDCOM-tag as an attribute-value. Ugly side-effect: user-defined tags may occur in any valid combination of GEDCOM-line elements, meaning the whole code has to be duplicated to catch the “tag-to-attribute” exception (as opposed to the “tag-to-element” default)?

    [2013-01] A possible solution could be an extra user-namespace for user-defined tags.

    <NN TAG="_USER">…</NN> vs
    <u:_USER xmlns:u="…URI…">…</u:_USER>
    xmlns:u="urn:user:gedcom55XML"
    xmlns:u="urn:xmlns:gedcom55XML+UserTags"
    


Typeset in/for Verdana & Courier
2008 ff. ©|© Stefan Unterstein