GEDCOM 1:1 to XML  


ged1212xml: GEDCOM one-to-one to XML

Abstract

Scripts (awk, or JavaScript for Windows Script Host, WSH) that convert GEDCOM data (GEnealogical Data COMmunication) one-to-one into well-formatted XML, as well-formed and valid as the GEDCOM source itself. Even very large GED files should be convertible with only modest system resources. The target format, defined elsewhere as “GEDCOM 5.5 XML”, enables validation against the GEDCOM 5.5 standard using XML mechanisms and schemata (RelaxNG + Schematron).

Additional features include conversion of dates to ISO 8601 standard and creation of UUIDs v4 according to RFC 4122.


About / Motivation

The scripts (“ged1212xml”) provided on this page attempt a close (~100%) one-to-one conversion of GEDCOM data into a very simple XML, following a project by Chad Albers called “GEDCOM 5.5 XML” (“gedcom55XML”) …

“… all GEDCOM tags are translated into XML elements; open and closed elements delimit the data; and the elements are nested in the same way prescribed by the GEDCOM specification.” (…)
“GEDCOM 5.5 XML attempts to be a 100 percent one-to-one translation of GEDCOM 5.5 into XML; it even includes the superfluous (and empty) <TRLR/> element.” [¹]
“GEDCOM 5.5 XML differentiates itself from GedML and GeniML because it attempts to replicate the LDS's GEDCOM 5.5 standard using XML markup. Without exception, all GEDCOM 5.5 tags should correspond to XML elements with the same name; all tags should be preserved; the parent-child relationships between the tags and elements should parallel one another; and all data delimited by the elements should fall within the strict guidelines of the standard.” [²]
[¹] Chad Albers at neomantic.com/gedcom55XML
[²] cf. the README notes on the competing GedML & GeniML for his gedcom55XML approach.
Searching the web for GEDCOM & XML is rather disappointing nowadays, apart from a very few sites that are surprisingly up to date and still active on this topic. Most projects from the past “GEDCOM to XML” hype made large promises about what using all the XML capabilities would bring:

  • more semantics
  • more data-structures
  • more cross-references
  • more encodings
  • more extendable
  • more readable (source)
  • more free tools
  • more …, &c

All true, all possible. But most of these projects remained drafts and seem to have been abandoned, while even the task of a 1:1 translation has not really been completed.

Installing and running “big” genealogical software (mostly shareware) just to import GEDCOM (with hidden loss of data?) and export to some XML (more data loss?) is a fussy and risky way to make already collected genealogical data available for further XML processing. Acting as a filter (import to export) is not what such software is made for. The background history of Tim Forsythe’s VGed (formerly “GEDCOM Validator”), see his blog entry “Pandora’s Box”, tells more of the whole story.

Technologies go by, data stay. Data + Format (as a generic markup, attaching computable semantics to data) represent the real hard effort that humans put into research, structure and validity. They must be preserved under all circumstances of changing technology. Loss of data, de-structuring (which makes data less readable and usable for humans and machines), and opaque formats are the worst faults. They result in data cemeteries that have to be touched and checked by humans again and again. Content without open-standards markup is dead, uncomputable, until reanimated by human review.

So why this fallback to a seemingly simplistic, half-buried approach? To a project that Michael H. Kay, the pioneer he is in many respects, began with GedML, and that everyone everywhere refers to? What are the remaining benefits?

First of all, it’s a simple format, easy to create, easy to control, and it’s not new. Earlier efforts (XSLT stylesheets) may be reused with only minor changes. The roots in the established GEDCOM 5.5 standard, which XML never managed to replace or become heir to, aren’t cut off. Getting your hands on the genealogical data is straightforward, and the next processing step can already be done with XML/XSL tools, e.g. transforming it into more ambitious XML dialects. For this, cf. Bill Kinnersley’s GEDC documentation, his XML-based standard and application, which is well worth reading even though it unfortunately uses an intermediate XML format of its own.

Last but not least (see Chad Albers’ approach with RelaxNG/Schematron), the structure and data types of a GED file can be validated against the GEDCOM 5.5 standard using XML mechanisms, provided they remain nearly unchanged. The scripts aim to be a possible replacement for the first step in his workflow (which continues to XSL-FO formatting objects and PDF).

Script-, GEDCOM-, and XML-Gurus: Interested in testing or even using this?
Please let me know about the good, the bad, and the ugly things. You are welcome.

Download

History

  1. [2008-09-18] – pre-release (testers)
  2. [2008-10-01] – initial release (public)
  3. [2008-10-11]
    • option added to distinguish slashed from tagged surname-parts by node-naming
    • XML-output additionally formatted with blank lines for easier “visual parsing”
    • GEDnoopp.awk added to archive to (un-)format GED-files likewise, as shown below
  4. [2008-11-20]
    • ged1212xml.rev.xsl XSLT-stylesheet added for rudimentary reverse transformation
  5. [2011-01-20]
    • fixed a logical ambiguity in evaluating the surname-script-option
    • added “Universally Unique IDentifiers” v4 (UUID) conforming to RFC 4122.
    • added (pseudo-xmlns) prefixes, though functionally meaningless, for PIs too.
  6. [2011-01-22]
    • fixed XML/namespace-spec violations: NMTOKEN vs NCName
    • (NCName represents XML “non-colonized” Names – a no-colon constraint)
    • changed PI-target-prefixes (now avoid colons!) from default to an option
    • renamed some options concerning namespaces/prefixes
  7. [2011-02-02]
    • complete renewal of processing-instruction handling:
    • predefined attribute- and function-PI-styles; “prefixes” replaced by name-config; etc.
    • ged1212xml.PI.xsl added to archive: a stylesheet-skeleton to tune PIs (or more)

Archive Contents

ged1212xml.awk (preview: ged1212xml.awk.htm)
awk-script to translate (hopefully) any GEDCOM file one-to-one to XML. It is “stand alone”, i.e. ANSEL-to-Entity routines are already included.
ANSELentify.awk
ANSEL-to-Entity as an awk-script of its own. Do not run it before ged1212xml! All the entities it creates would be deactivated by the translation of ampersands to &amp;, and that is not what is intended.
ANSELentify.sed
ANSEL-to-Entity as a sed-script. Just another offspring, for the “Stream EDitor”.
ged1212xml.wsf (preview: ged1212xml.wsf.htm)
Javascript to be run in the “Windows Script Host” (WSH) engine. Varying from above, this is not “stand alone”, but imports (i.e. requires) ANSELentify.js at runtime.
ANSELentify.js
ANSEL-to-Entity code imported + executed by ged1212xml.wsf.
GEDnoopp.awk
“GEDCOM normalize or pretty print”: formats a GED-file with indents and blank lines (to visually group records) in a first run; a second run “normalizes” the file (removes the formatting again, according to the standard) to restore its validity.
ged1212xml.rev.xsl (preview: ged1212xml.rev.xsl.htm)
Reverse Transformation Stylesheet (XML back to GEDCOM). Not quite 1:1, but usable “cum grano salis”. For limitations see remarks below.
ged1212xml.PI.xsl
A stylesheet-skeleton to adjust the style of PIs in a generated XML-file. All other nodes are passed through. It’s meant as a starting point for users, when the scripts’ PI-configuration is not satisfactory. Maybe handy for other kinds of transformation too.
ANSELentify.* files heavily depend on (conversions of) “ans2uni.con” (ZIP), and I owe many thanks to “Heiner Eichmann’s GEDCOM 5.5 Sample Page: ANSEL to Unicode conversion” and his “ANSEL to Unicode Conversion Tool”.
The UUID JavaScript code is derived and adapted from Robert Kieffer’s “Math.uuid.js” (JS) on his Broofa blog page. Thanks! It inspired my awk-code too.

Preview GEDCOM-, Script-, XML-Sources

The source/code-previews are simple HTML-exports from SciTE, the Scintilla text editor.

Siebold’s GEDCOM is indented and foldable just for readability. Indented lines (any leading whitespace!) and empty lines do not conform to the GEDCOM-specification, but ged1212xml tolerates and ignores them.

  1. ged1212xml.awk.htm – awk-code
  2. ged1212xml.wsf.htm – JavaScript/WSH-code
  3. ged1212xml.rev.xsl.htm – XSLT-code for reverse transformation
  4. siebold.GED.htm – Philipp Franz von Siebold’s GEDCOM formatted by GEDnoopp
  5. siebold.GED.xml.htm – Philipp Franz von Siebold’s gedcom55XML made by ged1212xml
  6. Ged2HTML Web-Presentation – Philipp Franz von Siebold’s lineage made by Ged2HTML

Get precompiled Win32 awk binaries/variants

  • gawk – GnuWin32.sourceforge.net
  • mawk – GnuWin32.sourceforge.net
  • nawk – GnuWin32.sourceforge.net

I am no real fan of utilities and runtimes with a high system impact (like Java or DotNET), nor of programming-languages requiring big downloads (like Perl, Python, Ruby, etc.). They may however be better suited to elegant solutions.

So I tried with “awk” and with “Windows Script Host” (WSH + Javascript). The latter is available on all MS Windows systems; “awk” is standard on all Linux/Unix systems, and for Win32 users it is a download of negligible size that only requires unpacking a single executable file (no install, no setup).

Get more GEDCOM files/infos


Usage of ged1212xml.awk

ged1212xml.awk
USAGE: [g|m|n]awk [-v var=value [-v …]] -f ged1212xml.awk 
       [<]infile.GED [>outfile.XML] [2>error.LOG]
NOTES: v-Options are required to be set before f-Options
       NCName represents XML "non-colonized" Names (no-colon constraint)
OPTIONS:
  -v ANSEL=0|1
      start "ANSEL to Entity"-mode before 1st occurrence of +n CHAR ANSEL
  -v nsPFX=""|<ncname>
      xml namespace prefix, requires setting of nsURI too, default=none
  -v nsURI=""|<uri>
      xml namespace URI for xmlns[:nsPFX]="…", default=none
  -v xmlEnc="iso-8859-1"|<encoding>
      replace xml declaration's default <?xml … encoding="iso-8859-1"?>
  -v xmlStyle=""|<file.css|file.xsl>
      insert processing-instruction <?xml-stylesheet href="…"?>, default=none
  -v xmlRoot="GED"|<ncname>
      replace root-element's default tag-name "GED"
  -v xmlID="ID"|"xml:id"|<ncname>
      replace attribute-name's default "ID" for GEDCOM's @<XREF>@s
  -v xmlIDREF="REF"|<ncname>
      replace attribute-name's default "REF" for GEDCOM's @<XREF>@s
  -v xmlDTD=""|<file.dtd>
      insert doctype-definition <!DOCTYPE … SYSTEM "…">, default=none
  -v xsiXSD=""|<file.xsd>
      insert root's xsi:XMLSchema-instance-location-definition, default=none
  -v idPFX=""|"id."|"ged-"|<ncname>
      ID-prefix for valid xmlID/REF-values (NCNames), default=none
      ID-prefix == string-additive, don't confuse it with namespace-prefixes!
  -v escDATE=""|"ESC"|<ncname>
      given name ("ESC" preferred, default=none=noop) 
      moves @#<DATE_CALENDAR_ESCAPE>@s into attributes
  -v surNAME="SURN"|"S"|<ncname>|<!ncname>
      alter node-name ("S" preferred, default="SURN") for slashed surname-part
      to avoid double SURN-subnodes in an extended NAME-node/structure
      a non-ncname char/string prevents slash-replacement at all
  -v piSTY=""|"attr"|"func"|"void"|"nopi"
      predefined attribute- or function-style for processing-instructions
      default="void" ~ empty for user-defined styles, otherwise plain style
      a non-defined value (like "nopi") prevents PI-generation at all
  -v piNCN=""|<ncname>
      PI-ncname for processing-instruction-targetnames, default=none
  -v datePI="DATE"|<ncname>|<!ncname>
      processing-instruction-targetname, default="DATE" becomes <?DATE ...?>
      date-format converted (if possible) to YYYY-MM-DD according to ISO 8601
      a non-ncname char/string prevents DATE PI-generation
  -v uuidPI="UUID"|<ncname>|<!ncname>
      processing-instruction-targetname, default="UUID" becomes <?UUID ...?>
      Universally Unique IDentifiers v4 (pseudo-random) according to RFC 4122
      a non-ncname char/string prevents UUID PI-generation
  -v uuidSEED=<integer>
      default=srand()
-v RS="\r"
Two GED-files in the “GEDCOM 5.5 Torture Test” package end lines in a single carriage return. The option sets awk’s “input Record Separator” (a builtin-variable) to this variant of linebreaks.
example: cromwell.cfg.awk
BEGIN {
    ANSEL   = 1 ;
    nsPFX   = "g" ;    # w/o colon!
    nsURI   = "urn:xmlns:gedcom55XML" ;
    xmlID   = "xml:id" ;
    idPFX   = "gid." ; # w/o colon!
    surNAME = "S" ;
    escDATE = "ESC" ;
}
“awk” allows multiple f-options, so users can collect all v-options specific to a project in a configuration file. The example overdoes it, but for the sake of demonstration, here is a possible result for an INDI node/record (as exported by “Heredis”):
awk -f cromwell.cfg.awk -f ged1212xml.awk cromwell.ged > cromwell.ged.xml
0 HEAD
  1 SOUR HEREDIS 7 PC
...
0 @221I@ INDI
  1 NAME Sir Oliver/CROMWELL/
    2 GIVN Sir Oliver
    2 SURN CROMWELL
  1 SEX M
  1 BIRT
    2 DATE @#DJULIAN@ 1563
  1 DEAT
    2 DATE 1655
  1 FAMS @317U@
  1 FAMS @227U@
  1 FAMC @204U@
...

<g:GED xmlns:g="urn:xmlns:gedcom55XML">
...
<g:INDI xml:id="gid.221I">
  <g:NAME>Sir Oliver<g:S>CROMWELL</g:S>
    <g:GIVN>Sir Oliver</g:GIVN>
    <g:SURN>CROMWELL</g:SURN>
  </g:NAME>
  <g:SEX>M</g:SEX>
  <g:BIRT>
    <g:DATE ESC="DJULIAN">1563</g:DATE><?DATE 1563-00-00?>
  </g:BIRT>
  <g:DEAT>
    <g:DATE>1655</g:DATE><?DATE 1655-00-00?>
  </g:DEAT>
  <g:FAMS REF="gid.317U"/>
  <g:FAMS REF="gid.227U"/>
  <g:FAMC REF="gid.204U"/>
</g:INDI><?UUID 82ce4049-26F0-4E9a-AF0E-CA94356Ff680?>
...
The XML-structure is namespaced and prefixed. XREFs are made valid ID/IDREF-values in a similar way (the gid.-prefix fakes a namespace just so that the value starts with a letter). The xml:id attribute name tells a capable parser that its value is of type “ID”, even without a DTD/Schema. The date-calendar-escape is moved into an ESC-attribute. As a side-effect this enables a transformation of the date to ISO standard format, appended as a processing-instruction. The slashed surname-part (now a <g:S>-node) differs from the tagged surname-part (<g:SURN>-node). A <?UUID…?>-PI appended after the closing tag of the <g:INDI>-node is available for later use.
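The nesting seen in the Cromwell example can be sketched with a stack of open elements: each GEDCOM level number closes everything at the same or a deeper level before the new element is opened. The following is a minimal sketch in plain JavaScript, not the code of ged1212xml.awk or ged1212xml.wsf; the function names gedToXml and escapeXml are hypothetical, options such as namespaces, idPFX, escDATE or PIs are left out, and it assumes that levels never increase by more than one.

```javascript
// Hypothetical helper: escape the characters that would break XML text.
function escapeXml(text) {
    return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Hypothetical sketch: GEDCOM lines to nested XML, driven by the level number.
function gedToXml(gedText, rootName) {
    var root = rootName || "GED";
    var out = ["<" + root + ">"];
    var open = [];                          // tags of the currently open elements
    var lines = gedText.split(/\r\n|\r|\n/);
    // level, optional @XREF@, tag, optional value (leading whitespace tolerated)
    var pattern = /^\s*(\d+)\s+(?:@([^@]+)@\s+)?(\w+)(?:\s+(.*))?$/;

    for (var i = 0; i < lines.length; i++) {
        var m = pattern.exec(lines[i]);
        if (!m) { continue; }               // empty or non-matching lines are skipped
        var level = parseInt(m[1], 10);
        while (open.length > level) {       // close same-level and deeper elements
            out.push("</" + open.pop() + ">");
        }
        var tag = m[3];
        var value = m[4] || "";
        var attr = m[2] ? ' ID="' + escapeXml(m[2]) + '"' : "";
        var ref = /^@([^@#][^@]*)@$/.exec(value);   // pointer value such as @317U@
        if (ref) {
            attr += ' REF="' + escapeXml(ref[1]) + '"';
            value = "";
        }
        out.push("<" + tag + attr + ">" + escapeXml(value));
        open.push(tag);
    }
    while (open.length > 0) {               // close whatever is still open
        out.push("</" + open.pop() + ">");
    }
    out.push("</" + root + ">");
    return out.join("");
}
```

Because only the stack of open tags is kept in memory, a line-by-line approach of this kind needs very little memory even for big files, which is presumably the property the abstract refers to.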

Usage of ged1212xml.wsf (JavaScript with WSH)

ged1212xml.wsf (imports ANSELentify.js)
USAGE: cscript //nologo ged1212xml.wsf [/name:value […]] 
       [/ged:infile.ged] [/xml:outfile.xml] [/log:error.log]
       [<stdin.ged] [>stdout.xml] [2>stderr.log]
NOTES: double slashes for cscript-arguments, e.g. //nologo, 
       single slashes for wsf-arguments, as below
       NCName represents XML "non-colonized" Names  (no-colon constraint)
OPTIONS:
  FILES
    /ged:<file.ged>
      GEDCOM input-filename, default=STDIN
    /xml:<file.xml>
      XML output-filename, default=STDOUT
    /log:<file.log>
      Logging output-filename, default=STDERR
  GED-INPUT-ENCODING-MODE
    /ans:true
      start "ANSEL to Entity"-mode before 1st occurrence of +n CHAR ANSEL
  XML-OUTPUT
    /nspfx:<ncname>
      xml namespace prefix, requires setting of /nsuri:<URI> too, default=none
    /nsuri:<URI>
      xml namespace URI for xmlns[:nsPFX]="…", default=none
    /enc:"iso-8859-1"|<encoding>
      replace xml declaration's default <?xml … encoding="iso-8859-1" ?>
    /sty:<file.css|file.xsl>
      insert processing-instruction <?xml-stylesheet href="…"?>, default=none
    /root:"GED"|<ncname>
      replace root-element's default tag-name "GED"
    /id:"ID"|"xml:id"|<ncname>
      replace attribute-name's default "ID" for GEDCOM's @<XREF>@s
    /ref:"REF"|<ncname>
      replace attribute-name's default "REF" for GEDCOM's @<XREF>@s
    /dtd:""|<file.dtd>
      insert doctype-definition <!DOCTYPE … SYSTEM "…">, default=none
    /xsd:""|<file.xsd>
      insert root's xsi:XMLSchema-instance-location-definition, default=none
    /idpfx:""|"id."|"ged-"|<ncname>
      ID-prefix for valid xmlID/REF-values (NCNames), default=none
      ID-prefix == string-additive, don't confuse it with namespace-prefixes!
    /esc:""|"ESC"|<ncname>
      given name ("ESC" preferred, default=none=noop) 
      moves @#<DATE_CALENDAR_ESCAPE>@s into attributes
    /sur:"SURN"|"S"|<ncname>|<!ncname>
      alter node-name ("S" preferred, default="SURN") for slashed surname-part
      to avoid double SURN-subnodes in an extended NAME-node/structure
      a non-ncname char/string prevents slash-replacement at all
    /pi:""|"attr"|"func"|"void"|"nopi"
      predefined attribute- or function-style for processing-instructions
      default="void" ~ empty for user-defined styles, otherwise plain style
      a non-defined value (like "nopi") prevents PI-generation at all
    /pincn:""|<ncname>
      PI-ncname for processing-instruction-targetnames, default=none
    /datepi:"DATE"|<ncname>|<!ncname>
      processing-instruction-targetname, default="DATE" becomes <?DATE ...?>
      date-format converted (if possible) to YYYY-MM-DD according to ISO 8601
      a non-ncname char/string prevents DATE PI-generation
    /uuidpi:"UUID"|<ncname>|<!ncname>
      processing-instruction-targetname, default="UUID" becomes <?UUID ...?>
      Universally Unique IDentifiers v4 (pseudo-random) according to RFC 4122
      a non-ncname char/string prevents UUID PI-generation
    /seed:noop
      built-in random() not seedable in ECMAScript

Special behaviour and features

ANSEL-to-Entity

About ANSEL-processing: by default it only switches on at the first occurrence of a GEDCOM header-line "+n CHAR ANSEL", and possibly off again if a similar line announces another encoding. For cases where ANSEL is already used earlier in the header text, the scripts provide an option to handle ANSEL from the very beginning.
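The switching behaviour described here amounts to a small state machine. As a rough sketch in plain JavaScript (not taken from the scripts; the function name anselModePerLine and its parameters are hypothetical), the mode per input line could be tracked like this, with startInAnsel standing in for the ANSEL=1 or /ans:true option:

```javascript
// Hypothetical sketch: report for every line whether ANSEL mode is active.
// The CHAR line itself already counts as being in the newly announced mode.
function anselModePerLine(lines, startInAnsel) {
    var on = Boolean(startInAnsel);
    var flags = [];
    for (var i = 0; i < lines.length; i++) {
        // "+n CHAR <encoding>" switches the mode on (ANSEL) or off (anything else)
        var m = /^\s*\d+\s+CHAR\s+(\S+)/.exec(lines[i]);
        if (m) {
            on = (m[1].toUpperCase() === "ANSEL");
        }
        flags.push(on);
    }
    return flags;
}
```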

Processing-Instructions

This is experimental. Think of gedcom55XML as a minimal prescription that can be enhanced with additional nodes (elements and attributes in other namespaces) or by-side-instructions.

Analysing or transforming text (text-node-values) with XSLT presents difficulties. In places where the scripts can do better, a processing-instruction (“PI”) with additional results may be named and inserted immediately after the closing tag of an element. PIs are not defined in a DTD or Schema and do not break validation, but they can easily be accessed with XSLT.

PI: DATE

Currently this is done if a valid English date form is available in "+n DATE <DATE_EXACT>" lines. In this case the scripts append an ISO-form of the date as …</DATE><?DATE yyyy-mm-dd?>.

In other words, as a general rule: a PI should contain just another representation – e.g. prepared according to standards and usability – of the preceding element’s value.
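A date conversion of this kind might look as follows. This is only a sketch in plain JavaScript under stated assumptions, not the scripts' own routine: the name gedDateToIso is hypothetical, only the forms "day month year", "month year" and "year" with English three-letter month names are handled, and missing parts are padded with 00 because the Cromwell sample output on this page shows <?DATE 1563-00-00?>.

```javascript
var MONTHS = { JAN: 1, FEB: 2, MAR: 3, APR: 4, MAY: 5, JUN: 6,
               JUL: 7, AUG: 8, SEP: 9, OCT: 10, NOV: 11, DEC: 12 };

// Left-pad a number or digit string with zeros to the given width.
function pad(n, width) {
    var s = String(n);
    while (s.length < width) { s = "0" + s; }
    return s;
}

// Hypothetical sketch: "2 OCT 1822" -> "1822-10-02"; returns null if the
// value is not a plain date (e.g. "ABT 1900" or a date range).
function gedDateToIso(value) {
    var m = /^\s*(?:(?:(\d{1,2})\s+)?([A-Za-z]{3})\s+)?(\d{1,4})\s*$/.exec(value);
    if (!m) { return null; }
    var month = 0;
    if (m[2]) {
        month = MONTHS[m[2].toUpperCase()] || 0;
        if (month === 0) { return null; }   // three letters, but not a month
    }
    var day = m[1] ? parseInt(m[1], 10) : 0;
    return pad(m[3], 4) + "-" + pad(month, 2) + "-" + pad(day, 2);
}
```

Returning null for anything that is not recognised matches the “if possible” wording of the datePI option: in that case no PI would be appended.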

PI: UUID

For whatever it may be useful (“A UUID is 128 bits long, can guarantee uniqueness across space and time, and requires no central registration process.”), all level-zero nodes/records "0 @<XREF>@ TOKEN" get a <?UUID xxxxxxxx-xxxx-4xxx-Yxxx-xxxxxxxxxxxx?> appended by default after the closing tag. Some genealogical programs already generate and use non-standard but UUID-lookalike values in non-standard user-defined GEDCOM-tags like "_UID".

Diverging from the RFC 4122 UUID-spec – “The hexadecimal values "a" through "f" are output as lower case characters and are case insensitive on input.” – the output of randomly mixed-case letters enhances the syntactic scope and thus raises uniqueness where sensitive syntax matters (e.g. xml:id). Users can normalize UUIDs to lower/upper case or keep the difference.
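The pattern xxxxxxxx-xxxx-4xxx-Yxxx-xxxxxxxxxxxx can be filled in with a few lines of JavaScript. The sketch below is not the code used by the scripts (which, as noted, is derived from Math.uuid.js); the function name uuidV4 and its optional random parameter are hypothetical. It emits lower-case digits only, i.e. the normalized form the RFC describes rather than the mixed-case variant discussed above, and Math.random is not a cryptographically strong source.

```javascript
// Hypothetical sketch: pseudo-random version 4 UUID, lower-case output.
// "random" may be any function returning a number in [0, 1); default Math.random.
function uuidV4(random) {
    var rnd = random || Math.random;
    var hex = "0123456789abcdef";
    var pattern = "xxxxxxxx-xxxx-4xxx-Yxxx-xxxxxxxxxxxx";
    var out = "";
    for (var i = 0; i < pattern.length; i++) {
        var c = pattern.charAt(i);
        if (c === "x") {
            out += hex.charAt(Math.floor(rnd() * 16));      // any hex digit
        } else if (c === "Y") {
            out += hex.charAt(8 + Math.floor(rnd() * 4));   // variant: 8, 9, a or b
        } else {
            out += c;                                       // hyphens and version "4"
        }
    }
    return out;
}
```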

Style of PI
<?target value?>
General format of processing-instructions; target must be a “non-colonized” name (NCName)
<?DATE yyyy-mm-dd?>
<?UUID xxxxxxxx-xxxx-4xxx-Yxxx-xxxxxxxxxxxx?>
BEGIN {
    piSTY  = "void" ;  # all default
    datePI = "DATE" ;
    uuidPI = "UUID" ;
}
Default settings (declared only for demonstration) render all PIs “plain styled”, i.e. the PI-target names the PI and the PI-value is just the “flat” generated value. Any undefined piSTY-token disables all PIs; a non-NCName disables the targeted PI only.
awk -v piSTY=nopi ...
or  -f config.awk ...
BEGIN {
    piSTY  = 0 ; # disables all PIs
    # or some/one of … selected PIs
    # datePI = 0 ;
    # uuidPI = 0 ;
}
<?GED DATE="…value…"?>
<?GED UUID="…value…"?>
BEGIN {
    piSTY  = "attr" ;
}
PI-target defaults to XML-Root-Element (name’s local part); PI-value uses predefined attribute-style/names
<?GedCom DATE("…value…");?>
<?GedCom UUID("…value…");?>
BEGIN {
    piSTY    = "func" ;
    xmlRoot  = "GedCom" ;
}
PI-target mirrors xmlRoot; PI-value uses predefined function-style/names
<?app isoDate("…value…");?>
<?app uuid_v4("…value…");?>
BEGIN {
    piNCN  = "app" ;  # target ~ NCName w/o colons required!
    piSTY  = "func" ;
    datePI = "isoDate" ;
    uuidPI = "uuid_v4" ;
}
PI-target set to an application; PI-value uses predefined function-style and assumed function-names
<?app ged_date("I83336","DEAT","…value…");?>
<?app ged_uuid("I83336","…value…");?>

sabcmd ged1212xml.PI.xsl the.GED.xml

XSLT can “Pimp My PI”, e.g. include related data (preceding/parent ID, parent node name) as additional function arguments, and of course add many more PIs, nodes and data, or transform everything into something completely different.
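The three predefined styles listed in this section (plain, attribute, function) differ only in how target, name and value are arranged. A minimal JavaScript sketch of that arrangement, with a hypothetical function formatPi that is not part of the scripts, could be:

```javascript
// Hypothetical sketch of the PI styles shown in this section.
//   style  : "attr", "func", anything else = plain
//   target : PI target for attr/func style (e.g. root name or piNCN)
//   name   : PI name such as "DATE" or "UUID"
//   value  : generated value; must not contain "?>"
function formatPi(style, target, name, value) {
    if (style === "attr") {
        return "<?" + target + " " + name + '="' + value + '"?>';
    }
    if (style === "func") {
        return "<?" + target + " " + name + '("' + value + '");?>';
    }
    return "<?" + name + " " + value + "?>";   // plain: the name is the target
}
```

For example, formatPi("func", "app", "isoDate", "1655-00-00") yields <?app isoDate("1655-00-00");?>, the shape of the piNCN example above.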

Date calendar escapes

The DATE-line again, but no decision yet: what to do with the “date-calendar-escape” sequences? Do they need special treatment, or are they just like any other value? By analogy with IDs and REFs, a sequence enclosed in “@” (regular expression: /@#D(GREGORIAN|JULIAN|HEBREW|FRENCH R|ROMAN|UNKNOWN)@/) that has a comparable meta-aspect should be moved into an attribute, e.g. <DATE ESC="DTOKEN">. The scripts provide an option for testing this.

BTW: the possible whitespace inside the "FRENCH R"-token/pattern (French Revolutionary Calendar) is annoying. Under certain circumstances the sequence is split into separate fields and requires yet another exception to be handled.
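Applied to the whole DATE value rather than to whitespace-separated fields, the regular expression quoted above handles the "FRENCH R" case without a special exception. A small JavaScript sketch (the function name splitCalendarEscape and the shape of its result are hypothetical):

```javascript
// Calendar escape at the start of a DATE value, as in the pattern quoted above.
var CALENDAR_ESCAPE = /^@#D(GREGORIAN|JULIAN|HEBREW|FRENCH R|ROMAN|UNKNOWN)@\s*/;

// Hypothetical sketch: split "@#DJULIAN@ 1563" into the escape token
// (for an ESC attribute) and the remaining date text.
function splitCalendarEscape(value) {
    var m = CALENDAR_ESCAPE.exec(value);
    if (!m) {
        return { esc: null, date: value };     // no escape: value unchanged
    }
    return { esc: "D" + m[1], date: value.slice(m[0].length) };
}
```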

Slashes to “S” vs “SURN” etc

Slashes delimit and mark the surname-part (like /surname/) of a NAME-structure. By default they are converted to a SURN-subnode, although there is no hint or convention for naming this node. A problem may occur if a SURN-tag is present too (according to the standard it is optional and, note, additional) and therefore doubles the SURN-node. To avoid this, or to prevent slash-replacement altogether, the node-name can be altered by an option, preferably to the “S” of Kay’s GedML. A char/string that is not a Non-Colonized Name (not of type “NCName”) switches off any replacement and leaves the slashed surname-part unchanged.
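As an illustration of this option, the following JavaScript sketch wraps the slashed part in a configurable element and leaves the value untouched when the configured name is not usable. It is not the scripts' implementation: the names nameToXml, escapeXml and isNcNameAscii are hypothetical, and the NCName test is simplified to ASCII letters, digits, dot, dash and underscore.

```javascript
// Hypothetical helper: escape the characters that would break XML text.
function escapeXml(text) {
    return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Simplified, ASCII-only approximation of the NCName production.
function isNcNameAscii(name) {
    return /^[A-Za-z_][A-Za-z0-9._-]*$/.test(name);
}

// Hypothetical sketch: "Sir Oliver/CROMWELL/" with surName "S"
// becomes "Sir Oliver<S>CROMWELL</S>".
function nameToXml(value, surName) {
    var m = /^([^\/]*)\/([^\/]*)\/(.*)$/.exec(value);
    if (!m || !isNcNameAscii(surName)) {
        return escapeXml(value);            // no slashes, or replacement switched off
    }
    return escapeXml(m[1]) +
           "<" + surName + ">" + escapeXml(m[2]) + "</" + surName + ">" +
           escapeXml(m[3]);
}
```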

GeniML replication

Another usage might be to copy the element-naming of similar GEDCOM/XML-approaches. A configuration like this …

GeniML.cfg.awk   [included in ged1212xml.zip]
BEGIN {
    xmlRoot  = "GENIML" ;
    xmlStyle = "pedigree.xsl" ;
    surNAME  = "SURNAME" ;
    uuidPI   = 0 ; # no UUIDs
}

… replicates J. Fitzpatrick’s “GeniML” so that his pedigree stylesheet can be applied in a second step:
siebold.GeniML.htm – Siebold’s data transformed by pedigree.xsl.

Valid XML ID/IDREF-values

Two problems to solve. (1) Some genealogical programs create @<XREF>@s with leading digits, conforming to GEDCOM, but not to XML attribute-values of type “ID/IDREF”. (2) IDs must remain unique, even if an application or transformation populates the XML-file with non-GED IDs for other purposes.

Neither problem is critical. Attributes named “ID” or “REF” need not be of type “ID/IDREF”; only the ID-mechanisms provided by XML are then not usable as usual. It is up to you whether a fallback to key- and string-comparison is a flaw or not. Likewise, use your own algorithm to keep additional IDs unique.

The scripts offer an option for another way around: define a string (e.g. “id.” or a namespace-prefix lookalike “ged-”) that precedes all XREF-values. Doing this right makes IDs valid (letter-character first!) and forms a unique group of IDs/REFs originating from the GEDCOM-source. A valid reserved/special character as separator (dot or dash, no colon!) makes getting rid of any prefix an easy task. Some pseudo-code to rebuild the XREF (JavaScript, awk and XSLT):

  • XREF = ID.split(".").pop()              (JavaScript; assumes the XREF itself contains no dot)
  • XREF = ID.substr(idPFX.length)          (JavaScript)
  • XREF = substr(ID, index(ID, ".") + 1)   (awk; index() finds the literal dot, whereas match(ID, ".") would match any character)
  • XREF = substr(ID, length(idPFX) + 1)    (awk; strings are 1-indexed)
  • <xsl:variable name="XREF" select="substring-after(@ID,'.')"/>
  • ...
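The round trip between XREF and prefixed ID, including the reason for the prefix, can be tried out with a short JavaScript sketch. The function names are hypothetical and the NCName check is a simplified ASCII-only approximation of the XML production:

```javascript
// Simplified, ASCII-only approximation of the NCName production:
// must start with a letter or underscore, must not contain a colon.
function isNcNameAscii(name) {
    return /^[A-Za-z_][A-Za-z0-9._-]*$/.test(name);
}

// Hypothetical sketch: "@221I@" with prefix "gid." becomes "gid.221I".
function xrefToId(xref, idPfx) {
    return idPfx + xref.replace(/^@|@$/g, "");
}

// Hypothetical sketch: "gid.221I" with prefix "gid." becomes "@221I@" again.
function idToXref(id, idPfx) {
    var bare = id.indexOf(idPfx) === 0 ? id.slice(idPfx.length) : id;
    return "@" + bare + "@";
}
```

Here isNcNameAscii("221I") is false because of the leading digit, while isNcNameAscii(xrefToId("@221I@", "gid.")) is true; a colon-separated prefix such as "g:221I" fails the check.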

Whitespace delimiters

As already mentioned above: GEDCOM lines with leading whitespace (due to indenting) and delimiters consisting of more than exactly one whitespace are tolerated (condensed). Empty lines are ignored. But keep in mind, that such files do not conform to the GEDCOM 5.5 specification.
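What “tolerated (condensed)” could mean in practice is sketched below in JavaScript. This is an assumption-laden illustration, not the GEDnoopp.awk or ged1212xml code: the function name normalizeGed is hypothetical, whitespace inside the line value is deliberately kept, and lines that do not look like GEDCOM lines are passed through with only the indentation removed.

```javascript
// Hypothetical sketch: remove indentation and empty lines, and reduce the
// delimiters between level, optional @XREF@, tag and value to single spaces.
function normalizeGed(text) {
    var lines = text.split(/\r\n|\r|\n/);
    var out = [];
    for (var i = 0; i < lines.length; i++) {
        var line = lines[i].replace(/^\s+/, "");     // drop leading whitespace
        if (line === "") { continue; }               // drop empty lines
        var m = /^(\d+)\s+(?:(@[^@\s]+@)\s+)?(\w+)(?:\s+(.*))?$/.exec(line);
        if (!m) {                                    // not a recognisable GEDCOM line
            out.push(line);
            continue;
        }
        var parts = [m[1]];
        if (m[2]) { parts.push(m[2]); }
        parts.push(m[3]);
        if (m[4]) { parts.push(m[4]); }              // value kept as it is
        out.push(parts.join(" "));
    }
    return out.join("\n");
}
```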

Broken code

In general: there are no spec-checks! Lines that don’t fit the (not so strict) patterns are ignored and reported. If they pass as a “false positive” or fail as a “false negative”, this may result in XML that is not well-formed. It’s the XML-parser’s task to check this. But of course it’s my job to improve the patterns and scripts. Please contact me if the results of the conversion are not satisfactory!



XML Excursus

Colon in PI targets ?

It seems worth repeating, and thus spreading, this little-known constraint (previously unknown to me too) on the Name-production of PIs/IDs: beware of the colon …

> The question is, is [colon in PI targets] allowed, and if not, why?

Not allowed, because colons are allowed only in element and attribute names. There is a note near the end of XML-Namespaces explaining this. (Of course they are allowed in vanilla XML 1.0 without namespaces, but they are discouraged, unless you are experimenting with a namespace mechanism other than XML-Namespaces.)

The mechanism for mapping PI targets to URIs is a notation declaration, as explained in XML-Rec. [¹]
Colons are not allowed in PI names *for namespace processing*; they certainly are allowed in XML 1.0.

If it is meant to be XML 1.0 conformant, it should allow colons when you are not performing namespace processing (though it's probably a good idea not to use them anyway).

Others may correct me, but I don't think that a conformant processor is allowed to reject well-formed XML (expect, perhaps, if the processing is performing validation and the document is not valid). Namespaces cannot change this, since XML 1.0 is an independent standard. [²]
[¹] XML-DEV of May 19, 1999
[²] XML-DEV, ditto

Colon in ID values ?

Once again: beware of the colon …

It follows that in a namespace-well-formed document:
  • All element and attribute names contain either zero or one colon;
  • No entity names, processing instruction targets, or notation names contain any colons.
In addition, a namespace-well-formed document may also be namespace-valid.
Definition: A namespace-well-formed document is namespace-valid if it is valid according to the XML 1.0 specification, and all tokens other than element and attribute names which are REQUIRED, for XML 1.0 validity, to match the XML production for Name match this specification’s production for NCName.
It follows that in a namespace-valid document:
  • No attributes with a declared type of ID, IDREF(S), ENTITY(IES), or NOTATION contain any colons.

[¹] Namespaces in XML 1.0: Conformance of Documents
[²] cf. XML Schema Part 2 Datatypes: ID


Reverse transformation

The ged1212xml.rev.xsl reverse transformation stylesheet included in the archive isn’t a complete 1:1 return to source. Limitations are:

  1. The output encoding is Unicode (UTF-8), and the corresponding HEAD/CHAR/VERS tags are changed or omitted.
  2. Loss of whitespace is possible due to normalization or default modes of the XSLT-processor.
  3. “@”s are not transformed to double “@@”s.

The stylesheet is parameterized to change the optional element- and attribute-names according to ged1212xml. Some standard issues and names are already covered:

  1. root and namespace (independent)
  2. elements S|SURN|SURNAME revert to slashed /surname/-part
  3. attributes ID|REF|ESC and xml:id revert to @<XREF>@ or @#<DTOKEN>@
  4. removal of id-prefixes in a colon-ized namespace style, e.g. “nn:” in ID="nn:XREF"


Yet to do ?

  1. GEDCOM-standard:
    Objects within a logical record can be associated. If this need exists, the pointer record composition contains an exclamation point (!) that separates the parent record’s cross-reference ID from the specific substructure’s cross-reference ID, which is at some subordinate level to the logical record at level zero. The cross-reference ID of the substructure subordinate to a zero level record, for inter-record associations is always composed of the Record ID number and the Substructure ID number, such as @I132!1@.

    Simply copying the XREF incl. separator-mark unchanged would again make it invalid as XML attribute-values of type “ID/IDREF”.

  2. GEDCOM-standard:
    All user-defined tags, tags used that have not been defined in the GEDCOM standard, must begin with an underscore character.

    Unknown tags~elements cannot be validated (true?). A workaround could be a special element defined to hold the user-GEDCOM-tag as an attribute-value, e.g. <NN TAG="_USER">…</NN>. Ugly side-effect: user-defined tags may occur in any valid combination of GEDCOM-line elements, meaning the whole code would have to be duplicated to catch the “tag-to-attribute” exception (as opposed to the “tag-to-element” default)?



Typeset in/for Verdana & Courier
2008 ff. ©|© Stefan Unterstein