
#2 reinstating nested categories

open
5
2014-08-18
2004-10-14
No

Please see full text of the proposal in the accompanying
attachment. Thank you.

Discussion

  • Sanda Ionescu

    Sanda Ionescu - 2004-10-14

    Proposal for Reinstating Nested Categories in the DDI

     
  • Sanda Ionescu

    Sanda Ionescu - 2004-10-15
    • assigned_to: nobody --> sandai
     
  • Tom Piazza

    Tom Piazza - 2004-10-18
    • labels: 608044 --> 608046
    • milestone: --> 374394
    • assigned_to: sandai --> wlthomas
     
  • Sanda Ionescu

    Sanda Ionescu - 2004-10-18
    • labels: 608046 --> Validating Technical Revision
    • milestone: 374394 --> [SRG] Structural Reform Group
     
  • I-Lin Kuo

    I-Lin Kuo - 2004-10-22

    Logged In: YES
    user_id=298249

    IK:
    There are essentially 3 parts to this proposal:
    - markup of type A hierarchical categories
    - markup of type B hierarchical categories
    - specification of <catLevel>

    For type B hierarchical categories, I recommend that it be
    accepted as is. It covers the desired use case in a simple
    and easily understood manner.

    For type A hierarchical categories, I find the use of IDREFs
    technically unnecessary, though it probably fits the
    sponsors' data. As such, it may yet be adopted as an interim
    measure in 2.0. However, I have submitted a proposal for
    hierarchical categories for 3.0, and I feel that the SRG
    should ask the sponsors to adopt my recommendation for Type
    A markup for 3.0.

    This specification of <catLevel>, as it currently stands,
    may be of only limited use due to irregularity of the
    hierarchy. For example, suppose we have a study with
    detailed information about the U.S. and less detailed
    information about other countries. A hierarchy might be the
    following:

    -- United States
       -- Alaska
          -- Anchorage
       -- Michigan
          -- Detroit
          -- Lansing
          -- Flint
    -- Canada
       -- Toronto
       -- Vancouver

    In this example, there is clearly a country level and a city
    level. While we could specify <catLevel levelnm="countries"
    levelno="1"/>, we would have a problem with <catLevel
    levelnm="cities" levelno=?/> because the level number for
    cities is not well-defined due to the irregularity. Yet, it
    is clear that a cities level does exist.

    This type of hierarchy should be fairly common.

    I would recommend that the <catLevel> specification be sent
    back to the author/sponsors for further revision, and that
    adoption be delayed until the above example can be accommodated.
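
The irregularity described above can be checked mechanically. What follows is a minimal Python sketch, not part of the proposal. It encodes the example hierarchy as a nested dict and collects the levels at which cities occur. The dict representation and the function names are illustrative only, and the nesting is an assumption read from the example: states sit under United States, while the Canadian cities sit directly under Canada.

```python
# Hypothetical encoding of the example hierarchy. The nesting is an assumption
# read from the example: states under United States, cities directly under Canada.
HIERARCHY = {
    "United States": {
        "Alaska": {"Anchorage": {}},
        "Michigan": {"Detroit": {}, "Lansing": {}, "Flint": {}},
    },
    "Canada": {"Toronto": {}, "Vancouver": {}},
}

CITIES = {"Anchorage", "Detroit", "Lansing", "Flint", "Toronto", "Vancouver"}


def depths(tree, level=1):
    """Yield (name, level) for every node, with top-level nodes at level 1."""
    for name, children in tree.items():
        yield name, level
        yield from depths(children, level + 1)


def city_levels(tree, cities):
    """Return the sorted list of distinct levels at which the given names occur."""
    return sorted({level for name, level in depths(tree) if name in cities})


if __name__ == "__main__":
    # Cities turn up at more than one depth, so no single levelno fits them all.
    print(city_levels(HIERARCHY, CITIES))
```

Under these assumptions the cities appear at levels 2 and 3, which is why a single levelno for a "cities" level is not well defined, while the countries all sit at level 1.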

     
  • Sanda Ionescu

    Sanda Ionescu - 2004-10-22

    Logged In: YES
    user_id=1134872

    With regard to I-Lin's comment, I need to correct a
    misunderstanding: "Type A" category groupings, as
    exemplified in my proposal, are NOT hierarchies - that is a
    central point of my argumentation.
    Also, markup of "Type A" groupings is NOT part of my
    proposal - this type of markup is ALREADY ENABLED in the
    DDI, in fact right now it is the only possible way to mark up
    any kind of groupings.
    I am only using it as an example to show that this kind of
    markup is inappropriate for hierarchies.
    It is incorrect to suggest that this type of markup "may be
    adopted as an interim measure" in V 2.0, because it is already
    in V 2.0. What I'm suggesting is to merely leave it in for the
    time being.

     
  • Pascal Heus

    Pascal Heus - 2004-11-03

    Logged In: YES
    user_id=918403

    A few thoughts and comments:

    1) For the purpose of nesting categories, I would rather for
    now keep using the catGrp attribute of the catgryGrp
    element. This is consistent with variable grouping and we
    should have one coherent methodology across the DDI.
    Taking a different approach may create confusion.
    Furthermore, removing the catGrp element from the
    specification in a minor revision implies that all existing
    documents must be upgraded and that the XSL transforms
    and application need to be revised to take both cases into
    account. I actually like the idea of nesting but this may need
    to wait for the major 3.0 version.

    2) I'm not particularly in favor of numbering levels; this may
    be difficult to maintain and can lead to inconsistencies. I'm
    more in favor of having the application determine what
    the levels are. I would rather add a universe
    or definition attribute to the catgryGrp element (though the
    text element could already be used).

    3) Another use case that should be taken into account by
    this proposal is the one we briefly discussed during the SRG
    meeting when a variable can have multiple sets of categories.
    A classic case is age, which can be a continuous variable but
    is also often classified into several age groups. If you take, for
    example, the US Census, you'll find age categorized as Age in
    5-year groups, Age in 10-year groups, and Age 10-54 in 5-year
    groups. Income groups can be used as another case.
    This functionality is supported by the free CSPro software
    (http://www.census.gov/ipc/www/cspro/) used by many
    national statistical offices for data entry and processing.

    4) Looking at the case B, could it be described as follows by
    defining groups and categories with the same label? This does
    not require nesting and each category can have its statistics.

    <catgryGrp ID="G0" catgry="C1"
    catGrp="G1">Occupation</catgryGrp>
    <catgryGrp ID="G1" catgry="C2 C5 C6 C7 C10 C11"
    catGrp="G2 G7 G11">Management, professional and related
    occupations</catgryGrp>
    <catgryGrp ID="G2" catgry="C3 C4">Management
    occupations</catgryGrp>
    <catgryGrp ID="G7" catgry="C8 C9">Architecture and
    engineering occupations</catgryGrp>
    <catgryGrp ID="G11" catgry="C12 C13">Education, training
    and library occupations</catgryGrp>
    <catgry ID="C1">
    <catValu>1</catValu>
    <labl>Management, professional and related
    occupations</labl>
    </catgry>
    <catgry ID="C2">
    <catValu>2</catValu>
    <labl>Management occupations</labl>
    </catgry>
    <catgry ID="C3">
    <catValu>3</catValu>
    <labl> Top executives</labl>
    </catgry>
    <catgry ID="C4">
    <catValu>4</catValu>
    <labl> Financial managers</labl>
    </catgry>
    <catgry ID="C5">
    <catValu>5</catValu>
    <labl> Business and financial operations occupations
    </labl>
    </catgry>
    <catgry ID="C6">
    <catValu>6</catValu>
    <labl> Computer and mathematical
    occupations</labl>
    </catgry>
    <catgry ID="C7">
    <catValu>7</catValu>
    <labl> Architecture and engineering occupations
    </labl>
    </catgry>
    <catgry ID="C8">
    <catValu>8</catValu>
    <labl>Architects</labl>
    </catgry>
    <catgry ID="C9">
    <catValu>9</catValu>
    <labl>Engineers</labl>
    </catgry>
    <catgry ID="C10">
    <catValu>10</catValu>
    <labl> Legal occupations</labl>
    </catgry>
    <catgry ID="C11">
    <catValu>11</catValu>
    <labl> Education, training and library occupations
    </labl>
    </catgry>
    <catgry ID="C12">
    <catValu>12</catValu>
    <labl>Teachers</labl>
    </catgry>
    <catgry ID="C13">
    <catValu>13</catValu>
    <labl>Librarians</labl>
    </catgry>

     
  • Sanda Ionescu

    Sanda Ionescu - 2004-11-03

    Logged In: YES
    user_id=1134872

    My response to Pascal's comments:
    1) I am NOT proposing removal of the catGrp element. My
    proposal is validating. The catGrp element will continue to be
    used for conceptual groupings (non-hierarchical).
    Unfortunately, it cannot be used for hierarchical groupings -
    see 4) below.

    2) Any comments from Nesstar? They requested level naming
    and numbering for building tables.

    3) I think this is a goal for Version 3.0. The current proposal
    is for a minor, validating change to Version 2.0.

    4) In Pascal's markup example there is nothing that shows
    that categories 3 and 4 (G2) are in fact subordinate to
    category 2, and none other. In other words, G2 needs to be
    linked directly to C2. In Pascal's example, G2 is linked to G1,
    but NOT to C2. We need to be able to indicate exactly how
    lower-level categories link up to higher level categories,
    otherwise the hierarchy cannot be recreated.
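
The missing link can be made concrete with a short Python sketch. It is an illustration only: the markup is an abbreviated copy of the linking example under discussion, wrapped in a `<var>` root so that it parses, and the helper names `containing_groups` and `ancestors` are hypothetical. Following the `catgry` and `catGrp` references upward from C3 reaches groups only; the category C2 is never on the path.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the linking markup under discussion, wrapped in a <var>
# root so that it parses as one document (group labels omitted for brevity).
MARKUP = """
<var>
  <catgryGrp ID="G1" catgry="C2 C5 C6 C7 C10 C11" catGrp="G2 G7 G11"/>
  <catgryGrp ID="G2" catgry="C3 C4"/>
  <catgry ID="C2"><catValu>2</catValu><labl>Management occupations</labl></catgry>
  <catgry ID="C3"><catValu>3</catValu><labl>Top executives</labl></catgry>
  <catgry ID="C4"><catValu>4</catValu><labl>Financial managers</labl></catgry>
</var>
"""


def containing_groups(root):
    """Map each referenced ID to the ID of the catgryGrp that lists it."""
    parent = {}
    for grp in root.iter("catgryGrp"):
        for attr in ("catgry", "catGrp"):
            for ref in grp.get(attr, "").split():
                parent[ref] = grp.get("ID")
    return parent


def ancestors(parent, node_id):
    """Follow the containing-group links upward from node_id."""
    chain = []
    while node_id in parent:
        node_id = parent[node_id]
        chain.append(node_id)
    return chain


if __name__ == "__main__":
    links = containing_groups(ET.fromstring(MARKUP))
    print(ancestors(links, "C3"))  # group IDs only; C2 is a sibling, not a parent
```

In this reading C2 is just another member of G1, alongside the group G2, so nothing in the markup says that C3 and C4 are subordinate to C2.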

     
  • Wendy Thomas

    Wendy Thomas - 2004-11-03

    Logged In: YES
    user_id=979766

    The current catgryGrp structure allows for creating
    hierarchies through nesting by allowing a catgryGrp to
    include one or more catgryGrp IDs as well as catgry IDs. The
    problem, as I understand it, arises when data for multiple
    levels of categories are presented as a single series (for
    example, geography, industrial codes, or occupation codes).

    At present (in terms of the nCube structure) you have two
    options. One is to describe both levels separately and use
    matching labels on the equivalent catgry's in the upper
    level description and the catgryGrp label of the more detailed
    level. While this allows for data to be reported for all
    categories regardless of level, and for the detailed levels to
    be related to their larger groups, there is no mechanism for
    clearly identifying those equivalency relationships between the
    catgryGrp of one var and the catgry of another var.

    The other option is to make a single variable with all
    categories listed as equivalents... but you lose the nesting
    pattern.

    If you then use the catgryGrp element to indicate groupings
    of level 2 or level 3 categories you still are not addressing
    those categories that serve as BOTH a category with a data
    item attached to it and as a subgroup heading.

    Example:

    <var>
    <labl>Households</labl>
    <catgryGrp ID="CG1" catGrp="CG2" catgry="C2" level="1">
    <labl>Households</labl>
    </catgryGrp>
    <catgryGrp ID="CG2" catgry="C4 C5 C6" level="2">
    <labl>Family Households</labl>
    </catgryGrp>
    <catgry ID="C1">
    <catValu>1</catValu>
    <labl>Households</labl>
    </catgry>
    <catgry ID="C2">
    <catValu>2</catValu>
    <labl>Nonfamily Households</labl>
    </catgry>
    <catgry ID="C3">
    <catValu>3</catValu>
    <labl>Family Households</labl>
    </catgry>
    <catgry ID="C4">
    <catValu>4</catValu>
    <labl>Married Couple Household</labl>
    </catgry>
    <catgry ID="C5">
    <catValu>5</catValu>
    <labl>Male headed family, no wife present</labl>
    </catgry>
    <catgry ID="C6">
    <catValu>6</catValu>
    <labl>Female headed family, no husband present</labl>
    </catgry>
    </var>

    The visual output of this would look something like:

    Households xxxx
    Households:
        Nonfamily xxxx
        Family xxxx
        Family:
            Married Couple xxxx
            Male head xxxx
            Female head xxxx

    Understandable to the human but too loose for the computer
    to then manipulate the data easily.
    So what's lacking is the ability to say CG1 is the equivalent of
    C1 and that CG2 is the equivalent of C3.

    Am I missing something here in terms of the problem?

    wendy

     
  • Sanda Ionescu

    Sanda Ionescu - 2004-11-03

    Logged In: YES
    user_id=1134872

    Yes, Wendy, thank you. That is exactly the problem. Sanda.

     
  • I-Lin Kuo

    I-Lin Kuo - 2004-11-04

    Logged In: YES
    user_id=298249

    1) Regarding Pascal's 2nd point:
    --------------------------------
    I agree with Pascal that in general, a level is a calculated
    attribute. Thus,

    2) Regarding Pascal's 3rd point:
    --------------------------------
    While the idea of a virtual recode was discussed, that was
    only to pin down the concept. For those who were not there,
    a virtual recode can be thought of as a recoded variable but
    without physical data -- its data being derived from another
    variable's data. This is useful when it is desired to
    redisplay a continuous variable such as date in discrete
    date ranges without having to actually create a column. I
    agree that conceptually, the catlevel part of this proposal
    should fall under this concept. However, no actual
    implementation mechanism was discussed for this.

    3) Regarding Pascal's 4th point:
    --------------------------------
    "Looking at the case B, could it be described as follows by
    defining groups and categories with the same label? This
    does not require nesting and each category can have its
    statistics."

    While not directly pertinent to the proposal, there is a
    design viewpoint expressed here that I disagree with -- the
    preference for linking over nesting. While linking is a more
    powerful mechanism, nesting is a simpler and more robust
    mechanism. Linking IDs should really only be generated by
    software. Hand-edits of markup with linking can easily trash
    the linking structure without errors being detected, whereas
    hand-edits of nested structures are less error-prone, and
    errors may easily be detected and corrected. Thus, my
    preference is to avoid linking in those situations where its
    power is not needed. So when both linking and nesting
    mechanisms can accomplish the same goal, I would prefer
    nesting to linking. This is especially true for hierarchical
    relationships such as nested categories. For
    non-hierarchical relationships such as the relationship
    between variable and question, linking is unavoidable.

    4) Regarding Wendy's comment:
    -----------------------------
    "If you then use the catgryGrp element to indicate groupings
    of level 2 or level 3 categories you still are not
    addressing those categories that serve as BOTH a category
    with a data item attached to it and as a subgroup heading."

    That's not a problem with the revised mechanism which I
    propose in "Hierarchical Categories in 3.0". In that
    proposal, the key to resolving this is that catgryGrp should
    not be used to indicate whether it has subcategories, precisely
    because of this dilemma Wendy raised. In that proposal, both
    catgryGrp and catgry may be nested, and the difference
    between a catgryGrp and catgry is whether or not it has a
    <catValu> subelement. In that proposal, both Type A and Type
    B are treated in a uniform manner.

    In Wendy's example:

    Households xxxx
    Households:
        Nonfamily xxxx
        Family xxxx
        Family:
            Married Couple xxxx
            Male head xxxx
            Female head xxxx

    This could be marked up in two ways:

    1. With catgryGrps which exactly duplicates the display above:
    <catgry>
    <catValu></catValu>
    <labl>Households</labl>
    <catgryGrp>
    <labl>Households:</labl>
    <catgry>
    <catValu></catValu>
    <labl>Nonfamily</labl>
    </catgry>
    <catgry>
    <catValu></catValu>
    <labl>Family</labl>
    </catgry>
    <catgryGrp>
    <labl>Family:</labl>
    <catgry>
    <catValu></catValu>
    <labl>Married Couple</labl>
    </catgry>
    <catgry>
    <catValu></catValu>
    <labl>Male head</labl>
    </catgry>
    <catgry>
    <catValu></catValu>
    <labl>Female head</labl>
    </catgry>
    </catgryGrp>
    </catgryGrp>
    </catgry>

    I don't think this is desired because the Family and
    Households labels are redundant. Or it may be marked up in
    the following way:

    <catgry>
    <catValu></catValu>
    <labl>Households</labl>
    <catgry>
    <catValu></catValu>
    <labl>Nonfamily</labl>
    </catgry>
    <catgry>
    <catValu></catValu>
    <labl>Family</labl>
    <catgry>
    <catValu></catValu>
    <labl>Married Couple</labl>
    </catgry>
    <catgry>
    <catValu></catValu>
    <labl>Male head</labl>
    </catgry>
    <catgry>
    <catValu></catValu>
    <labl>Female head</labl>
    </catgry>
    </catgry>
    </catgry>

    which would display the following without the redundancies:

    Households xxxx
        Nonfamily xxxx
        Family xxxx
            Married Couple xxxx
            Male head xxxx
            Female head xxxx

    Please read my proposal for further details
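
As a rough illustration of how the second, purely nested markup yields both the display and the level numbers without any stored level attribute, here is a minimal Python sketch. It assumes the nested `<catgry>` structure shown above, with the empty `<catValu>` elements left out for brevity; the helper names `walk` and `render` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Nested markup as in the second alternative above; the empty <catValu>
# elements are omitted here because they play no role in computing levels.
NESTED = """
<catgry>
  <labl>Households</labl>
  <catgry><labl>Nonfamily</labl></catgry>
  <catgry>
    <labl>Family</labl>
    <catgry><labl>Married Couple</labl></catgry>
    <catgry><labl>Male head</labl></catgry>
    <catgry><labl>Female head</labl></catgry>
  </catgry>
</catgry>
"""


def walk(catgry, level=1):
    """Yield (level, label) pairs in document order, depth first."""
    yield level, catgry.findtext("labl")
    for child in catgry.findall("catgry"):  # direct children only
        yield from walk(child, level + 1)


def render(root):
    """Indent each label by two spaces per level below the top."""
    return "\n".join("  " * (level - 1) + label for level, label in walk(root))


if __name__ == "__main__":
    print(render(ET.fromstring(NESTED)))
```

The level here is computed from the nesting depth alone, which is the sense in which a level is a calculated attribute.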

     
  • Ken Miller

    Ken Miller - 2004-11-04

    Logged In: YES
    user_id=1145404

    The fundamental difference as I see it is that in Sanda's TYPE
    B example ALL levels of the hierarchy are VALID category
    values.
    This is best described using the potentially reinstated nested
    <catgry> WITHOUT using <catgryGrp> at all (see Sanda's
    markup).

    -Management, professional and related occupations (Catgry C1)
        -Management occupations (Catgry C2)
            -Top executives (Catgry C3)
            -Financial managers (Catgry C4)
        -Business and financial operations occupations (Catgry C5)
        -Computer and mathematical occupations (Catgry C6)
        -Architecture and engineering occupations (Catgry C7)
            -Architects (Catgry C8)
            -Engineers (Catgry C9)
        -Legal occupations (Catgry C10)
        -Education, training and library occupations (Catgry C11)
            -Teachers (Catgry C12)
            -Librarians (Catgry C13)

    The TYPE B example above could be converted to Sanda's
    TYPE A (see below), where the 2 higher levels DO NOT have
    actual category values but are there just for clarification.
    This is best described using the existing <catgryGrp>, using
    the catGrp attribute for the "Management, professional and
    related occupations" grouping and the catgry attribute for the
    2nd-level groupings (e.g. Management occupations).

    -Management, professional and related occupations
        -Management occupations
            -Top executives (Catgry C1)
            -Financial managers (Catgry C2)
        -Business and financial operations occupations
            -Accountants (Catgry C3)
        -Computer and mathematical occupations
            -Programmers (Catgry C4)
        -Architecture and engineering occupations
            -Architects (Catgry C5)
            -Engineers (Catgry C6)
        -Legal occupations
            -Lawyers (Catgry C7)
        -Education, training and library occupations
            -Teachers (Catgry C8)
            -Librarians (Catgry C9)

    Therefore I don't see the need to complicate matters by
    introducing the levelNo attribute, as this can be determined
    from the nesting. Similarly, adding a <catLevel> element to
    describe the hierarchical levels (e.g. Country / State / County /
    City / etc.) only complicates it further.

     
  • Wendy Thomas

    Wendy Thomas - 2004-11-04

    Logged In: YES
    user_id=979766

    In terms of a technical review we have 3 basic questions to
    ask:
    1) does the proposed solution adequately address the problem
    described?
    2) does it conflict with the conceptual model (version 2.0)?
    3) does it create ambiguities with existing XML instances (this
    is in terms of application of the current DTD rather than a
    question of invalidation)?

    My thoughts:
    1) The only variables that would be described
    differently under this proposal are those with nested
    categories in aggregate data files... OR would this also apply to
    some geographic codes in microdata sets? I think we need to
    be very clear about its intended use.

    If it is the former then, yes, this addresses the problem.

    2) My concern here is with the introduction of a <catLevel>
    tag. We have a similar situation in <catgryGrp>, which
    HAS levelNo and levelNm as attributes. I
    would propose that if we add this information at the <catgry>
    level, it be done in a manner consistent with <catgryGrp>.

    3) We are creating 2 ways to address nested categories.
    There are numerous sets of metadata that use the nCube
    with the following rule: "In an additive nCube, the sum of all
    the cells should equal the universe". This means that a nested
    hierarchy is described as multiple nCubes (one for each level).
    This was done because we are describing the data contents
    rather than the layout. It was just this conflict that resulted
    in the decision to remove the "nested category" option from
    version 1.3.

    wendy
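
The conflict with the additive nCube rule in point 3 can be shown with a small Python sketch. The counts below are invented purely for illustration and come from no real data set; the level numbers follow the Households example earlier in the thread, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical counts, invented for illustration only.
CELLS = [
    # (label, level, count)
    ("Households", 1, 100),
    ("Nonfamily", 2, 40),
    ("Family", 2, 60),
    ("Married Couple", 3, 45),
    ("Male head", 3, 5),
    ("Female head", 3, 10),
]
UNIVERSE = 100  # all households, again an assumed figure


def total(cells):
    """Sum every cell, as the additive rule would for a single nCube."""
    return sum(count for _, _, count in cells)


def by_level(cells):
    """Sum the cells of each level separately (one nCube per level)."""
    sums = defaultdict(int)
    for _, level, count in cells:
        sums[level] += count
    return dict(sums)


if __name__ == "__main__":
    print(total(CELLS), UNIVERSE)  # the single-series total exceeds the universe
    print(by_level(CELLS))
```

With these made-up numbers the single nested series sums to 260 against a universe of 100, whereas the per-level sums are each additive over their own universe (all households at levels 1 and 2, family households at level 3).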

     
  • I-Lin Kuo

    I-Lin Kuo - 2004-11-04

    Logged In: YES
    user_id=298249

    Re: wlthomas 2004-11-04 09:52

    Re 2) I'm not sure that it is possible to add the @levelno
    and @levelnm to <catgry> in a way consistent with
    <catgrygrp>. From the examples that I have seen, @levelno
    and @levelnm refer to the levels within the primary
    hierarchy that the <catgrygrp> belongs to. In the situation
    pertaining to Wendy's suggestion and to the <catLevel>
    proposal, the @levelno and @levelnm belong to a second
    external but related hierarchy, if I understand correctly.

    a) I should also like to make the point that while this is
    submitted as a single proposal, it is actually three
    proposals in one (Type A markup, Type B markup, <catLevel>).
    Each of these can stand on its own and does not depend on
    the others to work. As such, I think actions should be
    applied to each separately. Thus, I'll reiterate my stance:

    Type A markup: suggest I-Lin's revision be used so that both
    Type A and Type B are handled uniformly.
    Type B markup: accept as is
    <catLevel>: reject or ask for revision to address concerns

     
  • Mark Diggory

    Mark Diggory - 2004-11-04

    Logged In: YES
    user_id=208348

    I agree with Sanda in recognizing the need to
    represent nested categories more cleanly. I think there are
    actually three alternatives for solving this:

    1.) Nest "categories" as Sanda has described. (The problem is
    that more complex category structures cannot be
    described as a tree; how are these handled? In the DDI, the
    strategy for a solution to this has been (2).)

    2.) Link "categories" in a similar fashion as used in other
    areas of the DDI using ID/IDREFS.

    3.) There is an alternate strategy: redefine catGrps to also
    be categories in their own right simply by adding <catValu>
    to <catgryGrp>.

    <var>
    <labl>Households</labl>

    <catgryGrp ID="CG1" catGrp="CG2" catgry="C2" level="1">
    <labl>Households</labl>
    *<catValu>1</catValu>* <!-- taken from C1, C1 removed -->
    </catgryGrp>

    <catgryGrp ID="CG2" catgry="C4 C5 C6" level="2">
    <labl>Family Households</labl>
    *<catValu>3</catValu>* <!-- taken from C3, C3 removed -->
    </catgryGrp>

    <catgry ID="C2">
    <catValu>2</catValu>
    <labl>Nonfamily Households</labl>
    </catgry>

    <catgry ID="C4">
    <catValu>4</catValu>
    <labl>Married Couple Household</labl>
    </catgry>

    <catgry ID="C5">
    <catValu>5</catValu>
    <labl>Male headed family, no wife present</labl>
    </catgry>

    <catgry ID="C6">
    <catValu>6</catValu>
    <labl>Female headed family, no husband present</labl>
    </catgry>

    </var>

    this results in:

    Households: xxxx
        Nonfamily xxxx
        Family: xxxx
            Married Couple xxxx
            Male head xxxx
            Female head xxxx

    instead of:

    Households xxxx
    Households:
        Nonfamily xxxx
        Family xxxx
        Family:
            Married Couple xxxx
            Male head xxxx
            Female head xxxx

    There is then still only one way to define a category hierarchy,
    and the presence/absence of a "catValu" defines whether the
    <catgryGrp> is itself a category as well.

    -Mark
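
The presence/absence rule in alternative 3 is easy to apply in software. The Python sketch below is only an illustration of that rule. It uses the two groups from the example (with well-formed `<labl>` tags and without the level attribute) plus a third group, CG9, which is hypothetical and added here solely to show a grouping without a value; the function names are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# CG1 and CG2 follow the example above; CG9 is a hypothetical value-less group
# added only to show the other branch of the rule.
MARKUP = """
<var>
  <catgryGrp ID="CG1" catGrp="CG2" catgry="C2">
    <labl>Households</labl>
    <catValu>1</catValu>
  </catgryGrp>
  <catgryGrp ID="CG2" catgry="C4 C5 C6">
    <labl>Family Households</labl>
    <catValu>3</catValu>
  </catgryGrp>
  <catgryGrp ID="CG9" catgry="C2 C4">
    <labl>Purely conceptual grouping (hypothetical)</labl>
  </catgryGrp>
</var>
"""


def is_category(grp):
    """A group counts as a category in its own right if it carries a catValu."""
    return grp.find("catValu") is not None


def classify(root):
    """Map each catgryGrp ID to True (also a category) or False (grouping only)."""
    return {grp.get("ID"): is_category(grp) for grp in root.iter("catgryGrp")}


if __name__ == "__main__":
    print(classify(ET.fromstring(MARKUP)))
```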

     
  • Mark Diggory

    Mark Diggory - 2004-11-04

    Logged In: YES
    user_id=208348

    I'd also like to point out that in I-Lin's revision the
    capability of categoryGroup to act as a grouping is lost
    when nested categories are used.

    <catgry>
    <catValu></catValu>
    <labl>Households</labl>
    <catgry>
    <catValu></catValu>
    <labl>Nonfamily</labl>
    </catgry>
    <catgry>
    <catValu></catValu>
    <labl>Family</labl>
    <catgry>
    <catValu></catValu>
    <labl>Married Couple</labl>
    </catgry>
    <catgry>
    <catValu></catValu>
    <labl>Male head</labl>
    </catgry>
    <catgry>
    <catValu></catValu>
    <labl>Female head</labl>
    </catgry>
    </catgry>
    </catgry>

    This capability would be maintained if, instead of throwing
    out categoryGroup in 3.0, it were kept and made an
    extension of category. Then, in his example, the duplication
    would be removed by removing the duplicate categories, not
    the categoryGroups.

    <catgryGrp>
    <labl>Households:</labl>
    *<catValu></catValu>*
    <catgry>
    <catValu></catValu>
    <labl>Nonfamily</labl>
    </catgry>
    <catgryGrp>
    <labl>Family:</labl>
    *<catValu></catValu>*
    <catgry>
    <catValu></catValu>
    <labl>Married Couple</labl>
    </catgry>
    <catgry>
    <catValu></catValu>
    <labl>Male head</labl>
    </catgry>
    <catgry>
    <catValu></catValu>
    <labl>Female head</labl>
    </catgry>
    </catgryGrp>
    </catgryGrp>

    This could then still capture both A and B type Category
    structures.

    -Mark

     
  • I-Lin Kuo

    I-Lin Kuo - 2004-11-04

    Logged In: YES
    user_id=298249

    "I'd also like to point out that in I-Lin's revision the
    capability of categoryGroup to act as a grouping is lost
    when nested categories are used."

    Actually, I don't believe it's lost. I am simply reserving
    the grouping of <catgryGrp> only to those instances when
    <catgryGrp> is not itself a category. When the grouping is
    itself a category (ie has a <catValu>), I use a <catgry> to
    do the grouping.

    In any case, I think Mark's alternative revision is
    technically equivalent to mine in representative power. It
    requires the user to understand that a <catgryGrp> with a
    <catValu> is also a <catgry>. I think that his rule is
    slightly more intuitive than my rule: a <catgry> group
    always has a <catValu> while a <catgryGrp> never does.

    I would be fine with either revision, but I think a revision
    is needed.

     
  • Sanda Ionescu

    Sanda Ionescu - 2004-11-05

    Logged In: YES
    user_id=1134872

    I'm in a rush and can't respond properly to all these
    comments, but would just like to emphasize that if a decision
    is taken to make everything nestable, I prefer I-Lin's model
    (where groups are NOT assigned values). Thank you all,
    Sanda.

     
  • Achim Wackerow

    Achim Wackerow - 2004-11-08

    Logged In: YES
    user_id=1145408

    At the risk of repeating similar comments, here are my thoughts.

    I tend toward the view that category hierarchies belong to a
    presentational layer of a table. The question then
    arises: do we want to have a presentational layer in DDI? The
    DDI should concentrate on information about data, not on the
    presentation of the data.

    Regarding the occupation example, the information on category
    value and category hierarchy could be stored in a clean way
    in separate variables without changing DDI version 2.0;
    in the example this information is stored in a mixed form in
    one variable. Actually, each category hierarchy level
    represents one level of information, which could be stored
    in a separate variable. In the example we have three levels
    (variables) for the concept occupation: major group, minor
    group, and unit group (this structure is found not only in
    aggregated data but also in microdata, e.g. the ISCO occupation
    code). Based on this information an application could
    rebuild the category hierarchies for display purposes.

    Aside from my notes above: within the logic of the proposal I
    don't see the necessity for the element catLevel. An XSL
    stylesheet could redisplay the hierarchy by processing the
    category hierarchy recursively (a computed level).
    The proposal lacks an example which demonstrates the
    necessity of catLevel combined with the nested category
    approach.

    In general I am not sure this is the right time
    to make a change to DDI version 2.0 according to this
    proposal. We should concentrate on DDI version 3.0 to build
    a clean new model to document microdata and aggregated data.

     
  • I-Lin Kuo

    I-Lin Kuo - 2004-11-09

    Logged In: YES
    user_id=298249

    re Achim...

    I'm not sure that the hierarchies are at a pure presentation
    level. For instance, the data may have a country column and
    a highest-degree-of-education column with values
    "U.S.|Germany" and "HighSchool|College|Gymnasium" and yet
    not indicate that Gymnasium is subordinate to Germany. This
    would be part of the metadata, no?

    While true that the occupation hierarchy might be stored as
    three separate variables -- major group, minor group, and
    unit group -- I think it is incorrect to say that because
    the data may be stored in this way, we can rebuild the
    hierarchy.

    Two reasons:

    a) The first reason is philosophical. We are striving
    toward a logical representation of the data independent of
    the physical storage, and Achim's argument tends
    dangerously (at least it appears to me) toward having DDI
    dictate the means of storage. Absent the mandatory
    creation of these three variables, I cannot think of another
    way in which the hierarchies can be constructed.

    b) The second more subtle reason is that even if these three
    variables are provided, there still must be additional
    information to reconstruct a hierarchy. From a computer's
    point of view, just because in the existing data the unit
    group "Architect" always occurs with the minor group
    "Architecture and engineering occupations" and never with
    the minor group "Legal Occupations", does not mean the
    latter pairing is impossible. This is a very important point
    as there are other hierarchies where unit group labels may
    indeed be shared. This conceptual prohibition against the
    "Architect-legal occupations" pairing in any future data is
    an aspect of the metadata that cannot be inferred from the
    data, and thus requires some kind of markup in addition to
    the existence of the variables themselves.
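
Point b) can be sketched in Python as follows. The rows and the parent mapping are hypothetical and serve only to illustrate the argument: the observed rows show which pairings have occurred, while only a separately declared hierarchy can say which pairings are allowed. "Lawyers" and all function and variable names are illustrative assumptions.

```python
# Hypothetical data rows: (minor group, unit group) as they might occur in a file.
OBSERVED = [
    ("Architecture and engineering occupations", "Architects"),
    ("Architecture and engineering occupations", "Engineers"),
    ("Legal occupations", "Lawyers"),
]

# Hypothetical metadata: the declared parent of each unit group.
DECLARED_PARENT = {
    "Architects": "Architecture and engineering occupations",
    "Engineers": "Architecture and engineering occupations",
    "Lawyers": "Legal occupations",
}


def seen_pairs(rows):
    """Pairs that happen to occur in the data; says nothing about unseen pairs."""
    return set(rows)


def allowed(minor, unit, parents=DECLARED_PARENT):
    """A pairing is valid only if the metadata declares this minor group as parent."""
    return parents.get(unit) == minor


if __name__ == "__main__":
    pair = ("Legal occupations", "Architects")
    print(pair in seen_pairs(OBSERVED))  # unseen, but the data cannot say "forbidden"
    print(allowed(*pair))                # the declared hierarchy rules it out
```

The data alone leave the "Architects / Legal occupations" pairing merely unobserved; it takes the declared mapping, that is, additional markup, to state that it is prohibited.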

     
  • Achim Wackerow

    Achim Wackerow - 2004-11-09

    Logged In: YES
    user_id=1145408

    re I-Lin

    In general I would like to have a clear and clean storage of
    data alongside a good description of the data. From my
    experience of analyzing data I have the impression that
    combined/mixed variables often cause problems for
    researchers. I know that this intention regarding the data
    could go beyond the purpose of DDI - the description of the
    data - thus this attitude could be problematic.

    I do not feel comfortable with the idea that DDI should
    provide means of description for every kind of mixed
    variable representation (an extreme example is the multi-punch
    data format, more common in older days). From the perspective of
    an analyst it would be preferable to have one variable
    for one concept and a description of how the variables
    depend on each other; ideally the archive processes the
    data into this clean form and describes it. From the perspective
    of an archivist it could perhaps also make sense to describe
    the data in whatever form. I lean toward the attitude
    that archives/data providers should provide data and
    metadata in a form which can be used easily, not only
    archived; that's apparently more the view of an
    analyst/user. The question will be: where between the
    analyst's and the archivist's positions do we want to position DDI?

    Regarding the occupation example: it is correct that from a
    computer's point of view every combination of major and
    minor group is possible. The dependency between these
    variables is often described by an external classification
    scheme. In DDI version 2.0 the dependency could be described
    with a combination of catgryGrp and catgry elements
    referring to IDs in other variables (probably a slight
    misuse of the intention of the IDREFS attributes of these
    elements, and thus a dirty solution).

    <var name="v2">
      <labl>Occupation - minor group</labl>
      <catgryGrp catgry="v2_mingrp4" catGrp="v3_mingrp4"/>
      <catgry ID="v2_mingrp4">
        <catValu>mingrp4</catValu>
        <labl>Architecture and engineering occupations</labl>
      </catgry>
    </var>
    <var name="v3">
      <labl>Occupation - unit group</labl>
      <catgryGrp ID="v3_mingrp4" catgry="v3_C1 v3_C2">
        <labl>Architecture and engineering occupations</labl>
      </catgryGrp>
      <catgry ID="v3_C1">
        <catValu>1</catValu>
        <labl>Architects</labl>
      </catgry>
      <catgry ID="v3_C2">
        <catValu>2</catValu>
        <labl>Engineers</labl>
      </catgry>
    </var>

     
  • Wendy Thomas

    Wendy Thomas - 2004-11-10

    Logged In: YES
    user_id=979766

    If all we are trying to accomplish is identifying the relationship
    between the catgryGrp and the catgry for those variables
    that have data values for various levels of a hierarchy, why
    not keep it simple?

    Add an attribute "equiv" to <catgryGrp> as an IDREF to the
    <catgry>.

    This accomplishes the following:

    1. Allows the category group to both be equivalent to
    a category in hierarchies and to contain child categories

    Households <data>
      Nonfamily <data>
      Family <data>
        Married Couple <data>
        Male head <data>
        Female head <data>

    <var>
      <labl>Households</labl>
      <catgryGrp ID="CG1" equiv="C1" catGrp="CG2" catgry="C2" level="1">
        <labl>Households</labl>
      </catgryGrp>
      <catgryGrp ID="CG2" equiv="C3" catgry="C4 C5 C6" level="2">
        <labl>Family Households</labl>
      </catgryGrp>
      <catgry ID="C1">
        <catValu>1</catValu>
        <labl>Households</labl>
      </catgry>
      <catgry ID="C2">
        <catValu>2</catValu>
        <labl>Nonfamily Households</labl>
      </catgry>
      <catgry ID="C3">
        <catValu>3</catValu>
        <labl>Family Households</labl>
      </catgry>
      <catgry ID="C4">
        <catValu>4</catValu>
        <labl>Married Couple Household</labl>
      </catgry>
      <catgry ID="C5">
        <catValu>5</catValu>
        <labl>Male headed family, no wife present</labl>
      </catgry>
      <catgry ID="C6">
        <catValu>6</catValu>
        <labl>Female headed family, no husband present</labl>
      </catgry>
    </var>

    2. Provides the option of levelno and levelnm already
    present in <catgryGrp>
    3. Allows linking between separate variable descriptions
    by pointing from the <catgryGrp> of the variable describing
    the lower level of a hierarchy to the <catgry> of a variable
    providing the upper level of the hierarchy as an equivalent
    4. Retains a consistent means of describing hierarchies
    regardless of whether data is provided for the totals or
    subtotals (upper levels of hierarchies)
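
    Point 3 could look roughly like the following sketch, which
    reuses the occupation variables from the earlier comment. The
    equiv attribute is the one proposed here and is not part of
    DDI 2.0; the variable names and IDs are illustrative only.

    ```xml
    <var name="v2">
      <labl>Occupation - minor group</labl>
      <catgry ID="v2_mingrp4">
        <catValu>mingrp4</catValu>
        <labl>Architecture and engineering occupations</labl>
      </catgry>
    </var>
    <var name="v3">
      <labl>Occupation - unit group</labl>
      <!-- equiv (proposed attribute) points at the upper-level
           category held in the other variable, v2 -->
      <catgryGrp ID="v3_mingrp4" equiv="v2_mingrp4" catgry="v3_C1 v3_C2">
        <labl>Architecture and engineering occupations</labl>
      </catgryGrp>
      <catgry ID="v3_C1">
        <catValu>1</catValu>
        <labl>Architects</labl>
      </catgry>
      <catgry ID="v3_C2">
        <catValu>2</catValu>
        <labl>Engineers</labl>
      </catgry>
    </var>
    ```

    Because IDs are unique across the whole document, the IDREF
    in v3 can resolve to a category declared under v2 without
    any extra linking element.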

     
  • Wendy Thomas

    Wendy Thomas - 2004-11-10

    Summary of discussion to date

     
  • Mark Diggory

    Mark Diggory - 2004-11-10

    Logged In: YES
    user_id=208348

    It just seems that you end up with "replicated" information.
    Given that it's in a "var", and vars end up consuming most of
    the space in a DDI document, I would avoid any sort of
    unnecessary replication. In the case of a highly nested
    category structure, you're basically talking about doubling
    the size of a category section in a var. Neither of the
    strategies that allow categories to be nested or category
    groups to act as categories has this redundancy.

    <catgryGrp ID="CG1" equiv="C1" catGrp="CG2" catgry="C2" level="1">
      <labl>Households</labl>
    </catgryGrp>
    <catgry ID="C1">
      <catValu>1</catValu>
      <labl>Households</labl>
    </catgry>

    vs

    <catgryGrp ID="C1" catGrp="CG2" catgry="C2" level="1">
      <labl>Households</labl>
      <catValu>1</catValu>
    </catgryGrp>

    In fact one could probably absorb the catGrp and catgry
    attributes into a single attribute

    <catgryGrp ID="C1" catGrp="CG2" catgry="C2" level="1">
      <labl>Households</labl>
      <catValu>1</catValu>
    </catgryGrp>

    -Mark

     
  • Wendy Thomas

    Wendy Thomas - 2004-11-10

    Logged In: YES
    user_id=979766

    I actually end up with this level of replication as is. I describe
    nCubes that add to their universe. In many of the files I work
    with, there are no data points within a table for totals and
    subtotals. These are separate tables/matrices/nCubes.

    In programming terms <catgryGrp> and <catgry> have been
    dealt with as two separate animals. <catgry> has data
    attached to it; <catgryGrp> does not. Your suggestion would
    turn <catgryGrp> into something that was sometimes one
    thing and sometimes another. I'm really uncomfortable with
    that.

     