DDI Alliance / 2. Technical Review / #2 reinstating nested categories

Sanda Ionescu - 2004-10-14

Proposal for Reinstating Nested Categories in the DDI

NestedCategories4.doc


Sanda Ionescu - 2004-10-15

assigned_to: nobody --> sandai

Tom Piazza - 2004-10-18

labels: 608044 --> 608046

milestone: --> 374394

assigned_to: sandai --> wlthomas

Sanda Ionescu - 2004-10-18

labels: 608046 --> Validating Technical Revision

milestone: 374394 --> [SRG] Structural Reform Group

I-Lin Kuo - 2004-10-22

Logged In: YES
user_id=298249

IK:
There are essentially 3 parts to this proposal:
- markup of type A hierarchical categories
- markup of type B hierarchical categories
- specification of <catLevel>

For type B hierarchical categories, I recommend that it be
accepted as is. It covers the desired use case in a simple
and easily understood manner.

For type A hierarchical categories, I find the use of IDREFs
technically unnecessary, though it probably fits the
sponsors' data. As such, it may yet be adopted as an interim
measure in 2.0. However, I have submitted a proposal for
hierarchical categories for 3.0, and I feel that the SRG
should ask the sponsors to adopt my recommendation for Type
A markup for 3.0.

This specification of <catLevel>, as it currently stands,
may be of only limited use due to irregularity of the
hierarchy. For example, suppose we have a study with
detailed information about the U.S. and less detailed
information about other countries. A hierarchy might be the
following:

-- United States
   -- Alaska
      -- Anchorage
   -- Michigan
      -- Detroit
      -- Lansing
      -- Flint
-- Canada
   -- Toronto
   -- Vancouver

In this example, there is clearly a country level and a city
level. While we could specify <catLevel levelnm="countries"
levelno="1"/>, we would have a problem with <catLevel
levelnm="cities" levelno=?/> because the level number for
cities is not well-defined due to the irregularity. Yet, it
is clear that a cities level does exist.

This type of hierarchy should be fairly common.
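
The irregularity can be checked mechanically. The following is only a sketch: the nested dict is a hypothetical encoding of the listing above (it is not DDI markup), and it assumes that states sit under the United States while Canadian cities sit directly under Canada.

```python
# Hypothetical encoding of the example hierarchy; not DDI markup.
hierarchy = {
    "United States": {
        "Alaska": {"Anchorage": {}},
        "Michigan": {"Detroit": {}, "Lansing": {}, "Flint": {}},
    },
    "Canada": {"Toronto": {}, "Vancouver": {}},
}


def depths(tree, level=1):
    """Yield (label, depth) for every node; top-level nodes have depth 1."""
    for label, children in tree.items():
        yield label, level
        yield from depths(children, level + 1)


cities = {"Anchorage", "Detroit", "Lansing", "Flint", "Toronto", "Vancouver"}
city_depths = {depth for label, depth in depths(hierarchy) if label in cities}
print(sorted(city_depths))  # [2, 3]
```

Under these assumptions the cities land at two different depths, so no single levelno value describes the "cities" level.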

I would recommend that the <catLevel> specification be sent
back to the author/sponsors for further revision, and that
adoption be delayed until the above example can be accommodated.


Sanda Ionescu - 2004-10-22

Logged In: YES
user_id=1134872

With regard to I-Lin's comment, I need to correct a
misunderstanding: "Type A" category groupings, as
exemplified in my proposal, are NOT hierarchies - that is a
central point of my argumentation.
Also, markup of "Type A" groupings is NOT part of my
proposal - this type of markup is ALREADY ENABLED in the
DDI, in fact right now it is the only possible way to mark up
any kind of groupings.
I am only using it as an example to show that this kind of
markup is inappropriate for hierarchies.
It is incorrect to suggest that this type of markup "may be
adopted as an interim measure" in V 2.0, because it is already
in V 2.0. What I'm suggesting is to merely leave it in for the
time being.


Pascal Heus - 2004-11-03

Logged In: YES
user_id=918403

A few thoughts and comments:

1) For the purpose of nesting categories, I would rather for
now keep using the catGrp attribute of the catgryGrp
element. This is consistent with variable grouping and we
should have one coherent methodology across the DDI.
Taking a different approach may create confusion.
Furthermore, removing the catGrp element from the
specification in a minor revision implies that all existing
documents must be upgraded and that the XSL transforms
and applications need to be revised to take both cases into
account. I actually like the idea of nesting but this may need
to wait for the major 3.0 version.

2) I'm not particularly in favor of numbering levels; this may
be difficult to maintain and can lead to inconsistencies. I'm
more in favor of having the application determine what
the levels are. I would rather add a universe
or definition attribute to the catgryGrp element (though the
text element could already be used).

3) Another use case that should be taken into account by
this proposal is the one we briefly discussed during the SRG
meeting, in which a variable can have multiple sets of categories.
A classic case is age, which can be a continuous variable but
is also often classified into several age groups. If you take, for
example, the US Census, you'll find age categorized as "Age in
5-year groups", "Age in 10-year groups" and "Age 10-54 in 5-year
groups". Income groups can be used as another case.
This functionality is supported by the free CSPro software
(http://www.census.gov/ipc/www/cspro/) used by many
national statistical offices for data entry and processing.

4) Looking at the case B, could it be described as follows by
defining groups and categories with the same label? This does
not require nesting and each category can have its statistics.

<catgryGrp ID="G0" catgry="C1"
catGrp="G1">Occupation</catgryGrp>
<catgryGrp ID="G1" catgry="C2 C5 C6 C7 C10 C11"
catGrp="G2 G7 G11">Management, professional and related
occupations</catgryGrp>
<catgryGrp ID="G2" catgry="C3 C4">Management
occupations</catgryGrp>
<catgryGrp ID="G7" catgry="C8 C9">Architecture and
engineering occupations</catgryGrp>
<catgryGrp ID="G11" catgry="C12 C13">Education, training
and library occupations</catgryGrp>
<catgry ID="C1">
<catValu>1</catValu>
<labl>Management, professional and related
occupations</labl>
</catgry>
<catgry ID="C2">
<catValu>2</catValu>
<labl>Management occupations</labl>
</catgry>
<catgry ID="C3">
<catValu>3</catValu>
<labl> Top executives</labl>
</catgry>
<catgry ID="C4">
<catValu>4</catValu>
<labl> Financial managers</labl>
</catgry>
<catgry ID="C5">
<catValu>5</catValu>
<labl> Business and financial operations occupations
</labl>
</catgry>
<catgry ID="C6">
<catValu>6</catValu>
<labl> Computer and mathematical
occupations</labl>
</catgry>
<catgry ID="C7">
<catValu>7</catValu>
<labl> Architecture and engineering occupations
</labl>
</catgry>
<catgry ID="C8">
<catValu>8</catValu>
<labl>Architects</labl>
</catgry>
<catgry ID="C9">
<catValu>9</catValu>
<labl>Engineers</labl>
</catgry>
<catgry ID="C10">
<catValu>10</catValu>
<labl> Legal occupations</labl>
</catgry>
<catgry ID="C11">
<catValu>11</catValu>
<labl> Education, training and library occupations
</labl>
</catgry>
<catgry ID="C12">
<catValu>12</catValu>
<labl>Teachers</labl>
</catgry>
<catgry ID="C13">
<catValu>13</catValu>
<labl>Librarians</labl>
</catgry>


Sanda Ionescu - 2004-11-03

Logged In: YES
user_id=1134872

My response to Pascal's comments:
1) I am NOT proposing removal of the catGrp element. My
proposal is validating. The catGrp element will continue to be
used for conceptual groupings (non-hierarchical).
Unfortunately, it cannot be used for hierarchical groupings -
see 4) below.

2) Any comments from Nesstar? They requested level naming
and numbering for building tables.

3) I think this is a goal for Version 3.0. The current proposal
is for a minor, validating change to Version 2.0.

4) In Pascal's markup example there is nothing that shows
that categories 3 and 4 (G2) are in fact subordinate to
category 2, and none other. In other words, G2 needs to be
linked directly to C2. In Pascal's example, G2 is linked to G1,
but NOT to C2. We need to be able to indicate exactly how
lower-level categories link up to higher level categories,
otherwise the hierarchy cannot be recreated.
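
The missing link can be seen by resolving the IDREFS in a fragment of Pascal's example. This is a sketch only: the `<var>` wrapper and the `referrers` helper are assumptions added so that the fragment parses and can be queried, and the fragment keeps just the elements needed for the point.

```python
import xml.etree.ElementTree as ET

# Fragment of Pascal's example, wrapped in a hypothetical <var> root.
FRAGMENT = """
<var>
  <catgryGrp ID="G1" catgry="C2 C5 C6 C7 C10 C11" catGrp="G2 G7 G11">Management, professional and related occupations</catgryGrp>
  <catgryGrp ID="G2" catgry="C3 C4">Management occupations</catgryGrp>
  <catgry ID="C2"><catValu>2</catValu><labl>Management occupations</labl></catgry>
  <catgry ID="C3"><catValu>3</catValu><labl>Top executives</labl></catgry>
  <catgry ID="C4"><catValu>4</catValu><labl>Financial managers</labl></catgry>
</var>
"""

root = ET.fromstring(FRAGMENT)


def referrers(target_id):
    """IDs of the catgryGrp elements whose catgry/catGrp IDREFS name target_id."""
    found = []
    for grp in root.iter("catgryGrp"):
        refs = grp.get("catgry", "").split() + grp.get("catGrp", "").split()
        if target_id in refs:
            found.append(grp.get("ID"))
    return found


g2 = root.find("catgryGrp[@ID='G2']")
g2_refs = g2.get("catgry", "").split() + g2.get("catGrp", "").split()

print(referrers("G2"))  # ['G1']
print(referrers("C2"))  # ['G1']
print("C2" in g2_refs)  # False
```

G2 and C2 are both children of G1, and nothing in the markup connects them to each other; only the matching label text suggests that C3 and C4 belong under C2.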


Wendy Thomas - 2004-11-03

Logged In: YES
user_id=979766

The current catgryGrp structure allows for creating
hierarchies through nesting by allowing a catgryGrp to
include one or more catgryGrp IDs as well as catgry IDs. The
problem, as I understand it, arises when data for multiple
levels of categories are presented as a single series (for
example geography, industrial codes, or occupation codes).

At present (in terms of the nCube structure) you have two
options. One is to describe both levels separately and use
matching labels on the equivalent catgrys in the upper
level description and the catgryGrp label of the more detailed
level. While this allows for data to be reported for all
categories regardless of level, and for the detailed levels to
be related to their larger groups, there is no mechanism for
clearly identifying those equivalency relationships between the
catgryGrp of one var and the catgry of another var.

The other option is to make a single variable with all
categories listed as equivalents...but you lose the nesting
pattern.

If you then use the catgryGrp element to indicate groupings
of level 2 or level 3 categories you still are not addressing
those categories that serve as BOTH a category with a data
item attached to it and as a subgroup heading.

Example:

<var>
<labl>Households</labl>
<catgryGrp ID="CG1" catGrp="CG2" catgry="C2" level="1">
<labl>Households</labl>
</catgryGrp>
<catgryGrp ID="CG2" catgry="C4 C5 C6" level="2">
<labl>Family Households</labl>
</catgryGrp>
<catgry ID="C1">
<catValu>1</catValu>
<labl>Households</labl>
</catgry>
<catgry ID="C2">
<catValu>2</catValu>
<labl>Nonfamily Households</labl>
</catgry>
<catgry ID="C3">
<catValu>3</catValu>
<labl>Family Households</labl>
</catgry>
<catgry ID="C4">
<catValu>4</catValu>
<labl>Married Couple Household</labl>
</catgry>
<catgry ID="C5">
<catValu>5</catValu>
<labl>Male headed family, no wife present</labl>
</catgry>
<catgry ID="C6">
<catValu>6</catValu>
<labl>Female headed family, no wife present</labl>
</catgry>
</var>

The visual output of this would look something like:

Households xxxx
Households:
   Nonfamily xxxx
   Family xxxx
   Family:
      Married Couple xxxx
      Male head xxxx
      Female head xxxx

Understandable to the human but too loose for the computer
to then manipulate the data easily.
So what's lacking is the ability to say that CG1 is the equivalent of
C1 and that CG2 is the equivalent of C3.

Am I missing something here in terms of the problem?

wendy


Sanda Ionescu - 2004-11-03

Logged In: YES
user_id=1134872

Yes, Wendy, thank you. That is exactly the problem. Sanda.


I-Lin Kuo - 2004-11-04

Logged In: YES
user_id=298249

1) Regarding Pascal's 2nd point:
--------------------------------
I agree with Pascal that in general, a level is a calculated
attribute. Thus,

2) Regarding Pascal's 3rd point:
--------------------------------
While the idea of a virtual recode was discussed, that was
only to pin down the concept. For those who were not there,
a virtual recode can be thought of as a recoded variable but
without physical data -- its data being derived from another
variable's data. This is useful when it is desired to
redisplay a continuous variable such as date in discrete
date ranges without having to actually create a column. I
agree that conceptually, the catlevel part of this proposal
should fall under this concept. However, no actual
implementation mechanism was discussed for this.
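
As a rough illustration of the virtual recode idea (not a proposed DDI mechanism), a recode can be a function evaluated on demand rather than a stored column. The function name, the cut points and the labels below are all hypothetical.

```python
from bisect import bisect_right


def virtual_recode(value, lower_bounds, labels):
    """Return the label of the range that value falls in, computed on demand."""
    index = bisect_right(lower_bounds, value) - 1
    if index < 0:
        raise ValueError(f"{value} is below the first range")
    return labels[index]


# Hypothetical 10-year age groups; the cut points are illustrative only.
bounds = [0, 10, 20, 30]
labels = ["0-9", "10-19", "20-29", "30+"]
ages = [4, 10, 29, 61]

print([virtual_recode(age, bounds, labels) for age in ages])
# ['0-9', '10-19', '20-29', '30+']
```

Several such label sets (for example 5-year and 10-year groups) could be applied to the same continuous variable without adding any physical data.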

3) Regarding Pascal's 4th point:
--------------------------------
"Looking at the case B, could it be described as follows by
defining groups and categories with the same label? This
does not require nesting and each category can have its
statistics."

While not directly pertinent to the proposal, there is a
design viewpoint expressed here that I disagree with -- the
preference of linking to nesting. While linking is a more
powerful mechanism, nesting is a simpler and more robust
mechanism. Linking IDs should really only be generated by
software. Hand-edits of markup with linking can easily trash
the linking structure without errors being detected, whereas
hand-edits of nested structures are less error-prone, and
errors may easily be detected and corrected. Thus, my
preference is to avoid linking in those situations where its
power is not needed. So when both linking and nesting
mechanisms can accomplish the same goal, I would prefer
nesting to linking. This is especially true for hierarchical
relationships such as nested categories. For
non-hierarchical relationships such as the relationship
between variable and question, linking is unavoidable.

4) Regarding Wendy's comment:
-----------------------------
"If you then use the catgryGrp element to indicate groupings
of level 2 or level 3 categories you still are not
addressing those categories that serve as BOTH a category
with a data item attached to it and as a subgroup heading."

That's not a problem with the revised mechanism which I
propose in "Hierarchical Categories in 3.0". In that
proposal, the key to resolving this is that catgryGrp should
not be used to indicate whether it has subcategories, precisely
because of the dilemma Wendy raised. In that proposal, both
catgryGrp and catgry may be nested, and the difference
between a catgryGrp and catgry is whether or not it has a
<catValu> subelement. In that proposal, both Type A and Type
B are treated in a uniform manner.

In Wendy's example:

Households xxxx
Households:
   Nonfamily xxxx
   Family xxxx
   Family:
      Married Couple xxxx
      Male head xxxx
      Female head xxxx

This could be marked up in two ways:

1. With catgryGrps which exactly duplicate the display above:
<catgry>
  <catValu></catValu>
  <labl>Households</labl>
  <catgryGrp>
    <labl>Households:</labl>
    <catgry>
      <catValu></catValu>
      <labl>Nonfamily</labl>
    </catgry>
    <catgry>
      <catValu></catValu>
      <labl>Family</labl>
    </catgry>
    <catgryGrp>
      <labl>Family:</labl>
      <catgry>
        <catValu></catValu>
        <labl>Married Couple</labl>
      </catgry>
      <catgry>
        <catValu></catValu>
        <labl>Male head</labl>
      </catgry>
      <catgry>
        <catValu></catValu>
        <labl>Female head</labl>
      </catgry>
    </catgryGrp>
  </catgryGrp>
</catgry>

I don't think this is desired because the Family and
Households labels are redundant. Or it may be marked up in
the following way:

<catgry>
  <catValu></catValu>
  <labl>Households</labl>
  <catgry>
    <catValu></catValu>
    <labl>Nonfamily</labl>
  </catgry>
  <catgry>
    <catValu></catValu>
    <labl>Family</labl>
    <catgry>
      <catValu></catValu>
      <labl>Married Couple</labl>
    </catgry>
    <catgry>
      <catValu></catValu>
      <labl>Male head</labl>
    </catgry>
    <catgry>
      <catValu></catValu>
      <labl>Female head</labl>
    </catgry>
  </catgry>
</catgry>

which would display the following without the redundancies:

Households xxxx
   Nonfamily xxxx
   Family xxxx
      Married Couple xxxx
      Male head xxxx
      Female head xxxx

Please read my proposal for further details


Ken Miller - 2004-11-04

Logged In: YES
user_id=1145404

The fundamental difference as I see it is that in Sanda's TYPE
B example ALL levels of the hierarchy are VALID category
values.
This is best described using the potentially reinstated nested
<catgry> WITHOUT using <catgryGrp> at all (see Sanda's
markup).

-Management, professional and related occupations (Catgry C1)
   -Management occupations (Catgry C2)
      -Top executives (Catgry C3)
      -Financial managers (Catgry C4)
   -Business and financial operations occupations (Catgry C5)
   -Computer and mathematical occupations (Catgry C6)
   -Architecture and engineering occupations (Catgry C7)
      -Architects (Catgry C8)
      -Engineers (Catgry C9)
   -Legal occupations (Catgry C10)
   -Education, training and library occupations (Catgry C11)
      -Teachers (Catgry C12)
      -Librarians (Catgry C13)

The TYPE B example above could be converted to Sanda's
TYPE A (see below), where the 2 higher levels DO NOT have
actual category values but are there just for clarification.
This is best described using the existing <catgryGrp>, using
the catGrp attribute for the "Management, professional and
related occupations" grouping and the catgry attribute for the
2nd-level groupings (e.g. "Management occupations").

-Management, professional and related occupations
   -Management occupations
      -Top executives (Catgry C1)
      -Financial managers (Catgry C2)
   -Business and financial operations occupations
      -Accountants (Catgry C3)
   -Computer and mathematical occupations
      -Programmers (Catgry C4)
   -Architecture and engineering occupations
      -Architects (Catgry C5)
      -Engineers (Catgry C6)
   -Legal occupations
      -Lawyers (Catgry C7)
   -Education, training and library occupations
      -Teachers (Catgry C8)
      -Librarians (Catgry C9)

Therefore I don't see the need to complicate matters by
introducing the levelNo attribute, as this can be determined
from the nesting. Similarly, adding a <catLevel> element to
describe the hierarchical levels (e.g. Country / State / County /
City / etc.) only complicates it further.


Wendy Thomas - 2004-11-04

Logged In: YES
user_id=979766

In terms of a technical review we have 3 basic questions to
ask:
1) does the proposed solution adequately address the problem
described?
2) does it conflict with the conceptual model (version 2.0)?
3) does it create ambiguities with existing XML instances (this
is in terms of application of the current DTD rather than a
question of invalidation)?

My thoughts:
1) The only type of variable that would be described
differently under this proposal is one with nested
categories in aggregate data files...OR would this also apply to
some geographic codes in microdata sets? I think we need to
be very clear about its intended use.

If it is the former then, yes, this addresses the problem.

2) My concern here is with the introduction of a <catLevel>
tag. We have a similar situation in <catgryGrp>, which
HAS levelNo and levelNm as attributes. I
would propose that we add this information at the <catgry>
level and that it be done in a manner consistent with <catgryGrp>.

3) We are creating 2 ways to address nested categories.
There are numerous sets of metadata that use the nCube
with the following rule: "In an additive nCube, the sum of all
the cells should equal the universe". This means that a nested
hierarchy is described as multiple nCubes (one for each level).
This was done because we are describing the data contents
rather than the layout. It was just this conflict that resulted
in the decision to remove the "nested category" option from
version 1.3.
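
The conflict with the additivity rule can be shown with a small check. This is a sketch with invented counts (they come from no real table), and `is_additive` is a hypothetical helper, not part of any DDI tool.

```python
# Counts are invented for illustration; they are not from any real table.
cells = {
    "Households": 100,
    "Nonfamily": 30,
    "Family": 70,
    "Married Couple": 50,
    "Male head": 8,
    "Female head": 12,
}
universe = 100


def is_additive(labels):
    """True when the named cells sum exactly to the universe."""
    return sum(cells[label] for label in labels) == universe


# All levels in one series: totals and subtotals are counted repeatedly.
print(is_additive(list(cells)))              # False (the sum is 270)
# One level at a time, as in the multiple-nCube description.
print(is_additive(["Nonfamily", "Family"]))  # True
```

A single series that mixes totals, subtotals and detail categories breaks the rule, which is why each level ends up as its own nCube under the current practice.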

wendy


I-Lin Kuo - 2004-11-04

Logged In: YES
user_id=298249

Re: wlthomas 2004-11-04 09:52

Re 2) I'm not sure that it is possible to add the @levelno
and @levelnm to <catgry> in a way consistent with
<catgrygrp>. From the examples that I have seen, @levelno
and @levelnm refer to the levels within the primary
hierarchy that the <catgrygrp> belongs to. In the situation
pertaining to Wendy's suggestion and to the <catLevel>
proposal, the @levelno and @levelnm belong to a second
external but related hierarchy, if I understand correctly.

a) I should also like to make the point that while this is
submitted as a single proposal, it is actually three
proposals in one (Type A markup, Type B markup, <catLevel>).
Each of these can stand on its own and does not depend on
the others to work. As such, I think actions should be
applied to each separately. Thus, I'll reiterate my stance:

Type A markup: suggest I-Lin's revision be used so that both
Type A and Type B are handled uniformly.
Type B markup: accept as is
<catLevel>: reject or ask for revision to address concerns


Mark Diggory - 2004-11-04

Logged In: YES
user_id=208348

I agree with Sanda in recognizing the need to
represent nested categories more cleanly. I think there are
actually three alternatives for solving this:

1.) Nest "categories" as Sanda has described. (The problem is
that more complex category structures cannot be
described as a tree; how are these handled? In the DDI, the
strategy for a solution to this has been (2).)

2.) Link "categories" in a similar fashion as used in other
areas of the DDI using ID/IDREFS.

3.) There is an alternate strategy: Redefine catGrps to also
be categories in their own right simply by adding <catValu>
to <catgryGrp>.

<var>
<labl>Households</labl>

<catgryGrp ID="CG1" catGrp="CG2" catgry="C2" level="1">
<labl>Households</labl>
*<catValu>1</catValu>* 
</catgryGrp>

<catgryGrp ID="CG2" catgry="C4 C5 C6" level="2">
<labl>Family Households</labl>
*<catValu>3</catValu>* 
</catgryGrp>

<catgry ID="C2">
<catValu>2</catValu>
<labl>Nonfamily Households</labl>
</catgry>

<catgry ID="C4">
<catValu>4</catValu>
<labl>Married Couple Household</labl>
</catgry>

<catgry ID="C5">
<catValu>5</catValu>
<labl>Male headed family, no wife present</labl>
</catgry>

<catgry ID="C6">
<catValu>6</catValu>
<labl>Female headed family, no wife present</labl>
</catgry>

</var>

this results in:

Households: xxxx
   Nonfamily xxxx
   Family: xxxx
      Married Couple xxxx
      Male head xxxx
      Female head xxxx

instead of:

Households xxxx
Households:
   Nonfamily xxxx
   Family xxxx
   Family:
      Married Couple xxxx
      Male head xxxx
      Female head xxxx

There is then still only one way to define a category
hierarchy, and the presence or absence of a "catValu" defines
whether the <catgryGrp> is itself a category as well.
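
A reader of such markup could apply the presence/absence rule roughly as follows. This is a sketch, not an existing DDI tool: the `render` helper is hypothetical, the markup follows the example above with `<labl>` tags, and listing catgry references before catGrp references is an assumption about display order.

```python
import xml.etree.ElementTree as ET

# The example above: groups that carry a <catValu> are also categories.
VAR = """
<var>
  <catgryGrp ID="CG1" catGrp="CG2" catgry="C2" level="1"><labl>Households</labl><catValu>1</catValu></catgryGrp>
  <catgryGrp ID="CG2" catgry="C4 C5 C6" level="2"><labl>Family Households</labl><catValu>3</catValu></catgryGrp>
  <catgry ID="C2"><catValu>2</catValu><labl>Nonfamily Households</labl></catgry>
  <catgry ID="C4"><catValu>4</catValu><labl>Married Couple Household</labl></catgry>
  <catgry ID="C5"><catValu>5</catValu><labl>Male headed family, no wife present</labl></catgry>
  <catgry ID="C6"><catValu>6</catValu><labl>Female headed family, no wife present</labl></catgry>
</var>
"""

root = ET.fromstring(VAR)
by_id = {el.get("ID"): el for el in root}


def render(el_id, depth=0, out=None):
    """Collect display lines; 'xxxx' marks elements that carry a <catValu>."""
    out = [] if out is None else out
    el = by_id[el_id]
    is_group = el.tag == "catgryGrp"
    has_value = el.find("catValu") is not None
    line = "  " * depth + el.findtext("labl")
    line += ":" if is_group else ""
    line += " xxxx" if has_value else ""
    out.append(line)
    if is_group:
        # Assumed order: referenced categories first, then referenced groups.
        for child in el.get("catgry", "").split() + el.get("catGrp", "").split():
            render(child, depth + 1, out)
    return out


print("\n".join(render("CG1")))
```

A group without a `<catValu>` would simply be printed as a heading with no data marker.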

-Mark


Mark Diggory - 2004-11-04

Logged In: YES
user_id=208348

I'd also like to point out that in I-Lin's revision the
capability of categoryGroup to act as a grouping is lost
when nested categories are used.

<catgry>
  <catValu></catValu>
  <labl>Households</labl>
  <catgry>
    <catValu></catValu>
    <labl>Nonfamily</labl>
  </catgry>
  <catgry>
    <catValu></catValu>
    <labl>Family</labl>
    <catgry>
      <catValu></catValu>
      <labl>Married Couple</labl>
    </catgry>
    <catgry>
      <catValu></catValu>
      <labl>Male head</labl>
    </catgry>
    <catgry>
      <catValu></catValu>
      <labl>Female head</labl>
    </catgry>
  </catgry>
</catgry>

This capability would be maintained if, instead of throwing
out categoryGroup in 3.0, it were maintained and made an
extension of category. Then, in his example, the duplication
would be removed by removing the duplicate categories, not
the categoryGroups.

<catgryGrp>
  <labl>Households:</labl>
  *<catValu></catValu>*
  <catgry>
    <catValu></catValu>
    <labl>Nonfamily</labl>
  </catgry>
  <catgryGrp>
    <labl>Family:</labl>
    *<catValu></catValu>*
    <catgry>
      <catValu></catValu>
      <labl>Married Couple</labl>
    </catgry>
    <catgry>
      <catValu></catValu>
      <labl>Male head</labl>
    </catgry>
    <catgry>
      <catValu></catValu>
      <labl>Female head</labl>
    </catgry>
  </catgryGrp>
</catgryGrp>

This could then still capture both A and B type Category
structures.

-Mark


I-Lin Kuo - 2004-11-04

Logged In: YES
user_id=298249

"I'd also like to point out that in I-Lin's revision the
capability of categoryGroup to act as a grouping is lost
when nested categories are used."

Actually, I don't believe it's lost. I am simply reserving
the grouping of <catgryGrp> only to those instances when
<catgryGrp> is not itself a category. When the grouping is
itself a category (ie has a <catValu>), I use a <catgry> to
do the grouping.

In any case, I think Mark's alternative revision is
technically equivalent to mine in representative power. It
requires the user to understand that a <catgryGrp> with a
<catValu> is also a <catgry>. I think that his rule is
slightly more intuitive than my rule: a <catgry> group
always has a <catValu> while a <catgryGrp> never does.

I would be fine with either revision, but I think a revision
is needed.


Sanda Ionescu - 2004-11-05

Logged In: YES
user_id=1134872

I'm in a rush and can't respond properly to all these
comments, but would just like to emphasize that if a decision
is taken to make everything nestable, I prefer I-Lin's model
(where groups are NOT assigned values). Thank you all,
Sanda.


Achim Wackerow - 2004-11-08

Logged In: YES
user_id=1145408

At the risk of repeating similar comments, here are my thoughts.

I tend toward the view that category hierarchies belong to a
presentational layer of a table. Then the question
arises: do we want to have a presentational layer in DDI? The
DDI should concentrate on information about data, not on the
presentation of the data.

Regarding the occupation example, the information on category
value and category hierarchy could be stored in a clean way
in separate variables without changing DDI version 2.0;
in the example this information is stored in a mixed form in
one variable. Actually, each category hierarchy level
represents one level of information, which could be stored
in a separate variable. In the example we have three levels
(variables) for the concept occupation: major group, minor
group, and unit group (this structure is found not only in
aggregated data but also in microdata, e.g. the ISCO occupation
code). Based on this information an application could
rebuild the category hierarchies for display purposes.

Aside from my notes above: in the logic of the proposal I
don't see the necessity for the element catLevel. An XSL
stylesheet could redisplay the hierarchy when processing the
category hierarchy in a recursive way (a computed level).
The proposal is lacking an example which demonstrates the
necessity of catLevel combined with the nested category
approach.
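
The computed level can be sketched in a few lines. Python is used here in place of an XSL stylesheet, the `levels` helper is hypothetical, and the markup is the nested `<catgry>` form from earlier in this thread with the empty `<catValu>` elements left out.

```python
import xml.etree.ElementTree as ET

# Nested <catgry> markup as discussed earlier in the thread.
NESTED = """
<catgry><labl>Households</labl>
  <catgry><labl>Nonfamily</labl></catgry>
  <catgry><labl>Family</labl>
    <catgry><labl>Married Couple</labl></catgry>
    <catgry><labl>Male head</labl></catgry>
    <catgry><labl>Female head</labl></catgry>
  </catgry>
</catgry>
"""


def levels(catgry, level=1):
    """Yield (label, level) pairs, deriving the level from nesting depth."""
    yield catgry.findtext("labl"), level
    for child in catgry.findall("catgry"):
        yield from levels(child, level + 1)


for label, level in levels(ET.fromstring(NESTED)):
    print("  " * (level - 1) + f"{label} (level {level})")
```

The level number falls out of the recursion, so nothing in the markup has to state it; a level name, however, cannot be derived this way.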

In general I am not sure if this is the right point in time
to make a change to the DDI version 2.0 according to this
proposal. We should concentrate on DDI version 3.0 to build
a clean new model to document microdata and aggregated data.


I-Lin Kuo - 2004-11-09

Logged In: YES
user_id=298249

re Achim...

I'm not sure that the hierarchies are at a pure presentation
level. For instance, the data may have a country column and
a highest-degree-of-education column with values
"U.S.|Germany" and "HighSchool|College|Gymnasium" and yet
not indicate that Gymnasium is subordinate to Germany. This
would be part of the metadata, no?

While true that the occupation hierarchy might be stored as
three separate variables -- major group, minor group, and
unit group -- I think it is incorrect to say that because
the data may be stored in this way, we can rebuild the
hierarchy.

Two reasons:

a) The first reason is philosophical. We are striving
toward a logical representation of the data independent of
the physical storage, and so Achim's argument tends
dangerously (at least it appears to me) toward having DDI
dictate the means of storage. Absent the mandatory
creation of these three variables, I cannot think of another
way in which the hierarchies can be constructed.

b) The second more subtle reason is that even if these three
variables are provided, there still must be additional
information to reconstruct a hierarchy. From a computer's
point of view, just because in the existing data the unit
group "Architect" always occurs with the minor group
"Architecture and engineering occupations" and never with
the minor group "Legal Occupations", does not mean the
latter pairing is impossible. This is a very important point
as there are other hierarchies where unit group labels may
indeed be shared. This conceptual prohibition against the
"Architect-legal occupations" pairing in any future data is
an aspect of the metadata that cannot be inferred from the
data, and thus requires some kind of markup in addition to
the existence of the variables themselves.
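
Point b) can be illustrated with a small sketch. The rows, the `declared_parent` mapping and the "Surveyor" unit group are all hypothetical; the aim is only to contrast what observed data can say with what declared metadata can say.

```python
# Hypothetical observed rows: (minor group, unit group).
observed = [
    ("Architecture and engineering occupations", "Architect"),
    ("Architecture and engineering occupations", "Engineer"),
    ("Legal occupations", "Lawyer"),
]
# All the data can tell us: which pairs have occurred so far.
seen_pairs = set(observed)

# What metadata can add: an explicit parent for every unit group.
declared_parent = {
    "Architect": "Architecture and engineering occupations",
    "Engineer": "Architecture and engineering occupations",
    "Surveyor": "Architecture and engineering occupations",
    "Lawyer": "Legal occupations",
}


def allowed_by_metadata(minor, unit):
    """True when the declared hierarchy places unit under minor."""
    return declared_parent.get(unit) == minor


unseen_but_valid = ("Architecture and engineering occupations", "Surveyor")
forbidden = ("Legal occupations", "Architect")

# The data gives the same answer for both pairs: not seen yet.
print(unseen_but_valid in seen_pairs, forbidden in seen_pairs)  # False False
# Only the declared hierarchy tells them apart.
print(allowed_by_metadata(*unseen_but_valid))  # True
print(allowed_by_metadata(*forbidden))         # False
```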


Achim Wackerow - 2004-11-09

Logged In: YES
user_id=1145408

re I-Lin

In general I would like to have a clear and clean storage of
data alongside a good description of the data. From my
experience of analyzing data I got the impression that
combined/mixed variables often cause problems for
researchers. I know that this intention regarding the data
could go beyond the purpose of DDI - the description of the
data - so this attitude could be problematic.

I do not feel comfortable with the idea that DDI should
provide means of description for every kind of mixed
variable representation (an extreme example is the multi-punch
data format, more common in older days). From the perspective of
an analyst it would be preferable to have one variable
for one concept and a description of how the variables
depend on each other; ideally the archive processes the
data to this clean form and describes it. From the perspective
of an archivist it could perhaps also make sense to describe
the data in whatever form. I tend toward the attitude
that archives/data providers should provide data and
metadata in a form which can be used easily, not only
archived; that's apparently more the view of an
analyst/user. The question will be: where between the
analyst's and the archivist's positions do we want to position DDI?

Regarding the occupation example: it is correct that from a
computer's point of view every combination between major and
minor group could be possible. The dependency between these
variables is often described by an external classification
scheme. In DDI version 2.0 the dependency could be described
with a combination of catgryGrp and catgry referring to IDs
of other variables (probably a slight misuse of the
intention of the IDREFS attribute of these elements, thus a
dirty solution).

<var name="v2">
<labl>Occupation - minor group</labl>
<catgryGrp catgry="v2_mingrp4" catGrp="v3_mingrp4"/>
<catgry ID="v2_mingrp4">
<catValu>mingrp4</catValu>
<labl>Architecture and engineering occupations</labl>
</catgry>
</var>
<var name="v3">
<labl>Occupation - unit group</labl>
<catgryGrp ID="v3_mingrp4" catgry="v3_C1 v3_C2">
<labl>Architecture and engineering occupations</labl>
</catgryGrp>
<catgry ID="v3_C1">
<catValu>1</catValu>
<labl>Architects</labl>
</catgry>
<catgry ID="v3_C2">
<catValu>2</catValu>
<labl>Engineers</labl>
</catgry>
</var>


Wendy Thomas - 2004-11-10

Logged In: YES
user_id=979766

If all we are trying to accomplish is identifying the relationship
between the catgryGrp and the catgry for those variables
that have data values for various levels of a hierarchy why
not keep it simple:

Add attribute "equiv" to <catgryGrp> as an IDREF to the
<catgry>

This accomplishes the following:

1. Allows the category group to both be equivalent to
a category in hierarchies and to contain child categories

Households <data>
   Nonfamily <data>
   Family <data>
      Married Couple <data>
      Male head <data>
      Female head <data>

<var>
<labl>Households</labl>
<catgryGrp ID="CG1" equiv="C1" catGrp="CG2" catgry="C2"
level="1">
<labl>Households</labl>
</catgryGrp>
<catgryGrp ID="CG2" equiv="C3" catgry="C4 C5 C6" level="2">
<labl>Family Households</labl>
</catgryGrp>
<catgry ID="C1">
<catValu>1</catValu>
<labl>Households</labl>
</catgry>
<catgry ID="C2">
<catValu>2</catValu>
<labl>Nonfamily Households</labl>
</catgry>
<catgry ID="C3">
<catValu>3</catValu>
<labl>Family Households</labl>
</catgry>
<catgry ID="C4">
<catValu>4</catValu>
<labl>Married Couple Household</labl>
</catgry>
<catgry ID="C5">
<catValu>5</catValu>
<labl>Male headed family, no wife present</labl>
</catgry>
<catgry ID="C6">
<catValu>6</catValu>
<labl>Female headed family, no wife present</labl>
</catgry>
</var>

2. Provides option of levelno and levelnm already
present in <catgryGrp>
3. Allows linking between separate variable descriptions
by pointing from the <catgryGrp> of the variable describing
the lower level of a hierarchy to the <catgry> of a variable
providing the upper level of the hierarchy as an equivalent
4. Retains a consistent means of describing hierarchies
regardless of whether data is provided for the total or sub-
totals (upper levels of hierarchies)
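
With such an attribute, an application could rebuild the category-to-category hierarchy along these lines. This is a sketch only: `equiv` is the attribute proposed in this comment and is not part of DDI 2.0, and the `parent` mapping is a hypothetical helper structure.

```python
import xml.etree.ElementTree as ET

# The example above, using the proposed (not yet existing) equiv attribute.
VAR = """
<var>
  <catgryGrp ID="CG1" equiv="C1" catGrp="CG2" catgry="C2"><labl>Households</labl></catgryGrp>
  <catgryGrp ID="CG2" equiv="C3" catgry="C4 C5 C6"><labl>Family Households</labl></catgryGrp>
  <catgry ID="C1"><catValu>1</catValu><labl>Households</labl></catgry>
  <catgry ID="C2"><catValu>2</catValu><labl>Nonfamily Households</labl></catgry>
  <catgry ID="C3"><catValu>3</catValu><labl>Family Households</labl></catgry>
  <catgry ID="C4"><catValu>4</catValu><labl>Married Couple Household</labl></catgry>
  <catgry ID="C5"><catValu>5</catValu><labl>Male headed family, no wife present</labl></catgry>
  <catgry ID="C6"><catValu>6</catValu><labl>Female headed family, no wife present</labl></catgry>
</var>
"""

root = ET.fromstring(VAR)
groups = {grp.get("ID"): grp for grp in root.iter("catgryGrp")}

# Map each category ID to the category ID of its parent.
parent = {}
for grp in groups.values():
    head = grp.get("equiv")  # the category this group stands for
    for cat_id in grp.get("catgry", "").split():
        parent[cat_id] = head
    for grp_id in grp.get("catGrp", "").split():
        # A nested group contributes the category it is equivalent to.
        parent[groups[grp_id].get("equiv")] = head

print(parent)
# {'C2': 'C1', 'C3': 'C1', 'C4': 'C3', 'C5': 'C3', 'C6': 'C3'}
```

Every category except the top one gets exactly one parent category, which is the information the label-matching approach could not provide.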


Wendy Thomas - 2004-11-10

Summary of discussion to date

TECHREV-NestedCatrgy.pdf


Mark Diggory - 2004-11-10

Logged In: YES
user_id=208348

It just seems that you end up with "replicated" information.
Given that it's in a "var", and vars end up consuming most of
the space in a DDI document, I would avoid any sort of
unnecessary replication. Given the case of a highly nested
category structure, you're basically talking about doubling
the size of a category section in a var. Neither of the
strategies that allow categories to be nested or category
groups to act as categories has this redundancy.

<catgryGrp ID="CG1" equiv="C1" catGrp="CG2" catgry="C2"
level="1">
<labl>Households</labl>
</catgryGrp>
<catgry ID="C1">
<catValu>1</catValu>
<labl>Households</labl>
</catgry>

vs

<catgryGrp ID="C1" catGrp="CG2" catgry="C2" level="1">
<labl>Households</labl>
<catValu>1</catValu>
</catgryGrp>

In fact one could probably absorb the whole catGrp/catgry
attributes into one attribute

<catgryGrp ID="C1" catGrp="CG2" catgry="C2" level="1">
<labl>Households</labl>
<catValu>1</catValu>
</catgryGrp>

-Mark


Wendy Thomas - 2004-11-10

Logged In: YES
user_id=979766

I actually end up with this level of replication as is. I describe
nCubes that add to their universe. In many of the files I work
with, there are no data points within a table for totals and
subtotals. These are separate tables/matrices/nCubes.

In programming terms <catgryGrp> and <catgry> have been
dealt with as two separate animals. <catgry> has data
attached to it; <catgryGrp> does not. Your suggestion would
turn <catgryGrp> into something that was sometimes one
thing and sometimes another. I'm really uncomfortable with
that.
