gdpdm-devel Mailing List for GDPDM
Brought to you by:
tcasstevens
You can subscribe to this list here.
2005 |
Jan
|
Feb
|
Mar
|
Apr
|
May
|
Jun
|
Jul
|
Aug
|
Sep
|
Oct
(6) |
Nov
(9) |
Dec
(1) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2006 |
Jan
(4) |
Feb
(4) |
Mar
|
Apr
(3) |
May
|
Jun
(6) |
Jul
(1) |
Aug
|
Sep
|
Oct
|
Nov
(1) |
Dec
|
From: Terry C. <tm...@co...> - 2006-11-16 15:44:04
|
Hi Everyone, Ed has updated the definition of the cdv_marker table in the GDPDM schema. Please take notice as you curate data into the schema. cdv_marker - This table defines the marker being scored. Markers are meiotically defined alleles that exist at a single locus. Typically, this will be a SNP, sequence repeat, or a sequence region that is polymorphic. In some cases, such as AFLP the sequence will not be known, but the band can be assigned name. This table does NOT describe the reagents used to score the marker (use div_allele_assay for that). Thanks, Terry |
From: Terry C. <tm...@co...> - 2006-07-07 19:16:12
|
Isaak, Regarding div_locality_id in tables div_obs_unit and div_accession_collecting... div_accession_collecting.div_locality_id is NOT required. So put NULL for unknown localities. div_obs_unit.div_locality_id IS required. So put as much information as you know in the div_locality table. Maybe you know the country at least. If you know nothing, make an "unknown" entry in the div_locality table for this purpose. Cheers, Terry ---------- Forwarded message from Isaak---------- > > 3. I am normalizing data in div_locality table. Not all germplasm > accessions have locality data. And, as I assign FK in the div_obs_unit or > div_accession_collecting tables, I find that some records in these tables > will not have FK in the div_locality_id field: since there is no locality > data related to them. > > What do I do? leave these records with out the FK??? please look at the > attached Fk file for illustration. > |
From: Ken Youens-C. <kc...@cs...> - 2006-06-22 21:00:08
|
On Jun 22, 2006, at 3:23 PM, Terry Casstevens wrote: > Its certainly not the most user friendly > thing, but it works pretty well. How do you see this > working in practice? Could Isaak for example use > this to maintain his development database? I think it could be used for very minor tweaks to the data that gets loaded in batches. If a curator needs to change a value in one record here and there, this tool allows it pretty easily. I still think batch tools are needed for big loads. > I don't see this being connected to the production > database on filetta (acorn is no problem of course)? No, certainly it won't be on filetta, just acorn, just connected to currently curated databases. ky |
From: Terry C. <tm...@co...> - 2006-06-22 20:23:54
|
Thanks Ken. This is cool! Its certainly not the most user friendly thing, but it works pretty well. How do you see this working in practice? Could Isaak for example use this to maintain his development database? I don't see this being connected to the production database on filetta (acorn is no problem of course)? Thanks, Terry On 6/22/06, Ken Youens-Clark <ky...@gm...> wrote: > On Jun 15, 2006, at 8:51 AM, Terry Casstevens wrote: > > > This looks like an interesting tools that we > > may be able to use to update/modify > > our GDPDM mySQL databases... > > > > > > http://sourceforge.net/projects/phpmyadmin > > http://www.phpmyadmin.net/home_page/index.php > > Terry, > > Thanks for the suggestion. It took me a few days to overcome some > installation problems, but I finally got PHPMyAdmin working with the > build 21 diversity databases on Gramene: > > http://dev.gramene.org/php/phpmyadmin/index.php > > Write me personally for the username/password to get in. It's a > pretty ugly interface but does indeed allow edits to the data on > acorn. I can't immediately find a way to control access to things > like dropping tables, nor do I see a way to create pretty forms for > the edit pages, but this is something, at least. > > FYI, it would be trivial to add other databases, e.g., if Panzea > wants to use this. Also, in case you care, I'm running this PHP > installation with an Apache 2 binary in "/usr/local/apache2" on acorn > and proxying all "/php" calls from the normal Apache 1 listening for > "dev.gramene.org" to this other server. I thought it would be best > to keep them separate. > > ky > |
From: Terry C. <tm...@co...> - 2006-06-22 20:19:11
|
Hi Isaak, How are things going with the new mySQL database that we created. Are you using the MS Access frontend to update your data now? At some point, Ken or I can give you the instructions how to create a mysqldump. Ken can use that to rebuild the Gramene Diversity databases on acorn and filetta. Thanks, Terry On 6/22/06, Ken Youens-Clark <kc...@cs...> wrote: > On Jun 22, 2006, at 2:19 PM, Isaak Yosief Tecle wrote: > > > In the new data set I am working on there are a few rice germaplasm > > accessions that are already in the diversity database. I don't have > > passport data different from what is in the database already for > > these germplasms.So how do I deal with them in the new data set > > when it comes to information (e.g. data that goes in the > > div_passport table or div_locality) that needs to be stored for > > them but is already in the database from previous curation. I > > suppose I don't have to curate the same information again and again > > for the same germplams. > > Right, well, this is the issue I raised with having accessions, > having the ability to pin-point just which existing record you mean > or defining when to create a new record. As it is, my import code > looks at the entire record you're supplying and checks the table in > question for a record matching it entirely; if one doesn't exist, a > new record is created. So you would just need to make sure that the > germplasm record in your input file matches field-by-field the data > currently in the db. > > ky > > Using Tomcat but need to do more? Need to support web services, security? > Get stuff done quickly with pre-integrated technology to make your job easier > Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo > http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642 > _______________________________________________ > Gdpdm-devel mailing list > Gdp...@li... > https://lists.sourceforge.net/lists/listinfo/gdpdm-devel > |
From: Ken Youens-C. <kc...@cs...> - 2006-06-22 19:25:01
|
On Jun 22, 2006, at 2:19 PM, Isaak Yosief Tecle wrote: > In the new data set I am working on there are a few rice germaplasm > accessions that are already in the diversity database. I don't have > passport data different from what is in the database already for > these germplasms.So how do I deal with them in the new data set > when it comes to information (e.g. data that goes in the > div_passport table or div_locality) that needs to be stored for > them but is already in the database from previous curation. I > suppose I don't have to curate the same information again and again > for the same germplams. Right, well, this is the issue I raised with having accessions, having the ability to pin-point just which existing record you mean or defining when to create a new record. As it is, my import code looks at the entire record you're supplying and checks the table in question for a record matching it entirely; if one doesn't exist, a new record is created. So you would just need to make sure that the germplasm record in your input file matches field-by-field the data currently in the db. ky |
From: Ken Youens-C. <ky...@gm...> - 2006-06-22 19:00:30
|
On Jun 15, 2006, at 8:51 AM, Terry Casstevens wrote: > This looks like an interesting tools that we > may be able to use to update/modify > our GDPDM mySQL databases... > > > http://sourceforge.net/projects/phpmyadmin > http://www.phpmyadmin.net/home_page/index.php Terry, Thanks for the suggestion. It took me a few days to overcome some installation problems, but I finally got PHPMyAdmin working with the build 21 diversity databases on Gramene: http://dev.gramene.org/php/phpmyadmin/index.php Write me personally for the username/password to get in. It's a pretty ugly interface but does indeed allow edits to the data on acorn. I can't immediately find a way to control access to things like dropping tables, nor do I see a way to create pretty forms for the edit pages, but this is something, at least. FYI, it would be trivial to add other databases, e.g., if Panzea wants to use this. Also, in case you care, I'm running this PHP installation with an Apache 2 binary in "/usr/local/apache2" on acorn and proxying all "/php" calls from the normal Apache 1 listening for "dev.gramene.org" to this other server. I thought it would be best to keep them separate. ky |
From: Terry C. <tm...@co...> - 2006-06-15 13:51:55
|
This looks like an interesting tools that we may be able to use to update/modify our GDPDM mySQL databases... http://sourceforge.net/projects/phpmyadmin http://www.phpmyadmin.net/home_page/index.php |
From: Ken Youens-C. <kc...@cs...> - 2006-04-27 17:53:20
|
On Apr 27, 2006, at 12:13 PM, Terry Casstevens wrote: > Ken, > > I've added these to the pending issues list. I'll post > the updated file to the web site soon. Kewl. > One comment... the accename field is a standard IPGRI > passport descriptor. Which is why I've included it > in GDPDM. Not sure that we'd want to remove that. You'd not be removing it so much as "moving" it. Putting *all* the names into one table makes it easier to search, and then you can attribute each in the same way using the synonym type. ky > On 4/18/06, Ken Youens-Clark <kc...@cs...> wrote: >> Terry, >> >> As we discussed on this morning's Gramene conference call, here are a >> couple of ideas for possible inclusion in future GDPDM schema >> releases: >> >> - Allow a many-to-many relationship between div_passport and >> cdv_source: According to Isaak, germplasms may be available from >> more than one source. Also, I would prefer to deprecate the >> "div_passport.source" field and move this info to the normalized >> "cdv_source" table. >> >> - Allow for annotations/xrefs on germplasm names/synonyms: This is a >> combination of ideas. First off, it would be nice to attribute a >> germplasm's synonyms, e.g., to say that "IRGC 3575" is an IRRI >> synonym. This could be accomplished by adding a "synonym_type" table >> with a FK to "div_synonym": >> >> CREATE TABLE div_synonym_type ( >> div_synonym_type_id int PK, >> synonym_type varchar(50), -- e.g., "IRRI" >> xref_url text -- e.g., sprintf-type URL template, >> -- "http://www.iris.irri.org/action/ >> gms?method=getGmsList&search=%s" >> ); >> >> This would be even more useful if ALL germplasm names could be >> attributed and xref'd, so you'd remove the "div_passport.accename" >> and add a "div_display_synonym_id" field instead to indicate which >> synonym is used for the display. It creates a somewhat bizarre >> circular reference (no worse than others already in GDPDM, like those >> that reference other records in the same table [div_parent_id]), but >> it makes for a very flexible way to both search just one table for >> germplasm names and to put their sources into the db. It's possible >> some sort of "annotation" table could also be used, but this way just >> makes everything explicit. >> >> ky >> >> >> ------------------------------------------------------- >> This SF.Net email is sponsored by xPML, a groundbreaking scripting >> language >> that extends applications into web and mobile media. Attend the >> live webcast >> and join the prime developer group breaking into this new coding >> territory! >> http://sel.as-us.falkag.net/sel? >> cmd=lnk&kid=110944&bid=241720&dat=121642 >> _______________________________________________ >> Gdpdm-devel mailing list >> Gdp...@li... >> https://lists.sourceforge.net/lists/listinfo/gdpdm-devel >> > |
From: Terry C. <tm...@co...> - 2006-04-27 17:13:50
|
Ken, I've added these to the pending issues list. I'll post the updated file to the web site soon. One comment... the accename field is a standard IPGRI passport descriptor. Which is why I've included it in GDPDM. Not sure that we'd want to remove that. Terry On 4/18/06, Ken Youens-Clark <kc...@cs...> wrote: > Terry, > > As we discussed on this morning's Gramene conference call, here are a > couple of ideas for possible inclusion in future GDPDM schema releases: > > - Allow a many-to-many relationship between div_passport and > cdv_source: According to Isaak, germplasms may be available from > more than one source. Also, I would prefer to deprecate the > "div_passport.source" field and move this info to the normalized > "cdv_source" table. > > - Allow for annotations/xrefs on germplasm names/synonyms: This is a > combination of ideas. First off, it would be nice to attribute a > germplasm's synonyms, e.g., to say that "IRGC 3575" is an IRRI > synonym. This could be accomplished by adding a "synonym_type" table > with a FK to "div_synonym": > > CREATE TABLE div_synonym_type ( > div_synonym_type_id int PK, > synonym_type varchar(50), -- e.g., "IRRI" > xref_url text -- e.g., sprintf-type URL template, > -- "http://www.iris.irri.org/action/ > gms?method=3DgetGmsList&search=3D%s" > ); > > This would be even more useful if ALL germplasm names could be > attributed and xref'd, so you'd remove the "div_passport.accename" > and add a "div_display_synonym_id" field instead to indicate which > synonym is used for the display. It creates a somewhat bizarre > circular reference (no worse than others already in GDPDM, like those > that reference other records in the same table [div_parent_id]), but > it makes for a very flexible way to both search just one table for > germplasm names and to put their sources into the db. It's possible > some sort of "annotation" table could also be used, but this way just > makes everything explicit. > > ky > > > ------------------------------------------------------- > This SF.Net email is sponsored by xPML, a groundbreaking scripting langua= ge > that extends applications into web and mobile media. Attend the live webc= ast > and join the prime developer group breaking into this new coding territor= y! > http://sel.as-us.falkag.net/sel?cmd=3Dlnk&kid=3D110944&bid=3D241720&dat= =3D121642 > _______________________________________________ > Gdpdm-devel mailing list > Gdp...@li... > https://lists.sourceforge.net/lists/listinfo/gdpdm-devel > |
From: Ken Youens-C. <kc...@cs...> - 2006-04-18 15:27:35
|
Terry, As we discussed on this morning's Gramene conference call, here are a couple of ideas for possible inclusion in future GDPDM schema releases: - Allow a many-to-many relationship between div_passport and cdv_source: According to Isaak, germplasms may be available from more than one source. Also, I would prefer to deprecate the "div_passport.source" field and move this info to the normalized "cdv_source" table. - Allow for annotations/xrefs on germplasm names/synonyms: This is a combination of ideas. First off, it would be nice to attribute a germplasm's synonyms, e.g., to say that "IRGC 3575" is an IRRI synonym. This could be accomplished by adding a "synonym_type" table with a FK to "div_synonym": CREATE TABLE div_synonym_type ( div_synonym_type_id int PK, synonym_type varchar(50), -- e.g., "IRRI" xref_url text -- e.g., sprintf-type URL template, -- "http://www.iris.irri.org/action/ gms?method=getGmsList&search=%s" ); This would be even more useful if ALL germplasm names could be attributed and xref'd, so you'd remove the "div_passport.accename" and add a "div_display_synonym_id" field instead to indicate which synonym is used for the display. It creates a somewhat bizarre circular reference (no worse than others already in GDPDM, like those that reference other records in the same table [div_parent_id]), but it makes for a very flexible way to both search just one table for germplasm names and to put their sources into the db. It's possible some sort of "annotation" table could also be used, but this way just makes everything explicit. ky |
From: Peter B. <pj...@co...> - 2006-02-08 14:14:45
|
Ken and Terry, I'll throw my two cents in on a couple of topics, though most of the topics have been handled well. `accename` varchar(255) unique NOT NULL is correct syntax according to the MySQL manual: column_definition: col_name type [NOT NULL | NULL] [DEFAULT default_value] [AUTO_INCREMENT] [UNIQUE [KEY] | [PRIMARY] KEY] [COMMENT 'string'] [reference_definition] Terry and talked about populating new database instances with basic reference data. So, I think that's a good idea. Although, it would be easy enough to make that a separate script or to create a data-free version of the database creation script if that's needed. On recurrent parent info: That's very basic pedigree information. If you don't know it, you really do not have a complete pedigree and probably should not be putting the information in the database. When you make a cross, if you then backcross the progeny to one of the parents, the parent you backcrossed to is the recurrent parent. Requiring recurrent makes sense and 0 would be a reasonable default value. On making accession fields unique: Since these are identifiers from external databases, a GDPDM database manager has no control over them. It would, in general, be dangerous to make them unique. For instance, it would be quite possible to define two similar traits in a local db that both link to the same trait_ontology term. div_allele.accession is specific to a div_allele.referencedb, so it is really a combination of the two fields that might be unique. Since a div_allele can have more than one value there might legitimately be more than one record in the table for an external allele.accession. Peter At 07:24 PM 2/7/2006, you wrote: >Hi Ken, > >Thanks for reviewing the DDL. >Please see my embedded comments below. > > > >Cheers, > >Terry > > >On 2/6/06, Ken Youens-Clark <ky...@gm...> wrote: > > On Feb 3, 2006, at 4:43 AM, Terry Casstevens wrote: > > > > > I have the latest version 2.0 DDL and documentation > > > on the GDPDM web site (NOTE: this does NOT have > > > the addition tables recently discussed). Please take > > > a look to make sure everything looks good to you. > > > CVS on sourceforge is having problems, so I'll > > > check this in later. Let me know if you have > > > any suggestions? > > > > > > http://www.maizegenetics.net/gdpdm/documentation.html > > > > Hmm, this is odd. The first thing I tried was to graph this using > > SQL::Translator and Graphviz, and SQLT choked on this: > > > > CREATE TABLE `div_passport` ( > > ... > > `accename` varchar(255) unique NOT NULL > > > > That "unique" in there is odd! MySQL dumped this? I'm only used to > > see UNIQUE constraints declared like: > > > > UNIQUE(`accename`) > > > >No, mysqldump did not produce that. I put that in. I wasn't >sure how it should be, but it executed correctly when I built >a new database? I will change it if you think I should? > > > > > > I also had to remove the data insertion statements (SQLT doesn't know > > about that syntax), but you don't really want to be distributing a > > schema with default info, right? Make sure to run "mysqldump --no- > > data", I think. > > > >I did use the --no-data option when I first generated the script. >I put those insert statements in afterward. If you look at the >tables and values that are being inserted, you'll see they are very >generic. And from my understanding, needed for pretty much >any GDPDM database. So I meant to make it easier for >folks. Do you still disagree? > > > > > > All the relationships parse and draw well, so I think the FKs are good. > > > > I'm a little bothered by some inconsistencies, like "div_passport" > > has an "accename" accession field, but "div_allele" has one called > > "accession." I wish this idea was consistent across all tables. > > Also, "div_accession_collection" uses underscores and the abbr. "col" > > for "col_date" but not for "collnumb", "collsrc" and "collcode". I > > see other places underscores aren't used to separate words, like > > "div_allele.referencedb", "div_locality.origcty", > > "div_passport.accenumb", and "div_passport.sampstat". > > > >Let me first say that I couldn't agree with you more about being >consistent for easy use etc..... > >Although.... >- At the time of developing this schema, I didn't see accename, > accession, etc. as being the same function. And I wasn't > planning to include such in all tables. >- collnumb, collsrc, collcode, etc. are from the IPGRI Passport > Descriptors. That is why I choice those forms. >- And most of all. It is way too late to be changing this stuff! > If we change stuff like this now, it would be a huge headache > to modify all the existing GDPDM databases. Adding tables > (or even fields) isn't a problem. That doesn't break any > existing databases/tools, but changing existing stuff is > very bad. > > > > > > Should "div_passport.germplasm_type" be a separate lookup table to a > > "germplasm_type" table exporting a "germplasm_type_id"? It bothers > > me that the table is called "passport" but that it has a > > "germplasm_type" -- it seems like either the table should be called > > "germplasm" or this field should be called "passport_type." Either > > way, is this relationship denormalized? > > > >Please see documentation. germplasm_type is deprecated. >sampstat should be used instead. > > > > > > I think "div_stock_parent.recurrent" is ambiguous and should be > > renamed "is_recurrent." Perhaps it should also have a default value > > of either "0" or "1" -- is there some biological significance to > > declaring it needs to have a value? > > > >I would have made this change a year ago. Too late now. >Maybe it could default to "0," but that field is required. >Every defined parent will either be recurrent or not. Unless >maybe it is unknown. We'd need to ask a biologist. > > > > > > I'd like to see the order of fields be PK, FKs, "name"-like field, > > other fields, "comments", so I'd like to see > > "div_allele_assay.div_ref_stock_id" moved up, perhaps after > > "div_scoring_tech_type_id"? > > > >I thought I did all those. I'll change it. > > > > > > I think "cdv_marker.marker_aid" indicates an accession, right? So it > > should probably be renamed to something more consistent with the rest > > of the schema, like "accename" or "accession." (Embarrassingly, I > > think I lobbied for that field. :-) > > > >Too late! > > > > > > > I know you are still wanting the accession fields > > > added to the schema. Maybe we can discuss > > > it at the Gramene retreat. Both Ed and I are > > > planning to attend. > > > > I feel like there are already several accessions in there: > > > > div_passport.accename > > div_allele.accession + > > div_trait_uom.to_accession * + > > div_taxonomy.term_accession * + > > div_treatment_uom.eo_accession * + > > cdv_marker.marker_aid * + > > > > * - Really an external accession, but the same idea > > + - Should there be a UNIQUE constraint on these? > > > >We can certainly consider making these unique. >But we'd need to think about each one to make >sure we know that its ok. > > > > > > I think it's just a matter of "embracing and extending" this idea. :-) > > > >I don't think we are going to solve this by email. >We (Ken, Terry, Ed, Peter, anyone else interested) probably >need to discuss this face to face. > > > > > > ky > > > > >------------------------------------------------------- >This SF.net email is sponsored by: Splunk Inc. Do you grep through log files >for problems? Stop! Download the new AJAX search engine that makes >searching your log files as easy as surfing the web. DOWNLOAD SPLUNK! >http://sel.as-us.falkag.net/sel?cmdk&kid3432&bid#0486&dat1642 >_______________________________________________ >Gdpdm-devel mailing list >Gdp...@li... >https://lists.sourceforge.net/lists/listinfo/gdpdm-devel |
From: Terry C. <tm...@co...> - 2006-02-08 00:25:06
|
Hi Ken, Thanks for reviewing the DDL. Please see my embedded comments below. Cheers, Terry On 2/6/06, Ken Youens-Clark <ky...@gm...> wrote: > On Feb 3, 2006, at 4:43 AM, Terry Casstevens wrote: > > > I have the latest version 2.0 DDL and documentation > > on the GDPDM web site (NOTE: this does NOT have > > the addition tables recently discussed). Please take > > a look to make sure everything looks good to you. > > CVS on sourceforge is having problems, so I'll > > check this in later. Let me know if you have > > any suggestions? > > > > http://www.maizegenetics.net/gdpdm/documentation.html > > Hmm, this is odd. The first thing I tried was to graph this using > SQL::Translator and Graphviz, and SQLT choked on this: > > CREATE TABLE `div_passport` ( > ... > `accename` varchar(255) unique NOT NULL > > That "unique" in there is odd! MySQL dumped this? I'm only used to > see UNIQUE constraints declared like: > > UNIQUE(`accename`) > No, mysqldump did not produce that. I put that in. I wasn't sure how it should be, but it executed correctly when I built a new database? I will change it if you think I should? > > I also had to remove the data insertion statements (SQLT doesn't know > about that syntax), but you don't really want to be distributing a > schema with default info, right? Make sure to run "mysqldump --no- > data", I think. > I did use the --no-data option when I first generated the script. I put those insert statements in afterward. If you look at the tables and values that are being inserted, you'll see they are very generic. And from my understanding, needed for pretty much any GDPDM database. So I meant to make it easier for folks. Do you still disagree? > > All the relationships parse and draw well, so I think the FKs are good. > > I'm a little bothered by some inconsistencies, like "div_passport" > has an "accename" accession field, but "div_allele" has one called > "accession." I wish this idea was consistent across all tables. > Also, "div_accession_collection" uses underscores and the abbr. "col" > for "col_date" but not for "collnumb", "collsrc" and "collcode". I > see other places underscores aren't used to separate words, like > "div_allele.referencedb", "div_locality.origcty", > "div_passport.accenumb", and "div_passport.sampstat". > Let me first say that I couldn't agree with you more about being consistent for easy use etc..... Although.... - At the time of developing this schema, I didn't see accename, accession, etc. as being the same function. And I wasn't planning to include such in all tables. - collnumb, collsrc, collcode, etc. are from the IPGRI Passport Descriptors. That is why I choice those forms. - And most of all. It is way too late to be changing this stuff! If we change stuff like this now, it would be a huge headache to modify all the existing GDPDM databases. Adding tables (or even fields) isn't a problem. That doesn't break any existing databases/tools, but changing existing stuff is very bad. > > Should "div_passport.germplasm_type" be a separate lookup table to a > "germplasm_type" table exporting a "germplasm_type_id"? It bothers > me that the table is called "passport" but that it has a > "germplasm_type" -- it seems like either the table should be called > "germplasm" or this field should be called "passport_type." Either > way, is this relationship denormalized? > Please see documentation. germplasm_type is deprecated. sampstat should be used instead. > > I think "div_stock_parent.recurrent" is ambiguous and should be > renamed "is_recurrent." Perhaps it should also have a default value > of either "0" or "1" -- is there some biological significance to > declaring it needs to have a value? > I would have made this change a year ago. Too late now. Maybe it could default to "0," but that field is required. Every defined parent will either be recurrent or not. Unless maybe it is unknown. We'd need to ask a biologist. > > I'd like to see the order of fields be PK, FKs, "name"-like field, > other fields, "comments", so I'd like to see > "div_allele_assay.div_ref_stock_id" moved up, perhaps after > "div_scoring_tech_type_id"? > I thought I did all those. I'll change it. > > I think "cdv_marker.marker_aid" indicates an accession, right? So it > should probably be renamed to something more consistent with the rest > of the schema, like "accename" or "accession." (Embarrassingly, I > think I lobbied for that field. :-) > Too late! > > > I know you are still wanting the accession fields > > added to the schema. Maybe we can discuss > > it at the Gramene retreat. Both Ed and I are > > planning to attend. > > I feel like there are already several accessions in there: > > div_passport.accename > div_allele.accession + > div_trait_uom.to_accession * + > div_taxonomy.term_accession * + > div_treatment_uom.eo_accession * + > cdv_marker.marker_aid * + > > * - Really an external accession, but the same idea > + - Should there be a UNIQUE constraint on these? > We can certainly consider making these unique. But we'd need to think about each one to make sure we know that its ok. > > I think it's just a matter of "embracing and extending" this idea. :-) > I don't think we are going to solve this by email. We (Ken, Terry, Ed, Peter, anyone else interested) probably need to discuss this face to face. > > ky > |
From: Ken Youens-C. <ky...@gm...> - 2006-02-06 20:43:58
|
On Feb 6, 2006, at 2:37 PM, Ken Youens-Clark wrote: >> I have the latest version 2.0 DDL and documentation >> on the GDPDM web site (NOTE: this does NOT have >> the addition tables recently discussed). Please take >> a look to make sure everything looks good to you. >> CVS on sourceforge is having problems, so I'll >> check this in later. Let me know if you have >> any suggestions? Oh, one other thing I meant to include was about the field "div_generation.icis_id" -- since every other field ending in "_id" is either a PK or FK, perhaps this field should be renamed to "icis_term" or something to preclude any assumption (esp. by an automated system) that this is a FK? ky |
From: Ken Youens-C. <ky...@gm...> - 2006-02-06 20:37:40
|
On Feb 3, 2006, at 4:43 AM, Terry Casstevens wrote: > I have the latest version 2.0 DDL and documentation > on the GDPDM web site (NOTE: this does NOT have > the addition tables recently discussed). Please take > a look to make sure everything looks good to you. > CVS on sourceforge is having problems, so I'll > check this in later. Let me know if you have > any suggestions? > > http://www.maizegenetics.net/gdpdm/documentation.html Hmm, this is odd. The first thing I tried was to graph this using SQL::Translator and Graphviz, and SQLT choked on this: CREATE TABLE `div_passport` ( ... `accename` varchar(255) unique NOT NULL That "unique" in there is odd! MySQL dumped this? I'm only used to see UNIQUE constraints declared like: UNIQUE(`accename`) I also had to remove the data insertion statements (SQLT doesn't know about that syntax), but you don't really want to be distributing a schema with default info, right? Make sure to run "mysqldump --no- data", I think. All the relationships parse and draw well, so I think the FKs are good. I'm a little bothered by some inconsistencies, like "div_passport" has an "accename" accession field, but "div_allele" has one called "accession." I wish this idea was consistent across all tables. Also, "div_accession_collection" uses underscores and the abbr. "col" for "col_date" but not for "collnumb", "collsrc" and "collcode". I see other places underscores aren't used to separate words, like "div_allele.referencedb", "div_locality.origcty", "div_passport.accenumb", and "div_passport.sampstat". Should "div_passport.germplasm_type" be a separate lookup table to a "germplasm_type" table exporting a "germplasm_type_id"? It bothers me that the table is called "passport" but that it has a "germplasm_type" -- it seems like either the table should be called "germplasm" or this field should be called "passport_type." Either way, is this relationship denormalized? I think "div_stock_parent.recurrent" is ambiguous and should be renamed "is_recurrent." Perhaps it should also have a default value of either "0" or "1" -- is there some biological significance to declaring it needs to have a value? I'd like to see the order of fields be PK, FKs, "name"-like field, other fields, "comments", so I'd like to see "div_allele_assay.div_ref_stock_id" moved up, perhaps after "div_scoring_tech_type_id"? I think "cdv_marker.marker_aid" indicates an accession, right? So it should probably be renamed to something more consistent with the rest of the schema, like "accename" or "accession." (Embarrassingly, I think I lobbied for that field. :-) > I know you are still wanting the accession fields > added to the schema. Maybe we can discuss > it at the Gramene retreat. Both Ed and I are > planning to attend. I feel like there are already several accessions in there: div_passport.accename div_allele.accession + div_trait_uom.to_accession * + div_taxonomy.term_accession * + div_treatment_uom.eo_accession * + cdv_marker.marker_aid * + * - Really an external accession, but the same idea + - Should there be a UNIQUE constraint on these? I think it's just a matter of "embracing and extending" this idea. :-) ky |
From: Ken Youens-C. <kc...@cs...> - 2006-01-10 22:57:48
|
Terry, Should there be a UNIQUE constraint on div_passport.accename? ky |
From: Ken Youens-C. <ky...@gm...> - 2006-01-10 22:50:52
|
Terry, Just to document what I pointed out on the phone today, the "cdv_marker" is missing a foreign key declaration on its "div_ref_stock_id" field to the "div_stock" table. ky |
From: Terry C. <tca...@gm...> - 2006-01-10 18:57:31
|
Hi Ken, That link between div_allele_assay and div_stock is a "reference" only. The path that you suggest is the correct path. Look at my graphics picture of the GDPDM schema. I didn't even draw that link between div_allele_assay and div_stock. With your dynamic path algorithm, leave out any link that I did not draw on the schema diagram. That will solve any ambiguities that you have. Cheers, Terry On 1/10/06, Ken Youens-Clark <kc...@cs...> wrote: > Terry || Ed || Isaak, > > Per a request from Isaak, I'm trying to create a display of a record > from the "div_experiment" table that includes its related > "div_scoring_tech_type." My code is using some graph theory to > automatically navigate the schema to get between these tables. Using > a shortest path-finding algorithm developed by Dijkstra, it has > determined the path to be this: > > div_experiment > div_obs_unit > div_stock > div_allele_assay > div_scoring_tech_type > > However, looking at the schema, I think it should probably go like this: > > div_experiment > div_obs_unit > div_obs_unit_sample > div_allele > div_allele_assay > div_scoring_tech_type > > Is there a preferred path, or are both valid? How do you feel about > there being multiple possible paths through the schema? Is this a > sign of weakness or robustness? I don't know the biology nearly > enough to comment on how well the schema mirrors the data, but I'm > worried by the fact that the schema isn't explicit in its definition > as to how to get from one table to the other. > > ky > > |
From: Ken Youens-C. <kc...@cs...> - 2006-01-10 14:55:57
|
Terry || Ed || Isaak, Per a request from Isaak, I'm trying to create a display of a record from the "div_experiment" table that includes its related "div_scoring_tech_type." My code is using some graph theory to automatically navigate the schema to get between these tables. Using a shortest path-finding algorithm developed by Dijkstra, it has determined the path to be this: div_experiment div_obs_unit div_stock div_allele_assay div_scoring_tech_type However, looking at the schema, I think it should probably go like this: div_experiment div_obs_unit div_obs_unit_sample div_allele div_allele_assay div_scoring_tech_type Is there a preferred path, or are both valid? How do you feel about there being multiple possible paths through the schema? Is this a sign of weakness or robustness? I don't know the biology nearly enough to comment on how well the schema mirrors the data, but I'm worried by the fact that the schema isn't explicit in its definition as to how to get from one table to the other. ky |
From: Peter B. <pj...@co...> - 2005-12-31 17:49:37
|
I was just noticing that the treatment tables and the trait tables in GDPDM have exactly the same structure. There are a few more fields in the trait tables, but treatment data could be stored in the trait tables with no problem. Also, in some cases, the distinction between treatments and traits is blurred. A concrete example is some data that I helped upload a while ago. It was data collected as part of an aluminum tolerance experiment. Each taxon had two treatments (Al vs noAl) and root length was measured on three days. The root growth measurement was hard on the plants, so it was basically a destructive measurement, so different plants were measured on each day. The resulting data could be uploaded in a few different ways: Aluminum and day as treatment, root length as trait. Aluminum as treatment, day and root length as trait. All three as traits. Aluminum treatment as block in the plots table, day as either trait or treatment, root length as treatment. day1_root_length, day2_root_length, day3_root_length as different traits All those ways of loading the data would work. However, to retrieve the data, I have to know something about how it was stored. Either that or you always look in both the treatment and trait tables for whatever is there. If you always have to check both tables, you might as well just store it all in one and make things simpler. Other items that blur the distinction between treatment and trait: In the above analysis you might want to treat day as a categorical variable or as a continuous variable ( a covariate). In general, days to an event could be considered a treatment as in the above example, or as a trait as in days to silk. Treatment and trait data can either be continuous or discrete. The above argues pretty strongly that the treatment and trait tables should be merged, which simplifies both uploading and downloading. For the ontology folks it also suggests that there should not be separate trait and environment ontologies. Have a happy New Year. Peter Bradbury USDA-ARS 741 Rhodes Hall Cornell University Ithaca, NY 14853 607-255-5392 |
From: Ken Youens-C. <kc...@cs...> - 2005-11-15 22:06:18
|
OK, I've discussed this with Terry, so I'm reasonably sure I'm not mistaken on this. I think the "cdv_marker" is missing a foreign key constraint on the "div_ref_stock_id" field to the "div_stock" table. ky Begin forwarded message: > From: Terry Casstevens <tm...@co...> > Date: November 15, 2005 4:03:23 PM CST > To: Ken Youens-Clark <kc...@cs...> > Subject: Re: Error in "cdv_marker" def > > Yes, I think so. Here's what the docs say. > > foreign key referencing stock associated with the > reference sequence. > > Its related to the ref_seq field. > > Past that, you'd probably have to ask Ed or Peter > why exactly it is there. > > > > > > On 11/15/05, Ken Youens-Clark <kc...@cs...> wrote: >> OK, so I don't embarrass myself again by sending this to the GDPDM >> list, let me approach just you on this. Checkit: >> >> CREATE TABLE `cdv_marker` ( >> `cdv_marker_id` int unsigned NOT NULL auto_increment, >> `cdv_map_feature_id` int unsigned default NULL, >> `name` varchar(255) default NULL, >> `position` int default NULL, >> `length` int default NULL, >> `ref_seq` text, >> `div_ref_stock_id` int unsigned default NULL, >> `marker_aid` varchar(255) default NULL, >> PRIMARY KEY (`cdv_marker_id`), >> KEY `cdv_map_feature_id` (`cdv_map_feature_id`), >> KEY `marker_aid` (`marker_aid`), >> CONSTRAINT `cdv_marker_ibfk_1` FOREIGN KEY (`cdv_map_feature_id`) >> REFERENCES `cdv_map_feature` (`cdv_map_feature_id`) >> ) ENGINE=InnoDB; >> >> Should there be a FK from "div_ref_stock_id" to the "div_stock" >> table? >> >> ky >> > |
From: Ken Youens-C. <ky...@gm...> - 2005-11-15 16:57:13
|
On Nov 15, 2005, at 10:50 AM, Ken Youens-Clark wrote: > I believe the error is with the FK constraint > "div_allele_assay_ibfk_1" -- it references the table itself and > it's PK. I believe that constraint should be dropped. Terry, Sorry for the confusion. I see there's no problem with the schema. I'm looking at some data that appears to be incorrect, and I just got confused. Still, circular references on a table! ky |
From: Ken Youens-C. <kc...@cs...> - 2005-11-15 16:50:42
|
Terry, I think there's an error in the table definition for "div_allele_assay." It's currently this: CREATE TABLE `div_allele_assay` ( `div_allele_assay_id` int unsigned NOT NULL auto_increment, `div_source_assay_id` int unsigned default NULL, `div_poly_type_id` int unsigned default NULL, `div_scoring_tech_type_id` int unsigned default NULL, `cdv_marker_id` int unsigned default NULL, `comments` text, `assay_date` datetime default NULL, `name` varchar(255) default NULL, `producer` varchar(255) default NULL, `position` varchar(255) default NULL, `ref_seq` text, `div_ref_stock_id` int unsigned default NULL, `length` int default NULL, PRIMARY KEY (`div_allele_assay_id`), KEY `div_source_assay_id` (`div_source_assay_id`), KEY `div_poly_type_id` (`div_poly_type_id`), KEY `cdv_marker_id` (`cdv_marker_id`), KEY `div_ref_stock_id` (`div_ref_stock_id`), KEY `assay_date` (`assay_date`), KEY `name` (`name`), KEY `div_scoring_tech_type_id` (`div_scoring_tech_type_id`), CONSTRAINT `div_aa_annotation_ibfk_3` FOREIGN KEY (`div_scoring_tech_type_id`) REFERENCES `div_scoring_tech_type` (`div_scoring_tech_type_id`), CONSTRAINT `div_allele_assay_ibfk_1` FOREIGN KEY (`div_source_assay_id`) REFERENCES `div_allele_assay` (`div_allele_assay_id`), CONSTRAINT `div_allele_assay_ibfk_2` FOREIGN KEY (`div_poly_type_id`) REFERENCES `div_poly_type` (`div_poly_type_id`), CONSTRAINT `div_allele_assay_ibfk_3` FOREIGN KEY (`cdv_marker_id`) REFERENCES `cdv_marker` (`cdv_marker_id`), CONSTRAINT `div_allele_assay_ibfk_4` FOREIGN KEY (`div_ref_stock_id`) REFERENCES `div_stock` (`div_stock_id`) ) ENGINE=InnoDB; I believe the error is with the FK constraint "div_allele_assay_ibfk_1" -- it references the table itself and it's PK. I believe that constraint should be dropped. ky |
From: Ken Youens-C. <kc...@cs...> - 2005-11-09 16:51:08
|
On Nov 3, 2005, at 9:57 AM, Peter Bradbury wrote: > One thing missing from GDPDM that we might want to consider adding > is information about which combination of fields in each table > determine a unique record. One other point I meant to make in response to this suggestion is that it places an additional burden on the curator to know which combination of fields determines uniqueness. It would change on a table-by-table basis, whereas a single "accession" field would be forever unambiguous. Further, requiring several fields to identify a record means one typo in one field could mean duplication. I believe it's too much to ask of human curators given how complex this schema is, how much human effort is required to curate the data, and how limited their options are at this time for inserting it. ky |
From: Ken Youens-C. <ky...@gm...> - 2005-11-09 15:27:43
|
On Nov 3, 2005, at 9:57 AM, Peter Bradbury wrote: > In response to Kens suggestion that new identifiers be added to > GDPDM, my feeling is that it is unnecessary at best and may > actually make the task of database maintenance more complicated not > less. One of the issues that Ken raises is how to tell whether a > proposed addition to the database already exists or is new. That > is indeed a difficult issue to address for several reasons. > However, the information needed to answer the question already > exists in the database. The information does not currently exist as an explicit part of the schema definition. I'm asking that each table have a UNIQUE value apart from it's primary key. > Adding another field that simply combines or duplicates existing > information doesn't help and violates good database design principles. I don't believe most accessions duplicate data. E.g., a GenBank accession such as "D50439" doesn't duplicate anything about the marker "P108," in fact it quite easily helps differentiate it from any other marker that might happen to be named "P108," and, further, allows or even encourages the duplication of marker names, which is a Good Thing (tm) in that it allows the database to reflect the reality that biologists sometimes choose the same name for different things. If we were to put a UNIQUE constraint on marker names, we probably won't be able to put in some rice SSRs, e.g.: http://dev.gramene.org/db/markers/marker_view?marker_name=rm1003 > That does not, however, mean that database tables that specifically > support the upload tools should not be created. I don't believe it's necessary to add new tables, just UNIQUE constraints. > They would not be part of GDPDM but part of the upload tool > application. I would suggest keeping them in a separate schema. > During development they could even be on a different machine. That > would provide complete freedom to use text fields to represent > FK's, deal with security issues that may require storage of > information about users, etc. I believe this suggestion is actually the one where data would be duplicated, only it would be in extra tables possibly in external databases. Not being an integral part of the schema, it would also not help me in integrating the GDPDM data into the rest of Gramene, e.g., how do I easily link into species or germplasm data without relying on knowing the primary key value (which might change)? That is, from our marker or comparative maps, I want to show the user more about what we're calling "rice," which is actually "Oryza sativa japonica," so I want to link to a GDPDM viewer using some value other than "42," something with a modicum of meaning like "osat" and something which I know won't change if the data gets updated or reloaded. > One thing missing from GDPDM that we might want to consider adding > is information about which combination of fields in each table > determine a unique record. If all those fields do not match then a > record is new. If all those fields match, then any other fields > that may be different get updated (if the user has permission). I > can think of a few different ways of implementing rules about > uniqueness of combinations of fields as follows: > 1. Determine a set of rules specific to the upload tools. The > upload tools will handle data in that way, but nothing else has to. > 2. Formally document those rules as part of GDPDM. > 3. Create unique indexes in GDPDM implementing those rules. I'm a big proponent of automatic code generation when it makes sense. For this project, given the number of tables, I've definitely taken this approach and have therefore spent a lot of time agonizing over the structure of the schema, e.g., field order and such. I believe my problems would be solved with the addition of UNIQUE constraints in the database, be they on single or multiple fields. So, I believe your #3 suggestion takes care of 1 and 2, too. However, I'm on the fence with relying solely on multiple fields to uniquely identify a record. I know this would work, e.g., a species could be said to be a unique combination of genus, species, subspecies, subtaxa, race and population, but wouldn't it be easier to just say that a new "accession" field uniquely identifies the record? It's a difference in using six fields instead of one. The sheer number plus the factor of human mistake in misspelling just one of those fields raises the distinct probability that data will be duplicated. Further, it makes my earlier "linking in" problem more problematic. E.g., would the URL to get to a specific taxonomy record have to look like this: /db/diversity_view? action=view&object=div_taxonomy&genus=Oryza&species=sativa&subspecies=ja ponica&subtaxa=temperate&race=foo&population=bar Wouldn't this be a lot easier to generate? /db/diversity_view?action=view&object=div_taxonomy&accession=osat > It probably makes sense to do 1 first, then once there is a working > rule set consider formally documenting and implementing them as > indexes. Yes, I agree completely. Either way, we're talking about a schema change, though adding constraints is far less instrusive than adding fields. There again, if you try to add UNIQUE constraints to fields that have duplicated data already, then it will be a pretty big burden to fix the data, too, I guess. Adding a new "accession" field could be easy enough in that, to get started, you could use a script to assign the PK value to the accession and then let the curator put something more meaningful there later. This would work with databases that currently have duplication. > Another really sticky problem is you do not want to create new > germplasm or other types of records just because a user spelled > something in one of the key fields wrong. For example, B73Ht, > B73H, B73 Ht, and B73_Ht could all be reasonable ways to represent > the same accession. The only solution that I came up with was to > provide the user with a way to look up information from the > database. For example, show me all the records that start with B7. I believe this would be exactly the kind of problem that could be solved by the addition of an accession plus careful data curation. Regardless of how the germplasm's name was put into the database, a group would know that they've assigned it the accession "b73ht" and would just use that. > A third item, which I assume is a major driver behind Ken's > suggestion, is that when adding records to a table, a user must > establish the correct references to linked tables. In other words, > the user either has to enter the correct FK's or provide enough > information that they can be correctly determined. I take Ken's > suggestion as using a text field or mnemonic to represent the > FK's. In writing the Excel upload tool, I actually made use of all > three methods where each seemed to make sense. > > To upload germplasm records, I used a single spreadsheet (or view) > to upload data to the div_passport, div_stock, and div_stock_parent > tables. All the individual fields necessary for determining some > of the FK's were there. For taxonomy_id, I required that the user > look up the correct id from the taxonomy table and enter the actual > FK value. That works because it is likely an entire upload will > use only a single value. For each row in the germplasm table, the > user enters a UserTaxaName. The UserTaxaName is used in other > spreadsheets to refer to that germplasm entry. In the upload > process, it is used to look up a div_stock_id in the germplasm > spreadsheet. Essentially, the Excel file stores a set of text > mnenomics for FK's specific to that Excel workbook. So, it's not > much of a stretch to use a database table connected with the upload > tools to store information like that, either globally for all users > or on a per user basis. To quote an ex-President, "I feel your pain" on the Excel stuff. I, too, am letting Isaak start from an Excel file to organize the data [1], though he's having to keep up only with the PK values, which he has to assign, in order to relate the data in the different worksheets. I generate the Excel file directly from the schema definition in order to ensure that all the identifiers are present and properly spelled, so there are no helper fields as you have created. You are correct that one of the things I'm shooting for is to add mnenomic devices for more easily relating data. I don't intend to my Excel "interface" to be my last effort[2], so perhaps this won't always be the most important point for me. However, the linking in question would continue to be resolved by the addition of such a mnemonic. I believe my suggestion is in the spirit of Einstein's suggestion that "things should be made as simple as possible -- but no simpler." Adding just one field that always must be unique is easier than deciding which combination, if any, or other fields is unique. For those disinclined to use an accession, it could automatically be assigned the PK value of the record, so it would be no burden. What I need is for the schema itself to be explicit about what uniquely identifies records, so I believe the schema must be changed to address this issue either way. ky [1] FWIW, the Excel file is transformed into an structured text file (S-expressions, indented text, or XML) before being fed to my upload tool, removing any dependence on Bill Gates or his efforts to make different versions of Excel incompatible and to force everyone to upgrade to the latest version. [2] I'm investigating a "Ruby on Rails"-like framework for Perl called "Catalyst" that should be able to build on the code classes I've already generated for my import and viewing tools. |