|
From: Scott C. <ca...@cs...> - 2003-02-27 20:35:27
|
OK, I've installed postgresql 7.3.2 and redone much of my analysis.
Essentially, I still see a difference in index usage, but I suspect that
the points that Jonathan and Andy have made about how the planner does
its magic is the root cause. Here is a table summarizing the times for
each of the queries (in msec).
Query1 Query2
srcfeature_id Query1 Query2 (seqscan=0)
1 23720.67 47862.31 12.59 100.15
2 25177.47 47633.77 12.70 113.44
3 0.41 1.04
4 8.62 9.51
5 24527.67 47408.37 13.88 171.41
6 26281.33 48268.77 10.91 113.86
7 24484.78 47934.39 16.85 162.68
I believe that v7.3 is much better at caching results than the version
that I had been using, and all of these results are for the second time
the query had run. The order of magnitude difference is the same for
the first time the queries ran as well. For supplemental information,
here is a table showing the number of rows in featureloc by
srcfeature_id, with percent of the table and what assembly the
feature_id represents:
srcfeature_id | count
---------------+--------
1 | 316000 8.3% (X.3)
2 | 402695 10.6% (2L.3C)
3 | 0 0% (U.3)
4 | 27084 0.7% (4.3)
5 | 438746 11.5% (2R.3)
6 | 367159 9.7% (3L.3)
7 | 537342 14.1% (3R.3)
Which is all very interesting, but I believe still argues in favor of
using min/max coordinates for featureloc.
Scott
On Wed, 2003-02-26 at 15:27, David Emmert wrote:
> Hi Scott,
>
> Which Pg server are you using?
>
> Explaining your query1 on Pg 7.3 using srcfeature_id = 1, 4, 6, I'm not seeing
> any difference in index usage. Eg:
>
> explain select distinct f.name,fl.min,fl.max,fl.strand,f.type_id,f.feature_id
> from feature f, featureloc fl
> where
> fl.srcfeature_id = 6 and
> f.feature_id = fl.feature_id and
> fl.max >= 390956 and
> fl.min <= 393164
> ;
>
> -Dave
>
>
> From: Scott Cain <ca...@cs...>
> >> To: gmod schema <gmo...@li...>
> >> Subject: [Gmod-schema] Postgresql DBA-type question
> >>
> >> Hello,
> >>
> >> This is going to be fairly long.
> >>
> >> I am on a continuing quest to improve query performance for queries used
> >> frequently by gbrowse. The most common time consuming query is to find
> >> all of the features in a given range. To do that, there are two types
> >> of queries that can be used, which I will refer to as Query1 and Query2.
> >> Query1 uses the coordinate system I have advocated the last couple of
> >> weeks, with min and max columns and min < max always. (Those column
> >> names should be changed to avoid conflicts with restricted words.)
> >> Query2 uses a union of two queries that use the other coordinate system
> >> (nbeg and nend) combined with strand. Here are examples:
> >>
> >> Query1:
> >> select distinct f.name,fl.min,fl.max,fl.strand,f.type_id,f.feature_id
> >> from feature f, featureloc fl
> >> where
> >> fl.srcfeature_id = 1 and
> >> f.feature_id = fl.feature_id and
> >> fl.max >= 390956 and
> >> fl.min <= 393164
> >>
> >> Query2:
> >> select distinct f.name,fl.nbeg,fl.nend,fl.strand,f.type_id,f.feature_id
> >> from feature f, featureloc fl
> >> where
> >> fl.srcfeature_id = 1 and
> >> f.feature_id = fl.feature_id and
> >> fl.nbeg <= 393164 and fl.nend >= 390956 and strand=1
> >> union
> >> select distinct f.name,fl.nbeg,fl.nend,fl.strand,f.type_id,f.feature_id
> >> from feature f, featureloc fl
> >> where
> >> fl.srcfeature_id = 1 and
> >> f.feature_id = fl.feature_id and
> >> fl.nbeg >= 390956 and fl.nend <= 393164 and strand=-1
> >>
> >> Here is the odd thing that I can't explain: the performance (read:index
> >> usage) seems to vary by srcfeature_id. Here's a table showing what I
> >> mean (values are msec from explain analyze):
> >>
> >> src_id Query1 Query2
> >> 1 116433.34 209532.05
> >> 2 979.55 1039.81
> >> 4 36.76 251.25
> >> 5 369.95 901.55
> >> 6 108404.04 208322.54
> >> 7 798.28 1131.86
> >>
> >> Note that srcfeature_id 4 is much faster because there are relatively
> >> few features on it. The biggest difference is srcfeature_ids 1 and 6
> >> taking 3 orders of magnitude longer than the other queries. That is
> >> because the query planner uses seq scans when the srcfeature_id is 1 or
> >> 6, but for other srcfeature_ids, it uses the appropriate index (either
> >> featureloc_src_nbeg_nend or featureloc_src_min_max depending on the
> >> query). I am at a total loss as to why postgres would do this; can
> >> anyone with more experience with postgres enlighten me?
> >>
> >> Note that this was done with gadfly 3 (filename gadfly_chado_dump2.gz)
> >> if it matters to anyone. I will probably take this to the postgres
> >> mailing list as well to get there input.
> >>
> >> Thanks,
> >> Scott
> >>
> >>
> >> --
> >> ------------------------------------------------------------------------
> >> Scott Cain, Ph. D. ca...@cs...
> >> GMOD Coordinator (http://www.gmod.org/) 216-392-3087
> >> Cold Spring Harbor Laboratory
> >>
> >>
> >>
> >> -------------------------------------------------------
> >> This SF.net email is sponsored by: Scholarships for Techies!
> >> Can't afford IT training? All 2003 ictp students receive scholarships.
> >> Get hands-on training in Microsoft, Cisco, Sun, Linux/UNIX, and more.
> >> www.ictp.com/training/sourceforge.asp
> >> _______________________________________________
> >> Gmod-schema mailing list
> >> Gmo...@li...
> >> https://lists.sourceforge.net/lists/listinfo/gmod-schema
> >>
> >>
>
>
> -------------------------------------------------------
> This SF.net email is sponsored by: Scholarships for Techies!
> Can't afford IT training? All 2003 ictp students receive scholarships.
> Get hands-on training in Microsoft, Cisco, Sun, Linux/UNIX, and more.
> www.ictp.com/training/sourceforge.asp
> _______________________________________________
> Gmod-schema mailing list
> Gmo...@li...
> https://lists.sourceforge.net/lists/listinfo/gmod-schema
--
------------------------------------------------------------------------
Scott Cain, Ph. D. ca...@cs...
GMOD Coordinator (http://www.gmod.org/) 216-392-3087
Cold Spring Harbor Laboratory
|