<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-8776943</id><updated>2011-12-14T18:51:20.796-08:00</updated><title type='text'>Contextual Searching</title><subtitle type='html'>Searching for the context of a message using various techniques and algorithms.  Everything from Vector Search to Latent Semantic Indexing (LSI) to Contextual Network Graphs (CNG) will be discussed.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://mdlucas.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8776943/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://mdlucas.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Marshall Lucas</name><uri>http://www.blogger.com/profile/06445941101504445085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>3</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-8776943.post-113865741266884816</id><published>2006-01-30T13:34:00.000-08:00</published><updated>2006-01-30T13:43:32.706-08:00</updated><title type='text'>Finding the needle in a haystack</title><content type='html'>What's this all about?  Why do we need a way to search raw text for information?  
What sort of information can we expect to glean from raw text?&lt;br /&gt;&lt;br /&gt;These are but a few of the questions raised about LSI and other contextual search techniques.  The simplest answer is that we can better search the plethora of websites if we can find them based on context rather than raw keywords.  Regional differences in how we tend to explain our topic are but a small fraction of the reasons for this.  We all tend to phrase things differently, and because languages are living, changing things, new ways of saying the same thing emerge daily.  This is where contextual search can shine.  What we are looking for is the "best" fit for our search so that we can spend less time researching a topic and more time learning about it. &lt;br /&gt;&lt;br /&gt;A quick example: I want to research an investment in biotech stocks.  I pick a few major contenders I have read about and I begin looking on the Internet to find out more about them.  I find a few good articles, but I have to read through thousands of bad ones to find the good ones.  Now, if I could only take the several thousand articles from a more general search engine (like Google or MSN) and further "sort" the articles based on the overall theme of them all, the cream would rise to the top.  This is what a Contextual Network Graph does, given the proper tuning.  So, what we want is a quick search and retrieve from popular search engines, and then we want to process those documents in some way that causes the sentences, paragraphs, etc. that carry the most content to bubble to the surface.  We might even want to apply constraints or further search criteria to limit what we get back.  By doing this we manage to read far fewer documents to find the "good" one we want.  
The proverbial needle in the haystack.</content><link rel='replies' type='application/atom+xml' href='http://mdlucas.blogspot.com/feeds/113865741266884816/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8776943&amp;postID=113865741266884816' title='43 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8776943/posts/default/113865741266884816'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8776943/posts/default/113865741266884816'/><link rel='alternate' type='text/html' href='http://mdlucas.blogspot.com/2006/01/finding-needle-in-haystack.html' title='Finding the needle in a haystack'/><author><name>Marshall Lucas</name><uri>http://www.blogger.com/profile/06445941101504445085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>43</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8776943.post-113043642129792512</id><published>2005-10-27T10:54:00.000-07:00</published><updated>2005-10-27T11:07:01.316-07:00</updated><title type='text'>Why not Latent Semantic Indexing (LSI)?</title><content type='html'>In short: It's patented by Telcordia &lt;a href="http://lsi.research.telcordia.com/"&gt;http://lsi.research.telcordia.com/&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;While that is reason enough, there are many other reasons not to use LSI.  For one, its speed is inextricably linked to the size of the corpus.  The larger the corpus, the longer it takes to add, delete, or modify an item within the corpus and the longer it takes to search the corpus.  
However, it does work quite well and returns a reasonable search result, if you are patient enough to wait for it.  Also, the larger the corpus, the more raw processing power you need to crunch the data.  Because it makes use of matrix algebra, in particular Singular Value Decomposition (SVD), it has to do a ton of math on everything it touches.  And each item added to the corpus requires a complete recalculation of the matrix using SVD.  Of course, there are shortcuts that allow a small number of changes without a complete recalculation, but once a threshold is reached, the entire corpus must be reprocessed.&lt;br /&gt;&lt;br /&gt;This is why I went searching for a better method that was less math intensive and could run on a 2GHz PC in a reasonable amount of time.  Vector search was interesting, but it requires comparing the query to every item to get a matching score.  Contextual Network Graphs (CNG) turned out to be the best way to accomplish my goal of a simple, straightforward method of crunching large amounts of information in a reasonable amount of time and indexing it for context-based searching.  
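&lt;br /&gt;&lt;br /&gt;To make the cost of plain vector search concrete, here is a minimal sketch of the "compare against every item" step.  It assumes raw term counts as the vectors and cosine similarity as the score; both are my assumptions for illustration, and the names term_vector, cosine and vector_search are hypothetical rather than taken from any particular library.&lt;br /&gt;&lt;pre&gt;
```python
import math
from collections import Counter


def term_vector(text):
    """Turn raw text into a bag-of-words vector of raw term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    # A Counter returns 0 for a missing term, so unshared terms add nothing.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def vector_search(query, corpus):
    """Score the query against every document and rank by score.

    The full scan over the corpus is the cost described in the post:
    one comparison per item, for every single query.
    """
    q = term_vector(query)
    scored = [(cosine(q, term_vector(doc)), doc) for doc in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```
&lt;/pre&gt;Calling vector_search("biotech stocks", corpus) on a list of document strings returns (score, document) pairs with the best match first.  The work grows linearly with the size of the corpus, which is the limitation that sent me looking further.&lt;br /&gt;&lt;br /&gt;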
Watch for my next post for more on how this works.</content><link rel='replies' type='application/atom+xml' href='http://mdlucas.blogspot.com/feeds/113043642129792512/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8776943&amp;postID=113043642129792512' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8776943/posts/default/113043642129792512'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8776943/posts/default/113043642129792512'/><link rel='alternate' type='text/html' href='http://mdlucas.blogspot.com/2005/10/why-not-latent-semantic-indexing-lsi.html' title='Why not Latent Semantic Indexing (LSI)?'/><author><name>Marshall Lucas</name><uri>http://www.blogger.com/profile/06445941101504445085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8776943.post-112982453026952821</id><published>2005-10-20T08:44:00.000-07:00</published><updated>2005-10-20T09:08:50.296-07:00</updated><title type='text'>Overview</title><content type='html'>Traditional search methodologies are defined by the keywords selected from the individual documents that are added to the search base (called a corpus).  Much has been done to optimize this approach: to make it more generic, to check spelling more robustly, and to offer suggestions for better search criteria.  What is still being researched is the ability to provide a true contextual comparison of the search criteria with the corpus. 
&lt;br /&gt;&lt;br /&gt;Contextual search methods are many and varied, but two seem to be bubbling to the top right now: Latent Semantic Indexing (LSI) and Contextual Networks (CN) of various types.  Vector search was an early attempt to find this sort of information and is the basis for other methods as well.  When it's all boiled down, most contextual search methods use some sort of vector analysis; it's the preprocessing and postprocessing that make the difference.&lt;br /&gt;&lt;br /&gt;One method that is very different is the Contextual Network Graph (CNG), which constructs a multidimensional search graph, similar to a Social Network Graph, to search for documents that are contextually similar to the search criteria, or even to one another (clustering).  CNGs are very useful because their corpora can be extended or otherwise modified on the fly without the need to recalculate the entire index structure.  LSI can likewise be modified on the fly without a full recalculation, but only up to a set point; after that, the whole thing must be recalculated to maintain index integrity.  The typical implementation of CNGs recalculates a large portion of the graph after each new item is added to the corpus.  My technique does not require this; it only updates the base index counts, because the remainder of the calculation is completed during traversal of the graph. &lt;br /&gt;&lt;br /&gt;Calculating the final term weights on the fly is accomplished by simplifying the equations used to calculate the weights and then updating simple sums whenever items are added to the index, or removed from it for that matter.  The remaining simple calculations, which account for the term's global weight or its entropy value, are handled during traversal.  This splits the task in half and cuts the time by even more, as only the small portion of the graph that is traversed is calculated during the traversal.  I have found this method to be extremely efficient in practice.  
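&lt;br /&gt;&lt;br /&gt;As a rough illustration of the "simple sums now, weights later" idea, here is a minimal sketch.  It is not my production code: it assumes the common log-entropy global weight, 1 + sum_j(p_j * log p_j) / log(n) with p_j = tf_j / gf, and relies on the identity sum_j(p_j * log p_j) = sum_j(tf_j * log tf_j) / gf - log(gf), so that only two running sums per term need updating when a document is added or removed.  The class name IncrementalIndex and its methods are hypothetical, and the graph traversal itself is left out.&lt;br /&gt;&lt;pre&gt;
```python
import math
from collections import Counter, defaultdict


class IncrementalIndex:
    """Keeps two running sums per term; weights are derived at query time."""

    def __init__(self):
        self.docs = {}                        # doc id to its term counts
        self.gf = defaultdict(int)            # term to total count in corpus
        self.tf_log_tf = defaultdict(float)   # term to sum of tf * log(tf)

    def add(self, doc_id, text):
        """Index a document by updating the sums; nothing is recalculated."""
        counts = Counter(text.lower().split())
        self.docs[doc_id] = counts
        for term, tf in counts.items():
            self.gf[term] += tf
            self.tf_log_tf[term] += tf * math.log(tf)

    def remove(self, doc_id):
        """Undo a document's contribution to the sums."""
        for term, tf in self.docs.pop(doc_id).items():
            self.gf[term] -= tf
            self.tf_log_tf[term] -= tf * math.log(tf)

    def global_weight(self, term):
        """Log-entropy weight, computed only for terms actually visited."""
        gf = self.gf.get(term, 0)
        if gf == 0:
            return 0.0          # term is not in the corpus
        n = len(self.docs)
        if n == 1:
            return 1.0          # log(1) is 0, so the entropy term is undefined
        entropy_sum = self.tf_log_tf[term] / gf - math.log(gf)
        return 1.0 + entropy_sum / math.log(n)
```
&lt;/pre&gt;With this layout, adding or removing an item touches only the terms in that item, and global_weight is evaluated only for the terms reached during a search.  A term concentrated in one document scores near 1, while a term spread evenly over all documents scores near 0.  Repeated removals can accumulate small floating-point drift in the second sum, so a real implementation might rebuild it occasionally.&lt;br /&gt;&lt;br /&gt;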
I have yet to compare it to the more traditional methods, but it definitely outstrips LSI at load time, takes much less processor power for the mathematical calculations, and requires zero matrix algebra operations.&lt;br /&gt;&lt;br /&gt;I hope to present more information as time goes by regarding my research into this type of contextual searching.  Please check in often and visit the Google Ads to help support this endeavour.&lt;br /&gt;&lt;br /&gt;Marshall Lucas</content><link rel='replies' type='application/atom+xml' href='http://mdlucas.blogspot.com/feeds/112982453026952821/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8776943&amp;postID=112982453026952821' title='32 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8776943/posts/default/112982453026952821'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8776943/posts/default/112982453026952821'/><link rel='alternate' type='text/html' href='http://mdlucas.blogspot.com/2005/10/overview.html' title='Overview'/><author><name>Marshall Lucas</name><uri>http://www.blogger.com/profile/06445941101504445085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>32</thr:total></entry></feed>
