Why not Latent Semantic Indexing (LSI)?
In short: it's patented by Telcordia (http://lsi.research.telcordia.com/).
While that is reason enough, there are many other reasons not to use LSI. For one, its speed is inextricably linked to the size of the corpus. The larger the corpus, the longer it takes to add, delete, or modify an item, and the longer it takes to search. It does work quite well and returns reasonable search results, if you are patient enough to wait for them. The larger the corpus, the more raw processing power you need to crunch the data. Because LSI relies on matrix algebra, in particular Singular Value Decomposition (SVD), it has to do a ton of math on everything it touches. Strictly speaking, each item added to the corpus requires a complete recalculation of the matrix using SVD. There are shortcuts that allow a small number of changes without a complete recalculation, but once the changes pass a certain threshold, the entire corpus must be reprocessed.
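To make the SVD cost concrete, here is a minimal sketch in Python with NumPy. It is not Telcordia's implementation, just the textbook formulation: factor a term-by-document count matrix, keep the top k singular values, and compare documents in that reduced space. The toy matrix, the function names (build_lsi, fold_in, rank), and the choice of k = 2 are all my own assumptions for illustration. The fold_in function shows the commonly described "folding-in" shortcut for projecting a new item without rerunning the SVD; anything beyond a few such additions means calling build_lsi on the whole matrix again.

```python
import numpy as np

# Toy term-by-document count matrix (rows: terms, columns: documents).
# Terms: ship, boat, ocean, tree, wood. The data is made up for illustration.
A = np.array([
    [1, 0, 1, 0],   # ship
    [0, 1, 1, 0],   # boat
    [1, 1, 0, 0],   # ocean
    [0, 0, 0, 1],   # tree
    [0, 0, 1, 1],   # wood
], dtype=float)

def build_lsi(term_doc, k):
    """Full SVD of the matrix, truncated to the top k singular values.

    This is the expensive step: it touches every entry of the corpus matrix.
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Rows of V_k are the k-dimensional document vectors.
    return U[:, :k], s[:k], Vt[:k, :].T

def fold_in(term_counts, U_k, s_k):
    """Project a term-count vector into the existing latent space.

    This is the shortcut: no new SVD, but the space itself is not updated.
    """
    return (term_counts @ U_k) / s_k

def rank(query_vec, doc_vecs):
    """Return document indices sorted by cosine similarity, best first."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return np.argsort(-(d @ q))

U_k, s_k, doc_vecs = build_lsi(A, k=2)

# Query for "tree wood": fold it in, then rank the documents against it.
query = np.array([0, 0, 0, 1, 1], dtype=float)
order = rank(fold_in(query, U_k, s_k), doc_vecs)
```

Folding in a column that is already in the matrix reproduces that document's row in doc_vecs, which is a handy sanity check that the projection is consistent with the decomposition.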
This is why I went searching for a better method, one that was less math intensive and could run on a 2 GHz PC in a reasonable amount of time. Vector search was interesting, but it requires comparing the query to every item in the corpus to get a match score. Contextual Network Graphs (CNG) turned out to be the best way to accomplish my goal: a simple, straightforward method of crunching large amounts of information in a reasonable amount of time and indexing it for context-based searching. Watch for my next post for more on how this works.
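As an aside on the vector search drawback mentioned above, here is a rough sketch of what "comparison to every item" looks like in plain Python. The documents, the query, and the function names are hypothetical; the point is only that one query costs one similarity computation per document, so the work grows with the corpus.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-count dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def vector_search(query, docs):
    """Score the query against every document, then sort best first."""
    scored = [(cosine(query, doc), index) for index, doc in enumerate(docs)]
    scored.sort(reverse=True)
    return scored

# Made-up corpus of three tiny documents.
docs = [
    {"ship": 1, "ocean": 1},
    {"tree": 1, "wood": 1},
    {"boat": 1, "ocean": 1},
]
results = vector_search({"ocean": 1, "ship": 1}, docs)
```

Every call to vector_search walks the full list of documents, which is fine for three items and painful for millions.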

Comments:
At 4:43 AM,
Anonymous said…
What about the LSI patent?
Do we have to pay them to use the LSI algorithm?
What about writing a paper, or building commercial software?