DataFission: Is It a Dusie?
December 26, 2016
I know that some millennials are not familiar with the Duesenberg automobile. Why would that generation care about an automobile manufacturer that went out of business in 1937. My thought is that the Duesenberg left one nifty artifact: The word doozy which means something outstanding.
I thought of the Duesenberg “doozy” when I read “Unstructured Data Search Engine Has Roots in HPC.” HPC means high performance computing. The acronym suggests a massively parallel system just like the one to which the average mobile phone user has access. The name of the search engine is “Duse,” which here in Harrod’s Creek is pronounced “doozy.”
According to the write up:
One company hoping to tap into the morass of unstructured data is DataFission. The San Jose, California firm was founded in 2013 with the goal of productizing a scale-out search engine , called the Digital Universe Search Engine, or DUSE, that it claims can index just about any piece of data, and make it searchable from any Web-enabled device.
The key to Duse is pattern matching. This is a pretty good method; for example, Brainware used trigrams to power its search system. Since the company disappeared into Lexmark, I am not sure what happened to the company’s system. I think the n-gram patent is owned by a bank located near an abandoned Kodak facility.
The method of the system, as I understand it, is:
- Index content
- Put index into compressed tables
- Allow users to search the index.
The users can “search” by entering queries or dragging “images, videos, or audio files into Duse’s search bar or programmatically via REST APIs.”
What differentiates Duse? The write up states:
The secret sauce lies in how the company indexes the data. A combination of machine learning techniques, such as principal component analysis (PCA), clustering, and classification algorithms, as well as graph link analysis and “nearest neighbor” approach help to find associations in the data.
Dr. Harold Trease, the architect of the Duse system, says:
We generate a high-dimensional signature, a high-dimensional feature vector, that quantifies the information content of the data that we read through,” he says. “We’re not looking for features like dogs or cats or buildings or cars. We’re quantifying the information content related to the data that we read. That’s what we index and put in a database. Then if you pull out a cell phone and take a picture of the dog, we convert that to one of these high-dimensional signatures, and then we compare that to what’s in the database and we find the best matches.
He adds:
If we index a billion images, we’d end up with a billion points in this search space, and we can look at that search space it has structure to it, and the structure is fantastic. There’s all kinds these points and clusters and strands that connect things. It makes little less sense to humans, because we don’t see things like that. But to the code, it makes perfect sense.
The company’s technology dates from the 1990s and the search technology was part of the company’s medical image analysis and related research.
The write up reports:
The software itself, which today exists as a Python-based Apache Spark application, can be obtained as software product or fully configured on a hardware appliance called DataHunter.
For more information about the company, navigate to this link.
Stephen E Arnold, December 26, 2016