DataFission: Is It a Dusie?

December 26, 2016

I know that some millennials are not familiar with the Duesenberg automobile. Why would that generation care about an automobile manufacturer that went out of business in 1937? My thought is that the Duesenberg left one nifty artifact: The word “doozy,” which means something outstanding.

I thought of the Duesenberg “doozy” when I read “Unstructured Data Search Engine Has Roots in HPC.” HPC means high performance computing. The acronym suggests a massively parallel system just like the one to which the average mobile phone user has access. The name of the search engine is “Duse,” which here in Harrod’s Creek is pronounced “doozy.”

According to the write up:

One company hoping to tap into the morass of unstructured data is DataFission. The San Jose, California firm was founded in 2013 with the goal of productizing a scale-out search engine, called the Digital Universe Search Engine, or DUSE, that it claims can index just about any piece of data, and make it searchable from any Web-enabled device.

The key to Duse is pattern matching. This is a pretty good method; for example, Brainware used trigrams to power its search system. Since the company disappeared into Lexmark, I am not sure what happened to the company’s system. I think the n-gram patent is owned by a bank located near an abandoned Kodak facility.

The method of the system, as I understand it, is:

  1. Index content
  2. Put index into compressed tables
  3. Allow users to search the index.

The users can “search” by entering queries or dragging “images, videos, or audio files into Duse’s search bar or programmatically via REST APIs.”

What differentiates Duse? The write up states:

The secret sauce lies in how the company indexes the data. A combination of machine learning techniques, such as principal component analysis (PCA), clustering, and classification algorithms, as well as graph link analysis and “nearest neighbor” approach help to find associations in the data.

Dr. Harold Trease, the architect of the Duse system, says:

“We generate a high-dimensional signature, a high-dimensional feature vector, that quantifies the information content of the data that we read through,” he says. “We’re not looking for features like dogs or cats or buildings or cars. We’re quantifying the information content related to the data that we read. That’s what we index and put in a database. Then if you pull out a cell phone and take a picture of the dog, we convert that to one of these high-dimensional signatures, and then we compare that to what’s in the database and we find the best matches.”

He adds:

If we index a billion images, we’d end up with a billion points in this search space, and we can look at that search space it has structure to it, and the structure is fantastic. There’s all kinds these points and clusters and strands that connect things. It makes little less sense to humans, because we don’t see things like that. But to the code, it makes perfect sense.
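To make the signature idea concrete, here is a minimal sketch of "convert to a vector, store it, then find the best matches." It is not DataFission's method. The `signature` function is a toy stand-in (a folded byte histogram) for whatever feature extraction Duse actually performs, and the `SignatureIndex` class, its method names, and the use of cosine similarity are my assumptions for illustration only.

```python
import math


def signature(data: bytes, dims: int = 16) -> list[float]:
    """Toy feature vector (assumption): byte histogram folded into `dims` buckets,
    scaled to unit length so a dot product equals cosine similarity."""
    vec = [0.0] * dims
    for byte in data:
        vec[byte % dims] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # avoid dividing by zero on empty input
    return [x / norm for x in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


class SignatureIndex:
    """Hypothetical in-memory index: stores (key, signature) pairs and scans them all."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, key: str, data: bytes) -> None:
        # Index step: only the signature is kept, not the raw content.
        self._items.append((key, signature(data)))

    def search(self, data: bytes, k: int = 3) -> list[tuple[str, float]]:
        # Query step: convert the query to a signature, then rank stored items by similarity.
        query = signature(data)
        scored = sorted(((cosine(query, sig), key) for key, sig in self._items), reverse=True)
        return [(key, score) for score, key in scored[:k]]


index = SignatureIndex()
index.add("doc1", b"the quick brown fox")
index.add("doc2", b"zzzzzzzz")
index.add("doc3", b"lazy dog jumps")
results = index.search(b"the quick brown fox", k=2)
```

A brute-force scan like this would not survive a billion items. A real system would need some form of approximate nearest neighbor structure, which the write up does not describe.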

The company’s technology dates from the 1990s and the search technology was part of the company’s medical image analysis and related research.

The write up reports:

The software itself, which today exists as a Python-based Apache Spark application, can be obtained as software product or fully configured on a hardware appliance called DataHunter.

For more information about the company, navigate to this link.

Stephen E Arnold, December 26, 2016

Shorter Content Means Death for Scientific Articles

December 26, 2016

The digital age is a culture that subsists on digesting quick bits of information before moving on to the next. Scientific journals are hardly the herald of popular trends, but in order to maintain relevancy with audiences the journals are pushing for shorter articles. The shorter articles, however, present a problem for authors, says Ars Technica in “Scientific Publishers Are Killing Research Papers.”

Shorter articles are also pushed because scientific journals have limited pages to print.  The journals are also pressured to include results and conclusions over methods to keep the articles short.  The methods, in fact, are usually published in another publication labeled supplementary information:

Supplementary information doesn’t come in the print version of journals, so good luck understanding a paper if you like reading the hard copy. Neither is it attached to the paper if you download it for reading later—supplementary information is typically a separate download, sometimes much larger than the paper itself, and often paywalled. So if you want to download a study’s methods, you have to be on a campus with access to the journal, use your institutional proxy, or jump through whatever hoops are required.

The lack of methodological information can hurt researchers who rely on the extra facts to see if a study is relevant to their own work. The shortened articles also reference the supplementary materials, and without them it can be hard to understand the published results. The shorter scientific articles may be better for general interest, but if they lack significant information, then how can general audiences understand them?

In short, the supplementary material should be included online and should be easily accessed.

Whitney Grace, December 26, 2016

An Apologia for People. Big Data Are Just Peachy Keen

December 25, 2016

I read “Don’t Blame Big Data for Pollsters’ Failings.” The news about the polls predicting a victory for Hillary Clinton reached me in Harrod’s Creek five days after the election. Hey, Beyond Search is in rural Kentucky. It looks from the news reports and the New York Times’s odd letter about doing “real” journalism that the pundits predicted that the mare would win the US derby.

The write up explains that Big Data did not fail. The reason? The pollsters were not using Big Data. The sample sizes were about 1,000 people. Check your statistics book. In the back will be sample sizes for populations. If you have an older statistics book, you have to use a formula to work out the sample size yourself.
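For readers without a statistics book handy, here is a small sketch of one textbook way to compute a sample size. It assumes a simple random sample and uses Cochran's formula, n = z² · p(1 − p) / e², with an optional finite population correction. The function name and defaults (z = 1.96 for 95 percent confidence, p = 0.5 as the worst case) are my choices, not anything the write up specifies.

```python
import math
from typing import Optional


def sample_size(margin: float,
                confidence_z: float = 1.96,
                proportion: float = 0.5,
                population: Optional[int] = None) -> int:
    """Respondents needed for a given margin of error (assumes simple random sampling)."""
    # Cochran's formula for a large (effectively infinite) population.
    n0 = (confidence_z ** 2) * proportion * (1 - proportion) / (margin ** 2)
    if population is not None:
        # Finite population correction: small populations need fewer respondents.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


national_poll = sample_size(0.03)                 # plus or minus 3 points at 95 percent
small_town = sample_size(0.03, population=10_000)
```

With these defaults, a margin of plus or minus three points works out to 1,068 respondents, which squares with the "about 1,000 people" figure. The formula says nothing about whether the right people were sampled, which is the write up's actual complaint.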


Big Data doesn’t fool around with formulas. Big Data just uses “big data.” Is the idea that the bigger the data, the better the output?

The write up states that the problem was the sample itself: The actual humans.

The write up quotes a mid tier consultant from an outfit called Ovum which reminds me of eggs. I circled this statement:

“When you have data sets that are large enough, you can find signals for just about anything,” says Tony Baer, a big data analyst at Ovum. “So this places a premium on identifying the right data sets and asking the right questions, and relentlessly testing out your hypothesis with test cases extending to more or different data sets.”

The write up tosses in social media. Facebook takes the position that its information had minimal effect on the election. Nifty assertion that.

The solution is, as I understand the write up, to use a more real time system, different types of data, and math. The conclusion is:

With significant economic consequences attached to political outcomes, it is clear that those companies with sufficient depth of real-time behavioral data will likely increase in value.

My view is that hope and other distinctly human behaviors certainly threw an egg at reality. It is great to know that there is a fix and that Big Data emerge as the path forward. More work ahead for the consultants who often determine sample sizes by looking at Web sites like SurveySystem and get their sample from lists of contributors, a 20 something’s mobile phone contact list, or lists available from friends.

If you use Big Data, tap into real time streams of information, and do the social media mining—you will be able to predict the future. Sounds logical? Now about that next Kentucky Derby winner? Happy or unhappy holiday?

Stephen E Arnold, December 25, 2016

PageRank Revealed with a Superficial Glance

December 24, 2016

I read “19 Confirmed Google Ranking Factors.” The table below comes from my 2004 monograph The Google Legacy. You will be able to view a seven minute summary on December 20, 2016. The table in The Google Legacy consists of more than 100 factors used in the Google relevance system. Each of the PageRank elements was extracted from open source information; for example, journal articles, Google technical papers which were once easily available from Google, patents, various speeches, and blog posts. We estimated that the factors are tuned and modified to deal with hacks, tricks, and new developments. Here is an extract from the tables in The Google Legacy:


Imagine my surprise when I worked through the 19 factors in the article “19 Confirmed Google Ranking Factors.” My research suggested that by 2004, Google had layered on and inserted many factors which the company did not document. These adjustments have continued since 2004 when production of The Google Legacy began and changes could no longer be made to the book text.

The idea that one can influence PageRank by paying attention to a handful of content, format, update, and technical requirements is interesting for two reasons:

  1. It continues the simplification of the way people think about Google and its methods
  2. Google faces “inertia” when it comes to making changes in its core relevance methods; that is, it is easier for Google to “wrap” the core with new features than it is to make certain changes. That’s the reason there is an index for mobile search and an index for desktop search.

Here’s an example of the current thinking about Google’s relevance ranking methods from the article cited in this blog post: Links. Yep, PageRank relies on links. Think about IBM Almaden Clever and you get a good idea how this works. What Google has added were methods which pay attention to less crude signals. Google also pays attention to signals which “deduct” or “down check” a page or site. Transgressions include duplicate content and crude tricks to fool Google’s algorithms; for example, you click on a link that says “White House” and you see porn. This issue and thousands of others have been “fixed” by Google engineers. My 2004 listing of 100 factors is a fraction of the elements the Google relevance systems process.
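For those who have not seen how links become a score, here is a minimal sketch of the textbook PageRank calculation from the original published description: power iteration over a link graph with a damping factor. It is emphatically not Google's production system, which layers many other signals on top. The function name, the 0.85 damping value, the iteration count, and the three-page toy graph are illustrative assumptions.

```python
def pagerank(links: dict[str, list[str]],
             damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Textbook PageRank by power iteration. `links` maps a page to the pages it links to."""
    pages = set(links) | {target for targets in links.values() for target in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal scores
    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            outgoing = links.get(page, [])
            if outgoing:
                # A page passes its score, split evenly, to the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


toy_web = {"a": ["c"], "b": ["c"], "c": ["a"]}
scores = pagerank(toy_web)
```

In the toy graph, page "c" collects links from both "a" and "b" and ends up with the top score, while "b", which nobody links to, lands at the bottom. That is the crude signal. The down checks and the other adjustments are where the real complexity lives.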

Another example of relevance simplification appears in “10 Google Search Ranking Factors Content Marketers Should Prioritize (And 3 You Shouldn’t).” Yep, almost 20 years of relevance tweaks boil down to a dozen rules. Hey, if these worked, why isn’t everyone in the SEO game generating oodles of traffic? Answer: The Google system is a bit more slippery and requires methods with more rubber studs on the SEO gym shoe.

The problem with boiling down Google’s method to a handful of checkpoints is that the simplification can impart false confidence. Do this and the traffic does NOT materialize. What happened? The answer is that a misstep has been introduced while doing the “obvious” search engine optimization tweak. To give one example, consider making changes to one’s site. Google notes frequency and type of changes to a Web site. How about those frequent and radical redesigns? How does Mother Google interpret that information?

Manipulating relevance in order to boost a site’s ranking in a results list can have some interesting consequences. Over the years, I have stated repeatedly that if a webmaster wants traffic, buy AdWords. The other path is to concentrate on producing content which other people want to read. Shortcuts and tricks can lead to some fascinating consequences and, of course, work for the so called search engine optimization experts.

Matt, Matt, where are you now? Oh, that’s right…

Stephen E Arnold, December 24, 2016

Google Buys Image Search: Invention Out

December 23, 2016

I read “Google Buys Shopping Search Startup to Make Images More Lucrative.” The Alphabet Google thing has been whacking away at image search for more than a decade. I have wondered why the GOOG’s whiz kids cannot advance beyond fiddling with the interface. Useful ways to slice and dice images are lacking at Google, but other vendors have decided to build homes on the same technical plateau. Good enough is the watchword for most information search and retrieval systems today.

The news that the Google is buying yet another outfit comes as no surprise. Undecidable Labs, founded by a denizen of Apple, wants to make it easy to see something and buy it.

Innovation became very hard for the Alphabet Google thing once it had cherry picked the low hanging fruit from research labs, failed Web search systems, and assorted disaffected employees from search, hardware, and content processing companies.

Now innovation comes from buying outfits that are nimble, think outside the Google box, and have something that is sort of real. According to the write up:

The acquisition suggests that Google, the largest unit of Alphabet Inc., is making further moves to tie its massive library of online image links with a revenue stream.

eBay is paddling into the same lagoon. The online flea market wants to make it easy for me to spot a product I absolutely must have, right now. Click it and be transported to an eBay page so I can buy that item. Google seems to be thinking along a similar line, just without the “old” system up and running. Google’s angle will make an attempt to hook a search into a product sale. Think of Google as an intermediary or broker, not a digital store with warehouses. Yikes, overhead. No way at the GOOG. Not logical, right?

Earlier efforts around online commerce have delivered mixed results at Google. The company’s mobile payments have yet to see significant pickup. Its comparison shopping service, which facilitates online purchases within search results, has growing traction with advertisers, according to external estimates.

Perhaps one asset for the GOOG is that the founder is Cathy Edwards. I wonder if she wears blue jeans and a black turtle neck. What are the odds she uses an Apple iPhone?

Stephen E Arnold, December 23, 2016

On-Demand Business Model Not Sure Cash Flow

December 23, 2016

The on-demand car service Uber established a business model that startups in Silicon Valley and other cities are trying to replicate.  These startups are encountering more overhead costs than they expected and are learning that the on-demand economy does not generate instant cash flow.  The LA Times reports that, “On-Demand Business Models Have Put Some Startups On Life Support.”

Uber uses a business model revolving around independent contractors who use their own vehicles as a taxi service that responds to individual requests.  Other startups have sprung up around the same on-demand idea, but with a variety of services.  These include flower delivery service BloomThat, on-demand valet parking Zirx, on-demand meals Spoonrocket, and housecleaning with Homejoy.  The problem these on-demand startups are learning is that they have to deal with overhead costs, such as renting storage spaces, parking spaces, paying for products, delivery vehicles, etc.

Unlike Uber, which relies on the independent contractor to cover the costs of vehicles, other services cannot rely on the on-demand business model due to the other expenses.  The result is that cash is gushing out of their companies:

It’s not just companies that are waking up to the fact being “on-demand” doesn’t guarantee success — the investor tide has also turned.  As the downturn leads to more cautious investment, on-demand businesses are among the hardest-hit; funding for such companies fell in the first quarter of this year to $1.3 billion, down from $7.3 billion six months ago.  ‘If you look in venture capital markets, the on-demand sector is definitely out of favor,’ said Ajay Chopra, a partner at Trinity Ventures who is an investor in both Gobble and Zirx.

These new on-demand startups have had to change their business models in order to remain in business and that requires dismantling the on-demand service model.  On-demand has had its moment in the sun and will remain a lucrative model for some services, but until we invent instant teleportation most companies cannot run on that model.

Whitney Grace, December 23, 2016

Smart Software: An Annoying Flaw Will Not Go Away

December 22, 2016

“Machines May Never Master the Distinctly Human Elements of Language” captures one of the annoying flaws in smart software. Machines are not human, at least not yet. The write up explains that “intelligence is mysterious.” Okay, big surprise for some of the Sillycon Valley crowd.

The larger question is, “Why are some folks skeptical about smart software and its adherents’ claims?” Part of the reason is that publications have to show some skepticism after cheerleading.  Another reason is that marketing presents a vision of reality which often runs counter to one’s experience. Try using that voice stuff in a noisy subway car. How’s that working out?

The write up caught my attention with this statement from the Google, one of the leaders in smart software’s ability to translate human utterances:

“Machine translation is by no means solved. GNMT can still make significant errors that a human translator would never make, like dropping words and mistranslating proper names or rare terms, and translating sentences in isolation rather than considering the context of the paragraph or page.”

The write up quotes a Stanford wizard as saying:

She [wizard Li] isn’t convinced that the gap between human and machine intelligence can be bridged with the neural networks in development now, not when it comes to language. Li points out that even young children don’t need visual cues to imagine a dog on a skateboard or to discuss one, unlike machines.

My hunch is that quite a few people know that smart software works in some use cases and not in others. The challenge is to get those with vested interests and the marketing millennials to stick with “as is” without confusing the “to be” with what can be done with available tools. I am all in on research computing, but the assertions of some of the cheerleaders spell S-I-L-L-Y. Louder now.

Stephen E Arnold, December 22, 2016

Bank App Does Not Play Well with Tor Browser

December 22, 2016

Bank apps are a convenient way to access and keep track of your accounts.  They are mainly used on mobile devices and are advertised for the user on the go.  One UK bank app, however, refuses to play nice with devices that have the Tor browser, reports the Register in the article, “Tor Torpedoed!  Tesco Bank App Won’t Run With Privacy Tool Installed.”

Tesco is a popular bank present in supermarkets, but if you want to protect your online privacy by using the Tor browser on your mobile device, the Tesco app will not work on said device. Marcus Davage, a mainframe database administrator, alerted Tesco patrons that in order to use the Tesco app, they needed to delete the Tor browser. Why is this happening?

The issue appears to be related to security. Tesco’s help site notes that the Android app checks for malware and other possible security risks (such as the phone being rooted) upon launching and, in this case, the Tor software triggers an alert.  The Tor Project makes two apps for Android, the aforementioned Orbot and the Orfox browser, both of which allow users to encrypt their data traffic using the Tor network. According to the Play Store, Orbot has been downloaded more than five million times by Android users.

App developers need to take into account that the Tor browser is not malware.  Many users are concerned with their online privacy and protecting their personal information, so Tor needs to be recognized as a safe application.

Whitney Grace, December 22, 2016

Palantir Factoid: 2016 Government Contract Value

December 21, 2016

I read “Palantir CEO at Trump-Tech Summit Raises Red Flags.” The idea is that Palantir is a peanut when compared to publicly traded giants like IBM and Microsoft. The presence of Peter Thiel, an adviser to the Trumpeteers, adds some zip to both Facebook and Palantir. But Palantir’s Alex Karp was at the meeting as well. The idea is that the Trumpeteers continue to get stereophonic inputs about technology and other matters.

This is the factoid which caught my attention. I assume, of course, that everything I read online is dead center accurate:

Palantir received about $83 million from the government this year tied to 71 transactions, according to

What happens to Palantir’s bookings if some changes to the DCGS program come down the pike? Perhaps Palantir will be running some meetings at which giants like IBM are going to be eager participants. On the other hand, IBM and some of the folks at the Trumpeteers’ technology summit might not be happy.

Net net: I was dismayed at the modest bookings Palantir has garnered. I expected heftier numbers.

Stephen E Arnold, December 21, 2016

Lucidworks Sees Watson as a Savior

December 21, 2016

Lucidworks (really?). A vision has appeared to the senior managers of Lucidworks, an open source search outfit which has ingested $53 million and sucked in another $6 million in debt financing in June 2016. Yep, that Lucidworks. The “really” which the name invokes is an association I form when someone tells me that commercializing open source search is going to knock off the pesky Elastic of Elasticsearch fame while returning a juicy payoff to the folks who coughed up the funds to keep the company founded in 2007 chugging along. Yep, Lucid works. Sort of, maybe.

I read “Lucidworks Integrates IBM Watson into Fusion Enterprise Discovery Platform.” The write up explains that Lucidworks is “tapping into” the IBM Watson developer cloud. The write up explains that Lucidworks has:

an application framework that helps developers to create enterprise discovery applications so companies can understand their data and take action on insights.

Ah, so many buzzwords. Search has become applications. “Action on insights” puts some metaphorical meat on the bones of Solr, the marrow of Lucidworks. Really?

With Watson in the company’s back pocket, Lucidworks will deliver. I learned:

Customers can rely on Fusion to develop and deploy powerful discovery apps quickly thanks to its advanced cognitive computing features and machine learning from Watson. Fusion applies Watson’s machine learning capabilities to an organization’s unique and proprietary mix of structured and unstructured data so each app gets smarter over time by learning to deliver better answers to users with each query. Fusion also integrates several Watson services such as Retrieve and Rank, Speech to Text, Natural Language Classifier, and AlchemyLanguage to bolster the platform’s performance by making it easier to interact naturally with the platform and improving the relevance of query results for enterprise users.

But wait. Doesn’t Watson perform these functions already? And if Watson comes up a bit short in one area, isn’t IBM-infused Yippy ready to take up the slack?

That question is not addressed in the write up. It seems that the differences among Watson, its current collection of partners, and affiliated entities like Yippy are vast. The write up tells me:

customers looking for hosted, pre-tuned machine learning and natural language processing capabilities can point and click their way to building sophisticated applications without the need for additional resources. By bringing Watson’s cognitive computing technology to the world of enterprise data apps, these discovery apps made with Fusion are helping professionals understand the mountain of data they work with in context to take action.

This sounds like quite a bit of integration work. Lucidworks. Really?

Stephen E Arnold, December 21, 2016
