Textio is a Promising Text Analysis Startup
November 6, 2014
Here’s an interesting development from the world of text-processing technology. GeekWire reports, “Microsoft and Amazon Vets Form Textio, a New Startup Looking to Discover Patterns in Documents.” The new company expects to release its first product next spring. Writer John Cook tells us:
“Kieran Snyder, a linguistics expert who previously worked at Amazon and Microsoft’s Bing unit, and Jensen Harris, who spent 16 years at Microsoft, including stints running the user experience team for Windows 8, have formed a new data visualization startup by the name of Textio.
“The Seattle company’s tagline: ‘Turn business text into insights.’ The emergence of the startup was first reported by Re/code, which noted that the Textio tool could be used by companies to scour job descriptions, performance reviews and other corporate HR documents to uncover unintended discrimination. In fact, Textio was formed after Snyder conducted research on gender bias in performance reviews in the tech industry.”
That is an interesting origin, especially amid the discussions about gender that currently suffuse the tech community. Textio sees much room for improvement in text analytics and hopes to help clients reach insights beyond those that competing platforms can divine. CEO Snyder’s doctorate and experience in linguistics and cognitive science should give the young company an edge in a competitive field.
Cynthia Murrell, November 06, 2014
Sponsored by ArnoldIT.com, developer of Augmentext
Altegrity Kroll: Under Financial Pressure
October 30, 2014
Most of the name surfing search experts—like the fellow who sold my content on Amazon without my permission and used my name to boot—will not recall much about Engenium. That’s no big surprise. Altegrity Kroll owns the pioneering company in the value-added indexing business. Altegrity, as you may know, is the owner of the outfit that cleared Edward Snowden for US government work.
I read “Snowden Vetter Altegrity’s Loans Plunge: Distressed Debt”. In that article I learned:
Altegrity Inc., the security firm that vetted former intelligence contractor Edward Snowden, has about six months until it runs out of money as the loss of background-check contracts negates most of a July deal with lenders to extend maturities for five years.
The article reports that “selective default” looms for the company. With the lights flickering at a number of search and content processing firms, I hope that the Engenium technology survives. The system remains a leader in a segment which has a number of parvenus.
Stephen E Arnold, October 30, 2014
Amazon Learns from XML Adventurers
October 10, 2014
I recall learning a couple of years ago that Amazon was a great place to store big files. Some of the XML data management systems embraced the low prices and pushed forward with cloud versions of their services.
When I read “Amazon’s DynamoDB Gets Hugely Expanded Free Tier And Native JSON Support,” I formed some preliminary thoughts. The trigger was this passage in the write up:
many new NoSQL and relational databases (including Microsoft’s DocumentDB service) now use JSON-style document models. DynamoDB also allowed you to store these documents, but developers couldn’t directly work with the information stored in them. That’s changing today. With this update, developers can now use the AWS SDKs for Java, .NET, Ruby and JavaScript to easily map their JSON data to DynamoDB’s own data types. That turns DynamoDB into a fully-featured document store and is going to make life easier for many developers on the platform.
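To make the “native JSON” idea concrete, here is a minimal sketch of what that developer experience looks like. I am using boto3, the AWS SDK for Python, rather than the Java, .NET, Ruby, or JavaScript SDKs named in the write up; the table name, key, and item fields are my own illustrations, not anything from the announcement.

```python
# A minimal sketch of storing and reading a JSON-style document in DynamoDB
# with boto3. The "Documents" table and its "doc_id" key are hypothetical.
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("Documents")

# The SDK maps the nested Python dict to DynamoDB's native Map and List types.
table.put_item(
    Item={
        "doc_id": "press-release-001",
        "title": "Amazon expands DynamoDB",
        "tags": ["nosql", "json"],
        "meta": {"author": "jdoe", "words": 512},
    }
)

# Reading the item back returns the same nested structure.
response = table.get_item(Key={"doc_id": "press-release-001"})
print(response["Item"]["meta"]["words"])
```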
Is JSON better than XML? Is JSON easier to use than XML? Is JSON development faster than XML? Ask an XML rock star and the answer is probably, “You crazy.” I can hear the guitar riff from Joe Walsh now.
Ask a 20-year-old in a university programming class, and the answer may be different. I asked the 20-something sitting in my office about XML and he snorted: “Old school, dude.” I hire only people with respect for their elders, of course.
Here are the thoughts that flashed through my 70-year-old brain:
- Is Amazon getting ready to make a push for the customers of Oracle, MarkLogic, and other “real” database systems capable of handling XML?
- Will Amazon just slash prices, take the business, and make the 20-year-old in my office a customer for life just because Amazon is “new school”?
- Will Amazon’s developer love provide the JSON fan with development tools, dashboards, features, and functions that push clunky methods like proprietary XQuery messages into a reliquary?
No answers… yet.
Stephen E Arnold, October 10, 2014
New Spin for Unstructured Content
October 6, 2014
I read “Objective Announces Partnership with Active Navigation, Helping Organisations Reduce Unstructured Content by 30%.” The angle is different from some content marketing outfits’ approach to search. Instead of dragging out the frequently beaten horse “all available information,” the focus is trimming the fat.
According to the write up:
The Active Navigation and Objective solution allows organisations to quickly identify the important corporate data from their ROT [redundant, obsolete, and trivial] information, then clean and migrate it into a fully compliant repository.
How does the system work? I learned:
Following the Objective Information Audit, in which all data will be fully cleansed, an organisation’s content will be free of ROT information and policy violations. Immediate results from an Objective Information Audit typically deliver storage reductions of 30% to 50%, lowering the total cost of ownership for storage, while improving the performance of enterprise search, file servers, email servers and information repositories.
Humans plus smart software do the trick. In my view, this is an acknowledgement that the combination of subject matter experts and software delivers a useful solution. The approach does work as long as the humans do not suffer “indexing fatigue” or budget cuts. Working time can also present a challenge. Management can lose patience.
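As a minimal sketch of one slice of the job, the “R” in ROT (redundant) can be approximated with nothing more than content hashing. Real products layer fuzzy matching and policy rules on top of this; the directory path here is illustrative, not anything from the Objective announcement.

```python
# A minimal sketch of finding exact duplicate files by content hash.
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root: str) -> dict[str, list[Path]]:
    seen = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            seen[digest].append(path)
    # Keep only digests that appear more than once: the redundant content.
    return {d: ps for d, ps in seen.items() if len(ps) > 1}

for digest, paths in find_duplicates("/srv/fileshare").items():
    print(f"{len(paths)} copies:", *paths, sep="\n  ")
```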
Will organizations embrace an approach familiar to those who used decades old systems? Since some state of the art automated systems have delivered mixed results, perhaps a shift in methods is worth a try.
Stephen E Arnold, October 6, 2014
Autumn Approaches: Time for Realism about Search
September 1, 2014
Last week I had a conversation with a publisher who has a keen interest in software that “knows” what content means. Armed with that knowledge, a system can then answer questions.
The conversation was interesting. I mentioned my presentations for law enforcement and intelligence professionals about the limitations of modern and computationally expensive systems.
Several points crystallized in my mind. One of these is addressed, in part, in a diagram created by the scikit-learn project for people interested in machine learning methods. Here’s the diagram:
The diagram is designed to help a developer select from different methods of performing estimation operations. The author states:
Often the hardest part of solving a machine learning problem can be finding the right estimator for the job. Different estimators are better suited for different types of data and different problems. The flowchart below is designed to give users a bit of a rough guide on how to approach problems with regard to which estimators to try on your data.
First, notice that there is a selection process for choosing a particular numerical recipe. Now who determines which recipe is the right one? The answer is the coding chef. A human exercises judgment about a particular sequence of operations that will be used to fuel machine learning. Is that sequence of actions the best one, the expedient one, or the one that seems to work for the test data? The answer to these questions determines a key threshold for the resulting “learning system.” Stated another way, “Does the person licensing the system know if the numerical recipe is the most appropriate for the licensee’s data?” Nah. Does a mid tier consulting firm like Gartner, IDC, or Forrester dig into this plumbing? Nah. Does it matter? Oh, yeah. As I point out in my lectures, the “accuracy” of a system’s output depends on this type of plumbing decision. Unlike a backed-up drain, flaws in smart systems may never be discerned. For certain operational decisions, financial shortfalls or the loss of an operations team in a war theater can be attributed to one of many variables. As decision makers chase the Silver Bullet of smart, thinking software, who really questions the output in a slick graphic? In my experience, darned few people. That includes cheerleaders for smart software, azure chip consultants, and former middle school teachers looking for a job as a search consultant.
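As a minimal sketch of that judgment call, consider one synthetic dataset run through two different scikit-learn estimators. Two recipes, two accuracy numbers, and a chef who has to decide which one ships. The toy data is illustrative.

```python
# A minimal sketch of the coding chef's choice: same data, two estimators,
# two different scores. The synthetic dataset stands in for real content.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for estimator in (LinearSVC(), GaussianNB()):
    score = estimator.fit(X_train, y_train).score(X_test, y_test)
    print(f"{type(estimator).__name__}: {score:.3f}")
```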
Second, notice the reference to a “rough guide.” The real guide is an understanding of how specific numerical recipes work on a set of data that allegedly represents what the system will process when operational. Furthermore, there are plenty of mathematical methods available. The problem is that some of the more interesting procedures lead to increased computational cost. In the worst case, the more interesting procedures cannot be computed on available resources. Some developers know about P=NP and Big O. Others know to use the same nine or ten mathematical procedures taught in computer science classes. After all, why worry about math based on mereology if the machine resources cannot handle the computations within time and budget parameters? This means that most modern systems are based on a set of procedures that are computationally affordable, familiar, and convenient. Does this similarity of procedures matter? Yep. The generally squirrely outputs from many very popular systems are perceived as completely reliable. Unfortunately, the systems are performing within a narrow range of statistical confidence. Stated more harshly, the outputs are just not particularly helpful.
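To put a number on the computational cost point, here is a minimal sketch of how fast an exact all-pairs comparison blows up as a collection grows. The collection sizes are illustrative.

```python
# A minimal sketch of the cost wall: an exact all-pairs comparison needs
# n(n-1)/2 operations, so the work grows quadratically with collection size.
def pairwise_comparisons(n: int) -> int:
    return n * (n - 1) // 2  # every item against every other item, once

for n in (1_000, 100_000, 10_000_000):
    print(f"{n:>12,} documents -> {pairwise_comparisons(n):>26,} comparisons")
```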
In my conversation with the publisher, I asked several questions:
- Is there a smart system like Watson that you would rely upon to treat your teenaged daughter’s cancer? Or, would you prefer the human specialist at the Mayo Clinic or comparable institution?
- Is there a smart system that you want directing your only son in an operational mission in a conflict in a city under ISIS control? Or, would you prefer the human-guided decision near the theater about the mission?
- Is there a smart system you want managing your retirement funds in today’s uncertain economy? Or, would you prefer the recommendations of a certified financial planner relying on a variety of inputs, including analyses from specialists in whom your analyst has confidence?
When I asked these questions, the publisher looked uncomfortable. The reason is that the massive hyperbole and marketing craziness about fancy new systems creates what I call the Star Trek phenomenon. People watch Captain Kirk talking to devices, transporting himself from danger, and traveling between far flung galaxies. Because a mobile phone performs some of the functions of the fictional communicator, it sure seems as if many other flashy sci-fi services should be available.
Well, this Star Trek phenomenon does help direct some research. But in terms of products that can be used in high risk environments, the sci-fi remains a fiction.
Believing and expecting are different from working with products that are limited by computational resources, expertise, and informed understanding of key factors.
Humans, particularly those who need money to pay the mortgage, ignore reality. The objective is to close a deal. When it comes to information retrieval and content processing, today’s systems are marginally better than those available five or ten years ago. In some cases, today’s systems are less useful.
Stephen E Arnold, September 1, 2014
I2E Semantic Enrichment Unveiled by Linguamatics
July 21, 2014
The article titled “Text Analytics Company Linguamatics Boosts Enterprise Search with Semantic Enrichment” on MarketWatch discusses the launch of I2E Semantic Enrichment from Linguamatics. The new release allows for the mining of a variety of texts, from scientific literature to patents to social media. It promises faster, more relevant search for users. The article states,
“Enterprise search engines consume this enriched metadata to provide a faster, more effective search for users. I2E uses natural language processing (NLP) technology to find concepts in the right context, combined with a range of other strategies including application of ontologies, taxonomies, thesauri, rule-based pattern matching and disambiguation based on context. This allows enterprise search engines to gain a better understanding of documents in order to provide a richer search experience and increase findability, which enables users to spend less time on search.”
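As a minimal sketch of the kind of concept-level enrichment the quote describes, consider tagging text against a tiny ontology so the concepts can be indexed alongside the words. The terms and concept identifiers below are my own illustrations; I2E’s actual NLP pipeline is, of course, far more sophisticated than a string lookup.

```python
# A minimal sketch of semantic enrichment: map surface terms and synonyms to
# concept identifiers, then attach the concepts to the document as metadata.
ONTOLOGY = {
    "aspirin": "DRUG/acetylsalicylic-acid",
    "acetylsalicylic acid": "DRUG/acetylsalicylic-acid",
    "myocardial infarction": "DISEASE/heart-attack",
    "heart attack": "DISEASE/heart-attack",
}

def enrich(text: str) -> dict:
    lowered = text.lower()
    concepts = {cid for term, cid in ONTOLOGY.items() if term in lowered}
    return {"text": text, "concepts": sorted(concepts)}

print(enrich("Aspirin reduces the risk of a second heart attack."))
# -> concepts: ['DISEASE/heart-attack', 'DRUG/acetylsalicylic-acid']
```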
Whether it is semantics spun for search or search spun for semantics, Linguamatics has made its technology available to tens of thousands of enterprise search users. Linguamatics’ John M. Brimacombe was straightforward in his comments about the disappointment surrounding enterprise search, but optimistic about I2E. It is currently being used by many top organizations, as well as the Food and Drug Administration.
Chelsea Kerwin, July 21, 2014
Sponsored by ArnoldIT.com, developer of Augmentext
Information Manipulation: Accountability Pipe Dream
July 5, 2014
I read an article with what I think is the original title: “What does the Facebook Experiment Teach us? Growing Anxiety About Data Manipulation.” I noted that the title presented on Techmeme was “We Need to Hold All Companies Accountable, Not Just Facebook, for How They Manipulate People.” In my view, this mismatch of titles is a great illustration of information manipulation. I doubt that the writer of the improved headline is aware of the irony.
The ubiquity of information manipulation is far broader than Facebook twirling the dials of its often breathless users. Navigate to Google and run this query:
cloud word processing
Note anything interesting in the results list displayed for me on my desktop computer:
The number one ad is for Google. In the first page of results, Google’s cloud word processing system is listed three more times. I did not spot Microsoft Office in the cloud except in item eight: “Is Google Docs Making Microsoft Word Redundant?”
For most Google search users, the results are objective. No distortion evident.
Here’s what Yandex displays for the same query:
No Google word processing and no Microsoft word processing whether in the cloud or elsewhere.
When it comes to searching for information, the notion that a Web indexing outfit is displaying objective results is silly. The Web indexing companies are in the forefront of distorting information and manipulating users.
Flash back to the first year of the Bush administration when Richard Cheney was vice president. I was in a meeting where a request was considered: make sure that the vice president’s office Web site would appear in a prominent position in FirstGov.gov results. This, gentle reader, is a request that calls for hit boosting. The idea is to write a script or configure the indexing plumbing to make darned sure a specific url or series of documents appears when and where they are required. No problem, of course. We created a stored query for the Fast Search & Transfer search system and delivered what the vice president wanted.
This type of results manipulation is more common than most people accept. Fiddling Web search, like shaping the flow of content along a particular semantic vector, is trivial. Search engine optimization is a fool’s game compared with the tried-and-true methods of weighting results or just buying real estate on a search results page or a Web site from a “real” company.
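As a minimal sketch of the mechanics, hit boosting can be as crude as a post-processing pass that bumps any result matching a favored pattern above the organic hits. The result list and the favored pattern below are illustrative, not the actual FirstGov.gov configuration.

```python
# A minimal sketch of hit boosting: pin results matching a favored URL
# pattern to the top of the list, regardless of their organic scores.
results = [
    {"url": "https://www.irs.gov/forms", "score": 0.88},
    {"url": "https://www.whitehouse.gov/vp", "score": 0.41},
    {"url": "https://www.usda.gov/topics", "score": 0.63},
]

FAVORED = "whitehouse.gov/vp"  # hypothetical boosted property

def boosted_score(hit: dict) -> float:
    # A large constant guarantees favored hits outrank any organic score.
    return hit["score"] + (10.0 if FAVORED in hit["url"] else 0.0)

for hit in sorted(results, key=boosted_score, reverse=True):
    print(f'{boosted_score(hit):6.2f}  {hit["url"]}')
```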
The notion that disinformation, reformation, and misinformation will be identifiable, rectified, and used to hold companies accountable is not just impossible. The notion itself reveals how little awareness there is of how the actual methods of digital content injection work.
How much of the content on Facebook, Twitter, and other widely used social networks is generated by intelligence professionals, public relations “professionals,” and folks who want to be perceived as intellectual luminaries? Whatever your answer, what data do you have to back up your number? At a recent intelligence conference in Dubai, one specialist estimated that half of the traffic on social networks is shaped or generated by law enforcement and intelligence entities. Do you believe that? Probably not. So good for you.
Amusing, but as someone once told me, “Ignorance is bliss.” So, hello, happy idealists. The job is identifying, interpreting, and filtering. Tough, time consuming work. Most of the experts prefer to follow the path of least resistance and express shock that Facebook would toy with its users. Be outraged. Call for action. Invent an algorithm to detect information manipulation. Let me know how that works out when you look for a restaurant and it is not findable from your mobile device.
Stephen E Arnold, July 5, 2014
Elasticsearch: Bulldozing Content Processing
June 7, 2014
When I left the intelligence conference in Prague, there were a number of companies in my graphic about open source search. When I got off the airplane, I edited my slide. Looks to me as if Elasticsearch has just bulldozed the commercialized open source segment of the search and content processing sector. I would not want to be the CEO of LucidWorks, Ikanow, or any other open sourcey search and content processing company this weekend.
I read “Elasticsearch Scores $70 Million to Help Sites Crunch Tons of Data Fast.” Forget the fact that Elasticsearch is built on Lucene and some home grown code. Ignore the grammar in “data fast.” Skip over the sports analogy “scores.” Dismiss the somewhat narrow definition of what the Elasticsearch ELK stack can really deliver.
What’s important is the $70 million committed to Elasticsearch. Added to the $30 or $40 million the outfit had obtained before, we are looking at a $100 million bet on an open source search based business. Compare this to the trifling $40 million the proprietary vendor Coveo had gathered or the $30 million put on LucidWorks to get into the derby.
I have been pointing out that Elasticsearch has several advantages over its open source competitors; namely, developers, developers, and developers.
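As a minimal sketch of that developer appeal: indexing and searching a document takes a handful of lines with recent versions of the official Elasticsearch Python client. The host, index name, and document are illustrative.

```python
# A minimal sketch of the developer experience: index one document, refresh,
# and run a full text query. Assumes a local Elasticsearch node.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.index(index="articles", id=1, document={
    "title": "Elasticsearch raises $70 million",
    "body": "Open source search built on Lucene attracts big money.",
})
es.indices.refresh(index="articles")  # make the document searchable now

hits = es.search(index="articles", query={"match": {"body": "lucene"}})
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["title"])
```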
Now I want to point out that it has another angle of attack: money, money, and money.
With the silliness of the search and content processing vendors’ marketing over the last two years, I think we have the emergence of a centralizing company.
No, it’s not HP’s new cloudy Autonomy. No, it’s not the wonky Watson game and recipe code from IBM. No, it’s not the Google Search Appliance, although I do love the little yellow boxes.
I will be telling those who attend my lectures to go with Elasticsearch. That’s where the developers and the money are.
Stephen E Arnold, June 7, 2014
Watson: The Most Gifted Digital Chef Using Butternut Squash
May 30, 2014
Does silicon have taste buds? Do algorithms sniff the essence of Kentucky barbecue?
I read a darned amazing article called “I Tasted BBQ Sauce Made By IBM’s Watson, And Loved It.” The write up reports that IBM and partner Co.Design used the open source, home grown code, and massive database to whip up a recipe for grilling. IBM is going whole hog with the billion dollar baby Watson, which is supposed to be one of IBM’s revenue fountains any day now.
According to the write up, which may or may not have the ingredients of a “real” news story:
Most BBQ sauces start with ingredients like vinegar, tomatoes, or even water, but IBM’s stands out from the get go. Ingredient one: White wine. Ingredient two: Butternut squash. The list contains more Eastern influences, such as rice vinegar, dates, cilantro, tamarind (a sour fruit you may know best from Pad Thai), cardamom (a floral seed integral to South Asian cuisine) and turmeric (the yellow powder that stained the skull-laden sets of True Detective) alongside American BBQ sauce mainstays molasses, garlic, and mustard.
And most important for the grillin’ fans in Harrod’s Creek, the author used the Watson concoction on tofu. I am not sure that the folks in Harrod’s Creek know what tofu is. I do know that the idea of creating a barbecue sauce without bourbon in it is a culinary faux pas. Splash tamarind on a couple of dead squirrels parked above the coals, and the friends of Daniel Boone may skin the offender and think about grillin’ something larger than a squirrel.
The author who is scoring the tofu and broccoli treat reports:
I test it again and again. Finally I just slather my plate in the stuff. It’s delicious–the best way I can describe it is as a Thai mustard sauce, or maybe the middle point between a BBQ sauce and a curry. Does that sound gross? I assure you that it isn’t…But as I mop my plate of the last drips of Bengali Butternut BBQ Sauce, contemplating the difference between a future in which computers addict us to the next Lean Cuisine and one where they attempt to eradicate us with Terminators, Napoleon’s old adage comes to mind: An army marches on its stomach. He–or that–who controls our stomachs controls it all.
Yes. From game show win to a tofu topping, IBM Watson is redefining search, corporate strategy, and the vocabulary of cuisine for tofu and broccoli lovers. Kentucky’s freshly killed, skinned, and grilled squirrel may not benefit.
To anyone who suggests that vendors of information retrieval technology have lost their keen marketing edge: you are not in touch with butternut squash and reality. Should the digital chefs put Kentucky bourbon in Bengali Butternut BBQ Sauce? Myron Mixon, the winningest man in barbecue, may say, “That’s what I am talkin’ about for my whole hog.” Could IBM sponsor the barbecue cook off program? Mr. Mixon may be a lover of tamarind and tofu too.
Stephen E Arnold, May 30, 2014
Watson on the Move: Cognea
May 20, 2014
I wanted to associate Cognos with Cognea. Two different things. IBM’s Watson unit, according to “IBM Watson Acquires Artificial Intelligence Startup Cognea,” is beefing up its artificial intelligence capabilities. Facebook, Google, and other outfits are embracing the dreams of artificial intelligence like it is 1981, when Marvin Weinberger was giving talks about AI’s revolutionizing information processing. I have lost track of Marvin, although I recall his impassioned polemics 30 years after hearing him lecture. Unfortunately, I remain skeptical about “artificial intelligence” because Watson, as I understood the pitch after Jeopardy, was already super smart. I suppose Cognea can add some marketing credibility to Watson. That system is curing disease and performing wonders for the insurance industry, if I embrace the IBM public relations flow.
In my lectures about the Big O problem, I point out that many of today’s smartest systems (for example, Search2, to name one) implement clever methods to make well known numerical recipes run like a teenager who just gulped three cans of Jolt Cola followed by a Red Bull energy drink.
The reality is that there are more sophisticated mathematical tools available. The problem is that the systems available cannot exploit these algorithmic methods. I am pretty confident that Cognea tells a great story. I am even more confident that IBM will do the “Vivisimo” thing with whatever technology Cognea actually has. Without a concrete demo, benchmarks, and independent evaluations, I will remain skeptical about “a cognitive computing and conversational artificial intelligence platform.”
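As a minimal sketch of the kind of “clever method” I mean, here is the same well known recipe, cosine similarity, computed naively and then restructured so the norms are computed once and the per-document loop collapses into a single matrix multiply. The collection size and vector width are illustrative.

```python
# A minimal sketch: one well known recipe (cosine similarity), two
# implementations with very different costs but identical outputs.
import numpy as np

rng = np.random.default_rng(0)
docs = rng.random((5_000, 128))  # 5,000 document vectors
query = rng.random(128)

def naive(query, docs):
    # One dot product and two norms per document, in a Python loop.
    return [(query @ d) / (np.linalg.norm(query) * np.linalg.norm(d))
            for d in docs]

def vectorized(query, docs):
    # Norms computed once; the loop becomes a single matrix multiply.
    return (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

assert np.allclose(naive(query, docs), vectorized(query, docs))
```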
I am far more interested in the Cybertap technology that IBM acquired and seems to be keeping under wraps. Cybertap works. Artificial intelligence, well, it depends on how one defines “artificial” and “intelligence” doesn’t it?
Stephen E Arnold, May 20, 2014