November 23, 2014
This morning I thought briefly about “Profanity Laced Academic Paper Exposes Scam Journal.” The Slashdot item comments about a journal write up filled with nonsense. The paper was accepted by the International Journal of Advanced Computer Technology. I have received requests for papers from similar outfits. I am not interested in getting on a tenure track. The notion of my paying someone to publish my writings does not resonate. I either sell my work or give it away in this blog or one of the others I have available to me.
The question in my mind ping ponged between two different ways to approach this “pay to say” situation.
First, the authors who are involved in academic pursuits: “Are these folks trying to get the prestige that comes from publishing in an academic journal?” My hunch is that the motivation is similar to the force that drives the fake data people.
Second, has the search engine optimization crowd infected otherwise semi-coherent individuals with the idea that a link, any link, is worth money?
Indexing systems have a spotty record of identifying weaponized, shaped, or distorted information. The fallback position for many vendors is that by processing large volumes of information, the outliers can be easily tagged and either ignored or disproved.
Sounds good. Does it work? Nope. The idea that open source content is “accurate” may be a false assumption. You can run queries on Bing, iSeek, Google, and Yandex for yourself. Check out information related to the Ebola epidemic or modern fighter aircraft. What’s correct? What’s hoo hah? What’s downright craziness? What’s filtered? Figuring out what to accept as close to the truth is expensive and time consuming. Not part of today’s business model in most organizations I fear.
Stephen E Arnold, November 23, 2014
November 20, 2014
The product article for SAS Text Miner on SAS Products offers some insight into the new element of SAS Enterprise Miner. SAS acquired Teragram and that “brand” has disappeared. Some of the graphics on the Text Miner page are reminiscent of SAP Business Objects’ Inxight look. The overview explains,
“SAS Text Miner provides tools that enable you to extract information from a collection of text documents and uncover the themes and concepts that are concealed in them. In addition, you can combine quantitative variables with unstructured text and thereby incorporate text mining with other traditional data mining techniques. SAS Text Miner is a component of SAS Enterprise Miner. SAS Enterprise Miner must be installed on the same machine.”
New features and enhancements for the Text Miner include support for English and German parsing and new functionality. For more information about the Text Miner, visit the Support Community available for users to ask questions and discover the best approaches for the analysis of unstructured data. SAS was founded in 1976 after the software was created at North Carolina State University for agricultural research. As the software developed, various applications became possible, and the company gained customers in pharmaceuticals, banks and government agencies.
Chelsea Kerwin, November 20, 2014
November 16, 2014
I had a conversation last week with a quite assured expert in content processing. I mentioned that I was 70 years old and would not be attending a hippy dippy conference in New York. I elicited a chuckle.
I thought of this gentle dismissal of old stuff when I read “Old Scientific Papers Never Die, They Just Fade Away. Or They Used to.” The main idea of the article seems to be that “old” work can provide some useful factoids for the 20 somethings and 35 year old whiz kids who wear shirts with an unclothed female on them. Couple a festive shirt with a tattoo, and you have a microcosm of the specialists inventing the future.
Here’s a passage I noted:
“Our [Googlers] analysis indicates that, in 2013, 36% of citations were to articles that are at least 10 years old and that this fraction has grown 28% since 1990,” say Verstak and co. What’s more, the increase in the last ten years is twice as big as in the previous ten years, so the trend appears to be accelerating.
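The arithmetic behind that passage can be checked in a couple of lines. This is only a sketch: it assumes the “28%” is relative growth in the fraction rather than percentage points, and the function name is made up for illustration.

```python
def implied_earlier_fraction(current_fraction, relative_growth):
    """Back out the starting fraction from the current one.

    Assumption: the growth figure is relative (current = earlier * (1 + growth)),
    not a change in percentage points.
    """
    return current_fraction / (1 + relative_growth)


# Figures taken from the quoted passage: 36% in 2013, up 28% since 1990.
fraction_1990 = implied_earlier_fraction(0.36, 0.28)
print(f"Implied 1990 share of citations to older articles: {fraction_1990:.1%}")
```

If the assumption holds, roughly 28 percent of citations in 1990 went to articles at least ten years old.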
Quite an insight considering that much of the math used to deliver whizzy content processing is a couple of centuries old. I looked for a reference to Dr. Gene Garfield and did not notice one. Well, maybe he’s too old to be remembered. Should I send a link to the 20 something with whom I spoke? Nah, waste of time.
Stephen E Arnold, November 16, 2014
November 11, 2014
Through the News section of their website, eDigitalResearch announces a new partnership in, “eDigitalResearch Partner with Lexalytics on Real-Time Text Analytics Solution.” The two companies are integrating Lexalytics’ Salience analysis engine into eDigital’s HUB analysis and reporting interface. The write-up tells us:
“By utilising and integrating Lexalytics Salience text analysis engine into eDigitalResearch’s own HUB system, the partnership will provide clients with a real-time, secure solution for understanding what customers are saying across the globe. Able to analyse comments from survey responses to social media – in fact any form of free text – eDigitalResearch’s HUB Text Analytics will provide the power and platform to really delve deep into customer comments, monitor what is being said and alert brands and businesses of any emerging trends to help stay ahead of the competition.”
Based in Hampshire, U.K., eDigitalResearch likes to work closely with their clients to produce the best solution for each. The company began in 1999 with the launch of the eMysteryShopper, a novel concept at the time. As of this writing, eDigitalResearch is looking to hire a developer and senior developer (in case anyone here is interested.)
Founded in 2003, Lexalytics is proud to have brought the first sentiment analysis engine to market. Designed to integrate with third-party applications, their text analysis software is chugging along in the background at many data-related companies. Lexalytics is headquartered in Amherst, Massachusetts.
Cynthia Murrell, November 11, 2014
November 6, 2014
Here’s an interesting development from the world of text-processing technology. GeekWire reports, “Microsoft and Amazon Vets Form Textio, a New Startup Looking to Discover Patterns in Documents.” The new company expects to release its first product next spring. Writer John Cook tells us:
“Kieran Snyder, a linguistics expert who previously worked at Amazon and Microsoft’s Bing unit, and Jensen Harris, who spent 16 years at Microsoft, including stints running the user experience team for Windows 8, have formed a new data visualization startup by the name of Textio.
“The Seattle company’s tagline: ‘Turn business text into insights.’ The emergence of the startup was first reported by Re/code, which noted that the Textio tool could be used by companies to scour job descriptions, performance reviews and other corporate HR documents to uncover unintended discrimination. In fact, Textio was formed after Snyder conducted research on gender bias in performance reviews in the tech industry.”
That is an interesting origin, especially amid the discussions about gender that currently suffuse the tech community. Textio sees much room for improvement in text analytics, and hopes to help clients reach insights beyond those competing platforms can divine. CEO Snyder’s doctorate and experience in linguistics and cognitive science should give the young company an edge in the competitive field.
Cynthia Murrell, November 06, 2014
October 30, 2014
Most of the name surfing search experts—like the fellow who sold my content on Amazon without my permission and used my name to boot—will not recall much about Engenium. That’s no big surprise. Altegrity Kroll owns the pioneering company in the value-added indexing business. Altegrity, as you may know, is the owner of the outfit that cleared Edward Snowden for US government work.
I read “Snowden Vetter Altegrity’s Loans Plunge: Distressed Debt”. In that article I learned:
Altegrity Inc., the security firm that vetted former intelligence contractor Edward Snowden, has about six months until it runs out of money as the loss of background-check contracts negate most of a July deal with lenders to extend maturities for five years.
The article reports that “selective default” looms for the company. With the lights flickering at a number of search and content processing firms, I hope that the Engenium technology survives. The system remains a leader in a segment which has a number of parvenus.
Stephen E Arnold, October 30, 2014
October 10, 2014
I recall learning a couple of years ago that Amazon was a great place to store big files. Some of the XML data management systems embraced the low prices and pushed forward with cloud versions of their services.
When I read “Amazon’s DynamoDB Gets Hugely Expanded Free Tier And Native JSON Support,” I formed some preliminary thoughts. The trigger was this passage in the write up:
Is JSON better than XML? Is JSON easier to use than XML? Is JSON development faster than XML? Ask an XML rock star and the answer is probably, “You crazy.” I can hear the guitar riff from Joe Walsh now.
Ask a 20 year old in a university programming class, and the answer may be different. I asked the 20 something sitting in my office about XML and he snorted: “Old school, dude.” I hire only people with respect for their elders, of course.
Here are the thoughts that flashed through my 70 year old brain:
- Is Amazon getting ready to make a push for the customers of Oracle, MarkLogic, and other “real” database systems capable of handling XML?
- Will Amazon just slash prices, take the business, and make the 20 year old in my office a customer for life just because Amazon is “new school”?
- Will Amazon’s developer love provide the JSON fan with development tools, dashboards, features, and functions that push clunky methods like proprietary XQuery messages into a reliquary?
No answers… yet.
Stephen E Arnold, October 10, 2014
October 6, 2014
I read “Objective Announces Partnership with Active Navigation, Helping Organisations Reduce Unstructured Content by 30%.” The angle is different from some content marketing outfits’ approach to search. Instead of dragging out the frequently beaten horse “all available information,” the focus is trimming the fat.
According to the write up:
The Active Navigation and Objective solution, allows organisations to quickly identify the important corporate data from their ROT information, then clean and migrate it into a fully compliant repository
How does the system work? I learned:
Following the Objective Information Audit, in which all data will be fully cleansed, an organisation’s content will be free of ROT information and policy violations. Immediate results from an Objective Information Audit typically deliver storage reductions of 30% to 50%, lowering the total cost of ownership for storage, while improving the performance of enterprise search, file servers, email servers and information repositories.
Humans plus smart software do the trick. In my view, this is an acknowledgement that the combination of subject matter experts plus software delivers a useful solution. The approach does work as long as the humans do not suffer “indexing fatigue” or budget cuts. Working time can also present a challenge. Management can lose patience.
Will organizations embrace an approach familiar to those who used decades old systems? Since some state of the art automated systems have delivered mixed results, perhaps a shift in methods is worth a try.
Stephen E Arnold, October 6, 2014
September 1, 2014
Last week I had a conversation with a publisher who has a keen interest in software that “knows” what content means. Armed with that knowledge, a system can then answer questions.
The conversation was interesting. I mentioned my presentations for law enforcement and intelligence professionals about the limitations of modern and computationally expensive systems.
Several points crystallized in my mind. One of these is addressed, in part, in a diagram created by a person interested in machine learning methods. Here’s the diagram created by SciKit:
The diagram is designed to help a developer select from different methods of performing estimation operations. The author states:
Often the hardest part of solving a machine learning problem can be finding the right estimator for the job. Different estimators are better suited for different types of data and different problems. The flowchart below is designed to give users a bit of a rough guide on how to approach problems with regard to which estimators to try on your data.
First, notice that there is a selection process for choosing a particular numerical recipe. Now who determines which recipe is the right one? The answer is the coding chef. A human exercises judgment about a particular sequence of operations that will be used to fuel machine learning. Is that sequence of actions the best one, the expedient one, or the one that seems to work for the test data? The answer to these questions determines a key threshold for the resulting “learning system.” Stated another way, “Does the person licensing the system know if the numerical recipe is the most appropriate for the licensee’s data?” Nah. Does a mid tier consulting firm like Gartner, IDC, or Forrester dig into this plumbing? Nah. Does it matter? Oh, yeah. As I point out in my lectures, the “accuracy” of a system’s output depends on this type of plumbing decision. Unlike a backed up drain, flaws in smart systems may never be discerned. For certain operational decisions, financial shortfalls or the loss of an operation team in a war theater can be attributed to one of many variables. As decision makers chase the Silver Bullet of smart, thinking software, who really questions the output in a slick graphic? In my experience, darned few people. That includes cheerleaders for smart software, azure chip consultants, and former middle school teachers looking for a job as a search consultant.
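To make the plumbing point concrete, here is a minimal sketch in plain Python. It is not taken from the scikit-learn flowchart; the data points, labels, and function names are invented for illustration. Two perfectly respectable recipes, nearest neighbor and nearest centroid, look at the same training data and the same query and return different answers.

```python
from math import dist
from statistics import mean

# Hypothetical training data: (point, label). One "a" sample sits far from the rest.
TRAINING = [
    ((0.0, 0.0), "a"), ((1.0, 0.0), "a"), ((0.0, 1.0), "a"), ((4.0, 4.0), "a"),
    ((5.0, 5.0), "b"), ((6.0, 5.0), "b"), ((5.0, 6.0), "b"), ((6.0, 6.0), "b"),
]


def nearest_neighbor(query, samples):
    """Recipe 1: return the label of the single closest training point."""
    _, label = min(samples, key=lambda sample: dist(query, sample[0]))
    return label


def nearest_centroid(query, samples):
    """Recipe 2: return the label whose class average is closest to the query."""
    centroids = {}
    for label in sorted({lab for _, lab in samples}):
        points = [pt for pt, lab in samples if lab == label]
        centroids[label] = (mean([p[0] for p in points]),
                            mean([p[1] for p in points]))
    return min(centroids, key=lambda lab: dist(query, centroids[lab]))


query = (3.6, 3.6)
print("nearest neighbor says:", nearest_neighbor(query, TRAINING))
print("nearest centroid says:", nearest_centroid(query, TRAINING))
```

On this toy data the first recipe answers “a” and the second answers “b.” Neither is a bug. The licensee simply never learns which recipe the coding chef picked.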
Second, notice the reference to a “rough guide.” The real guide is understanding of how specific numerical recipes work on a set of data that allegedly represents what the system will process when operational. Furthermore, there are plenty of mathematical methods available. The problem is that some of the more interesting procedures lead to increased computational cost. In a worst case, the more interesting procedures cannot be computed on available resources. Some developers know about P=NP and Big O. Others know to use the same nine or ten mathematical procedures taught in computer science classes. After all, why worry about math based on mereology if the machine resources cannot handle the computations within time and budget parameters? This means that most modern systems are based on a set of procedures that are computationally affordable, familiar, and convenient. Does this similarity of procedures matter? Yep. The generally squirrely outputs from many very popular systems are perceived as completely reliable. Unfortunately, the systems are performing within a narrow range of statistical confidence. Stated more harshly, the outputs are just not particularly helpful.
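The computational cost point can be shown with a back-of-the-envelope sketch. The three growth rates, the input size of 40, and the budget of one billion operations are assumptions picked for illustration, not measurements of any real system.

```python
def operation_counts(n):
    """Rough operation counts for three common growth rates at input size n."""
    return {
        "linear": n,            # O(n)
        "quadratic": n * n,     # O(n^2)
        "exponential": 2 ** n,  # O(2^n)
    }


def affordable(n, budget):
    """Names of the growth rates whose operation count fits within the budget."""
    return [name for name, ops in operation_counts(n).items() if ops <= budget]


# Hypothetical budget: one billion basic operations for a 40-item input.
for name, ops in operation_counts(40).items():
    print(f"{name:12s} {ops:,} operations")
print("fits the budget:", affordable(40, 10**9))
```

With only 40 items the exponential procedure already needs more than a trillion operations, so it drops out and the familiar, cheaper procedures win by default.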
In my conversation with the publisher, I asked several questions:
- Is there a smart system like Watson that you would rely upon to treat your teenaged daughter’s cancer? Or, would you prefer the human specialist at the Mayo Clinic or comparable institution?
- Is there a smart system that you want directing your only son in an operational mission in a conflict in a city under ISIS control? Or, would you prefer the human-guided decision near the theater about the mission?
- Is there a smart system you want managing your retirement funds in today’s uncertain economy? Or, would you prefer the recommendations of a certified financial planner relying on a variety of inputs, including analyses from specialists in whom your analyst has confidence?
When I asked these questions, the publisher looked uncomfortable. The reason is that the massive hyperbole and marketing craziness about fancy new systems creates what I call the Star Trek phenomenon. People watch Captain Kirk talking to devices, transporting himself from danger, and traveling between far flung galaxies. Because a mobile phone performs some of the functions of the fictional communicator, it sure seems as if many other flashy sci-fi services should be available.
Well, this Star Trek phenomenon does help direct some research. But in terms of products that can be used in high risk environments, the sci-fi remains a fiction.
Believing and expecting are different from working with products that are limited by computational resources, expertise, and informed understanding of key factors.
Humans, particularly those who need money to pay the mortgage, ignore reality. The objective is to close a deal. When it comes to information retrieval and content processing, today’s systems are marginally better than those available five or ten years ago. In some cases, today’s systems are less useful.
July 21, 2014
The article titled “Text Analytics Company Linguamatics Boosts Enterprise Search with Semantic Enrichment” on MarketWatch discusses the launch of I2E Semantic Enrichment from Linguamatics. The new release allows for the mining of a variety of texts, from scientific literature to patents to social media. It promises faster, more relevant search for users. The article states,
“Enterprise search engines consume this enriched metadata to provide a faster, more effective search for users. I2E uses natural language processing (NLP) technology to find concepts in the right context, combined with a range of other strategies including application of ontologies, taxonomies, thesauri, rule-based pattern matching and disambiguation based on context. This allows enterprise search engines to gain a better understanding of documents in order to provide a richer search experience and increase findability, which enables users to spend less time on search.”
Whether they are spinning semantics for search, or if it is search spun for semantics, Linguamatics has made their technology available to tens of thousands of users of enterprise search. Representative John M. Brimacombe was straightforward in his comments about the disappointment surrounding enterprise search, but optimistic about I2E. It is currently being used by many top organizations, as well as the Food and Drug Administration.
Chelsea Kerwin, July 21, 2014