China and AI: Activity but Cash Friction
March 18, 2020
Despite being a communist country, China loves money. One way China makes money is through new technology startups, especially AI startups. The South China Morning Post examines the current market for Chinese AI startups in the article, “China’s AI Start-Ups Are Closing More Funding Deals, Yet They’re Still Attracting Less Money Than The US.”
According to the article, Chinese AI startups have attracted more funding deals than their US counterparts, but they are not bringing in as much money as they did in the past. The trade war with the US is a big factor. The US continues to dominate the AI market, although its share has been steadily dropping, from 71% in 2014 to 39% in 2019. The US also hosts more AI fundraisers than China, accounting for 64% of AI startup fundraising, up four points from the previous year, while China had only 11%. Overall, AI startups raised $26.6 billion across the globe in 2019.
China accounted for only $2.9 billion of that total.
The big problem is that China lacks creativity in its AI startups:
“‘Investors are much more cautious now, especially after the second half of 2018 as we sensed a bubble – most AI projects haven’t produced strong performance,’ Chris Lai, a partner at Beijing-based Shunwei Capital, said. ‘We haven’t seen many AI applications that are beyond imagination, most are used in surveillance cameras.’”
There are some startups that add spice to the Chinese AI market and bolster the hope that the country will be an AI world leader by 2030. These include Horizon Robotics and the facial recognition startup Face++. The market is predicted to shift toward vision and hearing applications in smart home appliances.
China might be moving toward more smart home appliances, but it is China, for goodness’ sake. The country has an authoritarian government, so it wants more AI security technology to track its citizens.
Whitney Grace, March 18, 2020
STM Publishing: A Cross Road or a Cross to Repurpose
March 18, 2020
The coronavirus disease, officially known as COVID-19, has upended the world. In the face of death, the world has shown its best and worst sides. Despite the global pandemic, society keeps chugging forward and humans are forced to adapt. Humans are washing their hands more and businesses are actively allowing their employees to telecommute. The biggest benefit is that the medical and science fields are actively pooling their knowledge to find a cure and create a COVID-19 vaccine. If profit were the main goal, however, COVID-19 knowledge would be sold to the highest bidder. The Los Angeles Times explains how for-profit science publishing could end in “COVID-19 Could Kill The For-Profit Science Publishing Model. That Would Be A Good Thing.”
Sharing scientific research information in real time is not standard practice; it is an exception. The information about SARS-CoV-2 (the virus that causes COVID-19) on PubMed now amounts to more than four hundred articles. More information is supposed to help in a crisis.
The US government, however, does not follow the belief that more information is better. The Centers for Disease Control and Prevention canceled a briefing with infectious disease expert Nancy Messonnier. The CDC Web site also removed information about the number of people tested for coronavirus. Knowing how many people have been tested and infected helps determine how fast the virus is spreading.
The COVID-19 crisis shows how information circulates among medical professionals in an emergency:
“What’s most intriguing about the effect of the COVID-19 crisis on the distribution of scientific research is what it says about the longstanding research publication model: It doesn’t work when a critical need arises for rapid dissemination of data — like now.
The prevailing model today is dominated by for-profit academic publishing houses such as Elsevier, the publisher of such high-impact journals as Cell and the Lancet, and Springer, the publisher of Nature. But it’s under assault by universities and government agencies frustrated at being forced to pay for access to research they’ve funded in the first place.”
Springer, Elsevier, and other commercial scientific publishers have suspended their paywalls on coronavirus information. They explain that the open access will last only for the length of the outbreak and will not apply to other research. Researchers, however, want open access to everything.
The publishers explain their reasons for paywalls and for keeping information under lock and key, but researchers, librarians, scientists, and other experts want scientific information shared. Not sharing information, especially about diseases, is not beneficial. China cracked down on coverage of the coronavirus outbreak in its media and also locked up its scientific research. This prevented the rest of the world from knowing the true extent of the pandemic and even the virus’s origins.
STM publishing? Does the future embrace the models refined since the 17th century?
Whitney Grace, March 18, 2020
STM Publishers: The White House, NAS, and WHO Created a Content Collection! What?
March 17, 2020
DarkCyber is not working with a science, technology, or medical professional publishing outfit. Sure, my team and I did in the pre-retirement past. But the meetings which focused on cutting costs and boosting subscription prices were boring.
The interesting professional publisher meetings, the ones that explored changing incentive plans to motivate a Pavlovian-responsive lawyer or accountant to achieve 10-10-20, were fun. (That means 10% growth, 10% cost reduction, and 20% profit.)
I am not sure how I got involved in these projects. I was a consultant, had written a couple of books, and was giving lectures with jazzy titles; for example, “The Future of the Datasphere,” “Search Is a Failure,” and “The Three R’s: Relationships, Rationality, and Revolution.” (Some of these now wonky talks are still available on the www.arnoldit.com Web site. Have at it, gentle reader.)
Have professional publishers of STM content received the millstone around the neck award?
This morning I hypothesized about how the professional publishing companies selling subscriptions to expensive journals would react to the news story “Microsoft, White House, and Allen Institute Release Coronavirus Data Set for Medical and NLP Researchers.” I learned:
The COVID-19 Open Research Dataset (CORD-19), a repository of more than 29,000 scholarly articles on the coronavirus family from around the world, is being released today for free. The data set is the result of work by Microsoft Research, the Allen Institute for AI, the National Library of Medicine at the National Institutes of Health (NIH), the White House Office of Science and Technology (OSTP), and others and includes machine-readable research from more than 13,000 scholarly articles. The aim is to empower the medical and machine learning research communities to mine text data for insights that can help fight COVID-19.
The most striking allegedly accurate factoid from the write up: No mention of the professional publishers who “create” and are the prime movers of journal articles. (The actual work, of course, comes from authors, graduate students, academicians, scholars, and peer review ploughmen and plough women.) Yes, professional publishing is sui generis.
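For readers who want to poke at the collection itself rather than read about it, here is a minimal sketch of the sort of first pass a text miner might make. It assumes the release ships with a metadata.csv manifest containing title and abstract columns; the file name and column names are my assumptions based on the announcement’s description, not something DarkCyber has verified.

```python
# Minimal first pass over the CORD-19 metadata file.
# Assumptions: the release includes a metadata.csv with "title" and
# "abstract" columns; adjust the path and column names to the actual
# download if they differ.
import pandas as pd

meta = pd.read_csv("metadata.csv", low_memory=False)
print(f"{len(meta):,} records in the manifest")

# Keep only records that actually carry an abstract worth mining.
has_abstract = meta.dropna(subset=["abstract"])
print(f"{len(has_abstract):,} records include an abstract")

# Crude keyword filter: which abstracts mention transmission?
hits = has_abstract[has_abstract["abstract"].str.contains(
    "transmission", case=False, na=False)]
print(f"{len(hits):,} abstracts mention 'transmission'")
print(hits["title"].head(10).to_string(index=False))
```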
Several observations:
- Did I miss the forward leaning contributions of the professional publishing community responsible for these STM documents and data sets?
- Are the professional publishers’ lawyers now gearing up for a legal action against these organizations and institutions creating a free content collection?
- Why didn’t one of the many professional publishing organizations, entities, and lobbying groups take the lead in creating the collection? The virus issue has been chugging along for months.
DarkCyber finds the go-getters behind the content collection a diverse group. Some of the players may be difficult to nail with a breach of licensing or copyright filing. If the article is true and the free assertion is a reality, has an important milestone been passed? Has a millstone been strapped to the neck of each of the STM professional publishing companies? Millstones are to be turned by the professional publishing content producers, not by upstarts like the White House and the World Health Organization.
Not as good as a Netflix show but good for a quick look.
Stephen E Arnold, March 17, 2020
The Problem of Too Much Info
March 17, 2020
The belief is that the more information one has, the better the decision one can make. Is this really true? The Eurasia Review shares how too much information might be a bad thing in the article, “More Information Doesn’t Necessarily Help People Make Better Decisions.”
According to the Stevens Institute of Technology, too much knowledge can cause people to make worse decisions. The finding points to a critical gap: people struggle to assimilate new information with their past knowledge and beliefs. Samantha Kleinberg, Associate Professor of Computer Science at the Stevens Institute, is studying the phenomenon, using AI and machine learning to investigate how financial advisors and healthcare professionals communicate information to their clients. She discovered:
“ ‘Being accurate is not enough for information to be useful,’ said Kleinberg. ‘It’s assumed that AI and machine learning will uncover great information, we’ll give it to people and they’ll make good decisions. However, the basic point of the paper is that there is a step missing: we need to help people build upon what they already know and understand how they will use the new information.’
For example: when doctors communicate information to patients, such as recommending blood pressure medication or explaining risk factors for diabetes, people may be thinking about the cost of medication or alternative ways to reach the same goal. ‘So, if you don’t understand all these other beliefs, it’s really hard to treat them in an effective way,’ said Kleinberg, whose work appears in the Feb. 13 issue of Cognitive Research: Principles and Implications.”
Kleinberg and her team studied 4,000 participants’ decision-making processes across scenarios ranging from ones they would be familiar with to ones they would not. When confronted with an unusual problem, participants focused on the problem without any extra knowledge, but when asked to deal with a familiar scenario such as healthcare or finances, their prior knowledge got in the way.
Information overload, and the inability to merge old information with the new, is a problem. How do you fix it? Your guess is as good as mine.
Whitney Grace, March 17, 2020
Need a List of Hacker Handles?
March 17, 2020
Just a quick note. Navigate to Black Hat Pro Tools, and click on “Community,” then “Members.” The site provides a tidy list of several thousand hacker handles. Here’s an example, including three identities associated with “Elite Team”:
What’s the value of these? Some hackers, just like regular people, reuse their online names or portions of those names. With the right investigative tools, one can pinpoint other related and sometimes interesting information. Black Hat Pro Tools does not require special software to visit.
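How does handle reuse turn into leads? Below is a minimal sketch of the kind of pivot an investigator might script. The handle lists are invented for illustration, and the matching is plain standard-library string similarity, not any particular vendor’s investigative tool.

```python
# Hypothetical illustration: compare handles scraped from one forum
# against identities seen elsewhere and flag likely reuse.
# The names below are invented; the matching is simple string
# similarity from the standard library.
from difflib import SequenceMatcher

forum_handles = ["darkfalcon99", "xS0rceress", "elite_team_admin"]
other_site_accounts = ["DarkFalcon", "sorceress_x", "quietlurker42"]

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for handle in forum_handles:
    for account in other_site_accounts:
        score = similarity(handle, account)
        if score >= 0.6:  # threshold chosen for illustration only
            print(f"possible reuse: {handle!r} ~ {account!r} ({score:.2f})")
```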
Stephen E Arnold, March 17, 2020
AI: Big Hat, Some Cattle
March 17, 2020
Andreessen-Horowitz recently published the article “The New Business Of AI (And How It’s Different From Traditional Software),” which pulls back the curtain on AI startups. Locklin On Science delves further into AI startups with the aptly named post “Andreessen-Horowitz Craps On ‘AI’ Startups From A Great Height.” AI startups are similar to other startups in that there is a lot of hype over a subpar product.
The biggest mistake people make is conflating AI with machine learning. Machine learning is the basis for most of what is sold as AI, and the terms should not be used interchangeably. Another problem is the assumption that AI can be treated like traditional software; this is far from the truth. AI software requires cloud infrastructure, which carries mounds of hidden and associated costs. Businesses also believe that once they launch an AI project, humans are out of the equation. Nope!
“Everyone in the business knows about this. If you’re working with interesting models, even assuming the presence of infinite accurately labeled training data, the “human in the loop” problem doesn’t ever completely go away. A machine learning model is generally “man amplified.” If you need someone (or, more likely, several someone’s) making a half million bucks a year to keep your neural net producing reasonable results, you might reconsider your choices. If the thing makes human level decisions a few hundred times a year, it might be easier and cheaper for humans to make those decisions manually, using a better user interface.”
AI and machine learning startups are also SaaS companies disguised as software businesses. They might appear to offer a one-time, out-of-the-box solution that only requires the occasional upgrade, but that is a giant fib. Machine learning can have a huge ROI, but all the factors need to be weighed before it is implemented. Machine learning and AI technology is the most advanced software on the market, and thus the most expensive. It might be better to invest in established software and experienced humans before trying to set foot in the future.
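The half-million-dollar babysitter point in the quoted passage is easy to put into numbers. The figures below simply reuse the quote’s own illustrative values (a $500,000-a-year specialist, a few hundred human-level decisions a year); they are not data about any actual startup.

```python
# Back-of-the-envelope cost per decision, using the quote's own
# illustrative numbers: one $500,000/year specialist kept in the loop
# so the model produces reasonable results.
keeper_salary = 500_000        # annual cost of the human in the loop
decisions_per_year = 300       # "a few hundred" human-level decisions

cost_per_decision = keeper_salary / decisions_per_year
print(f"Cost per model-assisted decision: ${cost_per_decision:,.0f}")
# Roughly $1,667 per decision before cloud, tooling, or labeling costs,
# which is why a human with a better user interface may be cheaper.
```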
Whitney Grace, March 17, 2020
Content for Deep Learning: The Lionbridge View
March 17, 2020
Here is a handy resource. Lionbridge AI shares “The Best 25 Datasets for Natural Language Processing.” The list is designed as a starting point for those just delving into NLP. Writer Meiryum Ali begins:
“Natural language processing is a massive field of research. With so many areas to explore, it can sometimes be difficult to know where to begin – let alone start searching for data. With this in mind, we’ve combed the web to create the ultimate collection of free online datasets for NLP. Although it’s impossible to cover every field of interest, we’ve done our best to compile datasets for a broad range of NLP research areas, from sentiment analysis to audio and voice recognition projects. Use it as a starting point for your experiments, or check out our specialized collections of datasets if you already have a project in mind.”
The suggestions are divided by purpose. For use in sentiment analysis, Ali notes one needs to train machine learning models on large, specialized datasets like the Multidomain Sentiment Analysis Dataset or the Stanford Sentiment Treebank. Some text datasets she suggests for natural language processing tasks like voice recognition or chatbots include 20 Newsgroups, the Reuters News Dataset, and Princeton University’s WordNet. Audio speech datasets that made the list include the audiobooks of LibriSpeech, the Spoken Wikipedia Corpora, and the Free Spoken Digit Dataset. The collection concludes with some more general-purpose datasets, like Amazon Reviews, the Blogger Corpus, the Gutenberg eBooks List, and a set of questions and answers from Jeopardy. See the write-up for more on each of these entries as well as the rest of Ali’s suggestions in each category.
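As a concrete starting point, here is a minimal sketch of the kind of experiment the list invites: pull one of the named text datasets (20 Newsgroups, which ships with scikit-learn) and fit a simple bag-of-words classifier. This is a generic beginner exercise, not a Lionbridge recipe.

```python
# Minimal NLP starter using the 20 Newsgroups dataset named in the list.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Download (or load cached) train and test splits, stripped of headers
# so the model learns from the message text rather than metadata.
train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

# Turn documents into TF-IDF vectors and fit a simple linear classifier.
vectorizer = TfidfVectorizer(max_features=50_000, stop_words="english")
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train.target)

print("test accuracy:", accuracy_score(test.target, clf.predict(X_test)))
```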
This being a post from Lionbridge, an AI training data firm, it naturally concludes with an invitation to contact them when ready to move beyond these pre-made datasets to one customized for you. Based in Waltham, Massachusetts, the company was founded in 1996 and acquired by H.I.G. Capital in 2017.
Cynthia Murrell, March 17, 2020
Microsoft Teams: Demand-Centric Scaling a Problem?
March 16, 2020
Quick item. DarkCyber noted two separate write ups which seem to suggest that Microsoft Azure has some fascinating characteristics. “Microsoft Teams Goes Down Just as Europe Logs On to Work Remotely” says “Two hours of issues as many work from home during the coronavirus pandemic.” VentureBeat says “Microsoft Teams Struggles As Coronavirus Pushes Millions to Work from Home.” DarkCyber looks forward to verification that an outage took place. Also, what happens if the proposed Microsoft JEDI solution demonstrates the same behavior in an even more critical situation?
Stephen E Arnold, March 16, 2020
Open Source Weaponized: Can Amazon Dent the Future of Target and Walmart?
March 16, 2020
“Amazon Courts Walmart, Target to Join Cashierless Tech Group” is interesting but cut loose from the type of footnotes, named sources, and back up data some find helpful. Plus the WSJ states: “Retailers don’t yet plan to participate, but talks highlight Amazon’s ambition to have others adopt its technology.” If accurate, this is a page from the Amazon policeware / blockchain playbook. (For a free summary of DarkCyber’s Amazon policeware report, fill in the request form at this link.)
Amazon’s online bookstore bulldozer is revving through supply chain and demand mud. Despite the overheating of the big diesel engine, the S-Team is not resting on its laurels.
According to the Murdoch-inspired newspaper, “sources” have revealed “Amazon is making some of the software that underpins its Go stores available through an organization called Dent.” The idea seems to be that some of the technology would be open source.
DarkCyber finds the sourceless news interesting. Let’s assume that the write up is 100 percent accurate. Why give away a technology that could make Amazon’s AWS system some money? How open source is the Bezos bulldozer? What bits and pieces of digital connective tissue will be needed to make the open source technology work?
There are no answers to these questions. DarkCyber has formulated some other questions, and these also cannot be answered in a definitive way. Let’s look at these:
- Is Amazon’s use of open source a weaponization of the core ideas of open source software?
- How will the open source community respond to Amazon’s alleged embrace of open source?
- What type of pressure will Amazon’s open source play, if the WSJ sources’ characterization is indeed accurate, put on its competitors?
- What does Amazon gain by making Target and Walmart look like outfits who don’t want to ride the Bezos bulldozer?
Net net: If the WSJ story is accurate, DarkCyber will have to reassess Amazon’s willingness to use certain types of digital data as a weapon. As with many weapons, caution is usually prudent. Mishandling can make downstream events in a chain quite interesting. Just Walk Out may garner a new connotation.
Stephen E Arnold, March 16, 2020
IslandInText Reborn: TLDRThis
March 16, 2020
Many years ago (maybe 25+), we tested a desktop summarization tool called IslandInText. [#1 below] I believe, if my memory is working today, this was software developed in Australia by Island Software. There was a desktop version and a more robust system for large-scale summarizing of text. In the 1980s, there was quite a bit of interest in automatic summarization of text. Autonomy’s system could be configured to generate a précis if one was familiar with that system. Google’s basic citation is a modern version of what smart software can do to suggest what’s in a source item. No humans needed, of course. Too expensive and inefficient for the big folks I assume.
For many years, human abstract and indexing professionals were on staff. Our automated systems, despite their usefulness, could not handle nuances, special inclusions in source documents like graphs and tables, list of entities which we processed with the controlled term MANYCOMPANIES, and other specialized functions. I would point out that most of today’s “modern” abstracting and indexing services are simply not as good as the original services like ABI / INFORM, Chemical Abstracts, Engineering Index, Predicasts, and other pioneers in the commercial database sector. (Anyone remember Ev Brenner? That’s what I thought, gentle reader. One does not have to bother oneself with the past in today’s mobile phone search expert world.)
For a number of years, I worked in the commercial database business. In order to speed the throughput of our citations for pharmaceutical, business, and other topic domains, machine text summarization was of interest to me and my colleagues.
A reader informed me that a new service is available. It is called TLDRThis. Here’s what the splash page looks like:
One can paste text or provide a url, and the system returns a synopsis of the source document. (The advanced service generates a more in-depth summary, but I did not test this. I am not too keen on signing up without knowing what the terms and conditions are.) There is a browser extension for the service. For this url, the system returned this summary:
Enterprise Search: The Floundering Fish!
Stephen E. Arnold Monitors Search,Content Processing,Text Mining,Related Topics His High-Tech Nerve Center In Rural Kentucky.,He Tries To Winnow The Goose Feathers The Giblets. He Works With Colleagues,Worldwide To Make This Web Log Useful To Those Who Want To Go,Beyond Search . Contact Him At Sa,At,Arnoldit.Com. His Web Site,With Additional Information About Search Is | Oct 27, 2011 | Time Saved: 5 mins
- I am thinking about another monograph on the topic of “enterprise search.” The subject seems to be a bit like the motion picture protagonist Jason.
- The landscape of enterprise search is pretty much unchanged.
- But the technology of yesterday’s giants of enterprise search is pretty much unchanged.
- The reality is that the original Big Five had and still have technology rooted in the mid to late 1990s.
We noted several positive functions; for example, identifying the author and providing a synopsis of the source, even the goose feathers’ reference. On the downside, the system missed the main point of the article; that is, enterprise search has been a bit of a chimera for decades. Also, the system ignored the entities (company names) in the write up. These are important in my experience. People search for names, concepts, and events. The best synopses capture some of the entities and tell the reader to get the full list and other information from the source document. I am not sure what to make of TLDRThis’s display of a picture which makes zero sense without the context of the full article. I fed the system a PDF, which did not compute, and I tried a bit.ly link, which generated a request to refresh the page, not the summary.
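TLDRThis does not document its method, so one can only guess at the plumbing. For context, the classic extractive approach (the sort of thing IslandInText-era tools and many modern services lean on) scores sentences by the frequency of their content words and keeps the top handful. The sketch below is that generic technique in plain Python; it is not TLDRThis’s actual algorithm.

```python
# Generic frequency-based extractive summarizer, standard library only.
# Illustrates the classic technique; it is not TLDRThis's method.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
             "are", "was", "that", "this", "it", "for", "on", "with"}

def summarize(text: str, max_sentences: int = 3) -> str:
    # Naive sentence split on terminal punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Score words by frequency, ignoring stopwords and case.
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if w not in STOPWORDS)
    # A sentence's score is the sum of its content-word frequencies.
    def score(sentence: str) -> int:
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))
    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    # Re-emit the chosen sentences in their original order.
    return " ".join(s for s in sentences if s in top)

if __name__ == "__main__":
    sample = ("Enterprise search has been promised for decades. "
              "Vendors repackage old technology under new labels. "
              "Users still cannot find their own documents.")
    print(summarize(sample, max_sentences=2))
```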
To get an “advanced summary”, one must sign up. I did not choose to do that. I have added this site to our “follow” list. I will make a note to try and find out who developed this service.
Pricing ranges from free for basic summarization to several paid tiers. The Bronze level, at $60 per year, includes 100 summaries per month plus “exclusive features” that are coming soon. A $10 per month tier includes 300 summaries a month and the same forthcoming “exclusive features.” The Platinum service, at $20 per month, includes 1,000 summaries per month; these are “better” and will include forthcoming advanced features.
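For what the quotas are worth, a quick bit of arithmetic on the listed tiers (assuming a subscriber actually uses the full monthly allotment) gives the effective price per summary:

```python
# Effective cost per summary at full quota use, from the listed tiers.
# Assumes every allotted summary is used; real usage will be lower.
tiers = {
    "Bronze ($60/year)":    (60 / 12, 100),   # monthly price, summaries/month
    "$10/month tier":       (10, 300),
    "Platinum ($20/month)": (20, 1_000),
}
for name, (monthly_price, quota) in tiers.items():
    print(f"{name}: ${monthly_price / quota:.3f} per summary")
```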
Stay tuned.
[#1 ] In the early 1990s, search and retrieval was starting to move from the esoteric world of commercial databases to desktop and UNIX machines. IslandSoft, founded in 1993, offered a search and retrieval system. My files from this time revealed that IslandSoft’s description of its system could be reused by today’s search and retrieval marketers. Here’s what IslandSoft said about InText:
IslandInTEXT is a document retrieval and management application for PCs and Unix workstations. IslandInTEXT’s powerful document analysis engine lets users quickly access documents through plain English queries, summarize large documents based on content rather than key words, and automatically route incoming text and documents to user-defined SmartFolders. IslandInTEXT offers the strongest solution yet to help organize and utilize information with large numbers of legacy documents residing on PCs, workstations, and servers as well as the proliferation of electronic mail documents and other data. IslandInTEXT supports a number of popular word processing formats including IslandWrite, Microsoft Word, and WordPerfect plus ASCII text.
IslandInTEXT Includes:
- File cabinet/file folder metaphor.
- HTML conversion.
- Natural language queries for easily locating documents.
- Relevancy ranking of query results.
- Document summaries based on statistical relevance from 1 to 99% of the original document—create executive summaries of large documents instantly. [This means that the user can specify how detailed the summarization was; for example, a paragraph or a page or two.]
- Summary Options. Summaries can be based on key word selection, key word ordering, key sentences, and many more.
- SmartFolder Routing. Directs incoming text and documents to user-defined folders.
- Hot Link Pointers. Allow documents to be viewed in their native format without creating copies of the original documents.
- Heuristic/Learning Architecture. Allows InTEXT to analyze documents according to the author’s style.
A page for InText is still online as of today at http://www.intext.com/. The company appears to have ceased operations in 2010. Data in my files indicate that the name and possibly the code is owned by CP Software, but I have not verified this. I did not include InText in the first edition of the Enterprise Search Report, which I wrote in 2003 and 2004. The company had fallen behind market leaders Autonomy, Endeca, and Fast Search & Transfer.
I am surprised at how many search and retrieval companies today are just traveling along well worn paths in the digital landscape. Does search work? Nope. That’s why there are people who specialize, remember things, and maintain personal files. Mobile device search means precision and recall are digital dodo birds in my opinion.
Stephen E Arnold, March 16, 2020