Microsoft Azure Search Documentation
September 2, 2014
Microsoft has posted information about its Azure Search service; the details are available at Azure Search Preview. The features remind me of Amazon’s cloud search approach.
The idea is that search is available as a service. The “How It Works” section summarizes the procedures the customer follows. The approach is intended for engineers familiar with Microsoft conventions or for a consultant capable of performing the required steps.
Of particular interest to potential licensees will be the description of the pricing options. The Preview Pricing Details page uses an Amazon-like approach as well; for example, combinable search units. For higher-demand implementations, Microsoft provides a custom price quote. The preview prices reflect a 50 percent discount.
Microsoft offers different “editions” of Azure Search. Microsoft says:
Free is a free version of Azure Search designed to provide developers a sandbox to test features and implementations of Search. It is not designed for production workloads. Standard is the go-to option for building applications that benefit from a self-managed search-as-a-service solution. Standard delivers storage and predictable throughput that scales with application needs. For very high-demand applications, please contact azuresearch_contact@microsoft.com.
Support and service level agreements are available. A pricing calculator is available. Note that the estimates are not for search alone. Additional pricing information points to a page with four categories of fees and more than two dozen separate services. The link to Azure Search Pricing is self-referential, which is interesting to me.
I was not able to locate an online demo of the service. I was invited to participate in a free trial.
If you are interested in the limits for the free trial, Microsoft provides some information in its “Maximum Limits for Shared (Free) Search Service.”
Based on the documentation, correctly formed content, once uploaded, supports full text search, facets, and hit highlighting. Specific functionalities are outlined on this reference page.
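For a sense of what querying the service might look like, here is a minimal sketch against the preview REST interface. This is an assumption-laden illustration, not a confirmed recipe: the service name, index name, API key, api-version string, and the facet and highlight parameter names are placeholders drawn from my reading of the preview documentation.

```python
# Minimal sketch of a query against an Azure Search (preview) index over REST.
# Service name, index name, key, api-version, and parameter names are assumptions.
import requests

SERVICE = "my-service"               # hypothetical search service name
INDEX = "products"                   # hypothetical index name
API_KEY = "YOUR-QUERY-KEY"           # query key issued by the service
API_VERSION = "2014-07-31-Preview"   # assumed preview api-version string

url = f"https://{SERVICE}.search.windows.net/indexes/{INDEX}/docs"
params = {
    "api-version": API_VERSION,
    "search": "hiking boots",        # full text search terms
    "facet": "brand",                # request facet counts on a field (assumption)
    "highlight": "description",      # request hit highlighting on a field (assumption)
}
headers = {"api-key": API_KEY}

response = requests.get(url, params=params, headers=headers)
response.raise_for_status()
for doc in response.json().get("value", []):
    print(doc)
```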
Net net: The search system is developer-centric.
Stephen E Arnold, September 2, 2014
Big Data Should Mean Big Quality
September 2, 2014
Why does logic seem to fail in the face of fancy jargon? DataFusion’s Blog posted on the jargon fallacy in the post, “It All Begins With Data Quality.” The post explains how, amid new terms like big data, real-time analytics, and self-service business intelligence, the basic fundamentals that make the technology work are forgotten. Cleansing, data capture, and governance form the foundation for data quality. Without data quality, big data software is useless. According to a recent Aberdeen Group study, data quality was ranked as the most important data management function.
Data quality also leads to other benefits:
“When examining organizations that have invested in improving their data, Aberdeen’s research shows that data quality tools do in fact deliver quantifiable improvements. There is also an additional benefit: employees spend far less time searching for data and fixing errors. Data quality solutions provided an average improvement of 15% more records that were complete and 20% more records that were accurate and reliable. Furthermore, organizations without data quality tools reported twice the number of significant errors within their records; 22% of their records had these errors.”
Data quality saves man hours, uncovers hidden errors, and deletes duplicate records. The Aberdeen Group’s study also revealed that poor data quality is a top concern. Organizations should deploy a data quality tool so they too can take advantage of its many benefits. It is a logical choice.
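As a small, generic illustration of what a routine data quality step looks like in practice (not DataFusion’s tooling; the records and column names below are invented), here is a sketch that normalizes a field, drops duplicate records, and flags incomplete ones with pandas.

```python
# Sketch of routine data quality steps: cleanse a field, deduplicate, check completeness.
# The records and column names are invented for illustration.
import pandas as pd

records = pd.DataFrame({
    "customer": ["Acme Corp", "acme corp ", "Globex", "Globex", "Initech"],
    "email": ["sales@acme.com", "sales@acme.com", "info@globex.com",
              "info@globex.com", None],
})

# Cleansing: trim whitespace and standardize case before comparing records.
records["customer"] = records["customer"].str.strip().str.lower()

# Deduplication: keep one row per (customer, email) pair.
clean = records.drop_duplicates(subset=["customer", "email"])

# Completeness check: flag records missing an email address.
missing = clean["email"].isna().sum()
print(clean)
print(f"{missing} record(s) missing an email address")
```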
Whitney Grace, September 02, 2014
Sponsored by ArnoldIT.com, developer of Augmentext
Altegrity Gets Fresh Funding
September 2, 2014
Could someone please explain why it is a good idea to pay old debt off with new debt? While there might be a lower interest rate or more time to pay off the loans, it seems Altegrity needs to cut its losses before things get worse for it. BizJournals.com describes Altegrity’s financial state in the article, “Altegrity, USIS’ parent, Remains Buried In Billions In Debt” (sic).
You might recognize Altegrity as the parent company of USIS, which performs background checks for the federal government and is currently under a fraud investigation. The fraud investigation, combined with the debt, places the company in a harsh spotlight. According to Standard & Poor’s Financial Services LLC, anyone who lends to the company stands to recover only 10 percent of the funds if they are lucky.
This does not sound good either:
“And while extending the debt maturities provides ‘a modestly improved capital structure and liquidity profile,’ Altegrity remains on shaky ground to S&P. Its corporate rating of ‘CCC+’ from the ratings agency means it has “very weak financial security characteristics, and is dependent on favorable business conditions to meet financial commitments.” That mirrors the Caa2 rating from Moody’s Investors Service Inc., which a month ago called Altegrity’s debt profile ‘unsustainable.’”
Even worse is that the US government has Altegrity’s fate in its hands. The only news for Altegrity is bad. While we do not give financial advice, in this case, do not invest in this company. Next year does not look good for them at all.
Whitney Grace, September 02, 2014
Sponsored by ArnoldIT.com, developer of Augmentext
Microsoft Shakes Up SharePoint Online to Increase Storage
September 2, 2014
In response to an ever-increasing need for storage, Microsoft has announced changes to the way SharePoint Online manages storage blocks. Read about the latest announcement in the PC World article, “Microsoft Tweaks SharePoint Online to Free Up Site Storage.”
The article begins:
“Microsoft has tweaked the controls in SharePoint Online to let administrators make better use of storage resources allocated to SharePoint websites. The changes seek to make processes more automated, and to add some flexibility in how storage for SharePoint Online is managed within the Office 365 suite. Until now, SharePoint site collections, which are groups of related SharePoint websites, had to be assigned a set amount of storage, and that storage space couldn’t be used for anything else even if some of it went unused.”
Users and administrators will benefit from the increased flexibility. It also shows some effort on the part of Microsoft to improve the SharePoint user experience by taking care of some “no-brainer” flaws in the system. Stephen E. Arnold is a longtime leader in search and continues to keep an eye on the latest news in his SharePoint feed on ArnoldIT.com. Staying on top of these announcements is a great way for organizations to keep increasing their SharePoint efficiency.
Emily Rae Aldridge, September 02, 2014
Huff Po and a Search Vendor Debunk Big Data Myths
September 1, 2014
I suppose I am narrow minded. I don’t associate the Huffington Post with high technology analyses. My ignorance is understandable because I don’t read the Web site’s content.
However, a reader sent me a link to “Top Three Big Data Myths: Debunked,” authored by a search vendor’s employee at Recommind. Now Recommind is hardly a household word. I spoke with a Recommind PR person about my perception that Recommind is a variant of the technology embodied in Autonomy IDOL. Yep, that company making headlines because of the minor dust-up with Hewlett Packard. Recommind provides a probabilistic search system to customers originally involved in the legal market. The company has positioned its technology for other markets and added a touch of predictive magic as well. At its core, Recommind indexes content and makes the indexes available to users and other services. In 2010 the company formed a partnership with the Solcara search folks. Solcara is now the go-to search engine for Thomson Reuters. I have lost track of the other deals in which Recommind has engaged.
The write up reveals quite a bit about the need for search vendors to reach a broader market in order to gain visibility and make the cost of sales bearable. This write up is a good example of content marketing and of the malleability of outfits like the Huffington Post. The idea, it strikes me, is that something that looks interesting may get a shot at building click traffic for Ms. Huffington’s properties.
So what does the article debunk? Fasten your seat belt and take your blood pressure medicine. The content of the write up may jolt you. Ready?
First, the article reveals that “all” data are not valuable. The way the write up expresses it takes this form: “Myth #1—All Data Is Valuable.” Set aside the subject-verb agreement error; data is the plural and datum is the singular. But in this remarkable content marketing essay, grammar is not my or the author’s concern. The notion of categorical propositions applied to data is interesting and raises many questions; for example, what data? So the first myth is that if one is able to gather “all data,” it therefore follows that some of it is not germane. My goodness, I had a heart palpitation with this revelation.
Second, the next myth is that “with Big Data the more information the better.” I must admit this puzzles me. I am troubled by the statistical methods used to filter smaller, yet statistically valid, subsets of data. Obviously the predictive Bayesian methods of Recommind can address this issue. The challenges Autonomy-like systems face are well known to some Autonomy licensees and, I assume, to the experts at Hewlett Packard. The point is that if the training information is off base by a smidge and the flow of content does not conform to the training set, the outputs are often off point. Now with “more information” the sampling purists point to sampling theory and the value of carefully crafted training sets. No problem on my end, but aren’t we emphasizing that certain non-Bayesian methods are just not as wonderful as Recommind’s methods? I think so.
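To make the training-set point concrete in the abstract (this is not Recommind’s or Autonomy’s actual system; the documents, labels, and categories are invented), here is a toy sketch with a probabilistic text classifier. Train it on one vocabulary, feed it drifted content, and the output is no longer trustworthy.

```python
# Toy illustration: a probabilistic (Bayesian) text classifier trained on one
# vocabulary degrades when incoming content drifts away from the training set.
# All documents and labels below are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = [
    "motion to dismiss filed in district court",
    "deposition scheduled for the plaintiff",
    "quarterly earnings beat analyst estimates",
    "board approves dividend for shareholders",
]
train_labels = ["legal", "legal", "finance", "finance"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

# Content that matches the training vocabulary classifies sensibly...
print(model.predict(["plaintiff files motion in court"]))           # likely 'legal'

# ...but drifted content (jargon the model never saw) falls back on priors.
print(model.predict(["telemetry pipeline ingests sensor events"]))  # unreliable
```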
The third myth that the write up “debunks” is “Big Data opportunities come with no costs.” I think this is a convoluted way of saying get ready to spend a lot of money to embrace Big Data. When I flip this debunking on its head, I get this hypothesis: “The Recommind method is less expensive than the Big Data methods that other hype artists are pitching as the best thing since sliced bread.”
The fix is “information governance.” I must admit that, like knowledge management, I have zero idea what the phrase means. Invoking a trade association anchored in document scanning does not give me confidence that an explanation will illuminate the shadows.
Net net: The myths debunked just set up myths for systems based on aging technology. Does anyone notice? Doubt it.
Stephen E Arnold, September 1, 2014
Autumn Approaches: Time for Realism about Search
September 1, 2014
Last week I had a conversation with a publisher who has a keen interest in software that “knows” what content means. Armed with that knowledge, a system can then answer questions.
The conversation was interesting. I mentioned my presentations for law enforcement and intelligence professionals about the limitations of modern and computationally expensive systems.
Several points crystallized in my mind. One of these is addressed, in part, in a diagram created by a person interested in machine learning methods: the estimator selection flowchart published by the scikit-learn project.
The diagram is designed to help a developer select from different methods of performing estimation operations. The author states:
Often the hardest part of solving a machine learning problem can be finding the right estimator for the job. Different estimators are better suited for different types of data and different problems. The flowchart below is designed to give users a bit of a rough guide on how to approach problems with regard to which estimators to try on your data.
First, notice that there is a selection process for choosing a particular numerical recipe. Now who determines which recipe is the right one? The answer is the coding chef. A human exercises judgment about a particular sequence of operations that will be used to fuel machine learning. Is that sequence of actions the best one, the expedient one, or the one that seems to work for the test data? The answer to these questions determines a key threshold for the resulting “learning system.” Stated another way, “Does the person licensing the system know if the numerical recipe is the most appropriate for the licensee’s data?” Nah. Does a mid-tier consulting firm like Gartner, IDC, or Forrester dig into this plumbing? Nah. Does it matter? Oh, yeah. As I point out in my lectures, the “accuracy” of a system’s output depends on this type of plumbing decision. Unlike a backed up drain, flaws in smart systems may never be discerned. For certain operational decisions, financial shortfalls or the loss of an operations team in a war theater can be attributed to one of many variables. As decision makers chase the Silver Bullet of smart, thinking software, who really questions the output in a slick graphic? In my experience, darned few people. That includes cheerleaders for smart software, azure chip consultants, and former middle school teachers looking for a job as a search consultant.
Second, notice the reference to a “rough guide.” The real guide is an understanding of how specific numerical recipes work on a set of data that allegedly represents what the system will process when operational. Furthermore, there are plenty of mathematical methods available. The problem is that some of the more interesting procedures lead to increased computational cost. In a worst case, the more interesting procedures cannot be computed on available resources. Some developers know about P=NP and Big O. Others know to use the same nine or ten mathematical procedures taught in computer science classes. After all, why worry about math based on mereology if the machine resources cannot handle the computations within time and budget parameters? This means that most modern systems are based on a set of procedures that are computationally affordable, familiar, and convenient. Does this similarity of procedures matter? Yep. The generally squirrely outputs from many very popular systems are perceived as completely reliable. Unfortunately, the systems are performing within a narrow range of statistical confidence. Stated more harshly, the outputs are just not particularly helpful.
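Here is a small sketch of the “which recipe” decision in miniature (a generic illustration, not a claim about any vendor’s system): two off-the-shelf scikit-learn estimators fitted to the same bundled sample data yield different accuracy and different fit times. Someone has to pick one, and that pick shapes the output the decision maker sees.

```python
# Sketch: the same data, two different numerical recipes, two different answers.
# Uses scikit-learn's bundled digits sample purely for illustration.
import time
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for estimator in (LogisticRegression(max_iter=2000), RandomForestClassifier()):
    start = time.perf_counter()
    estimator.fit(X_train, y_train)          # the chosen "recipe" does the learning
    elapsed = time.perf_counter() - start
    score = estimator.score(X_test, y_test)  # accuracy on held-out data
    print(f"{type(estimator).__name__}: accuracy={score:.3f}, fit time={elapsed:.2f}s")
```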
In my conversation with the publisher, I asked several questions:
- Is there a smart system like Watson that you would rely upon to treat your teenaged daughter’s cancer? Or, would you prefer the human specialist at the Mayo Clinic or comparable institution?
- Is there a smart system that you want directing your only son in an operational mission in a conflict in a city under ISIS control? Or, would you prefer the human-guided decision near the theater about the mission?
- Is there a smart system you want managing your retirement funds in today’s uncertain economy? Or, would you prefer the recommendations of a certified financial planner relying on a variety of inputs, including analyses from specialists in whom your analyst has confidence?
When I asked these questions, the publisher looked uncomfortable. The reason is that the massive hyperbole and marketing craziness about fancy new systems creates what I call the Star Trek phenomenon. People watch Captain Kirk talking to devices, transporting himself from danger, and traveling between far flung galaxies. Because a mobile phone performs some of the functions of the fictional communicator, it sure seems as if many other flashy sci-fi services should be available.
Well, this Star Trek phenomenon does help direct some research. But in terms of products that can be used in high risk environments, the sci-fi remains a fiction.
Believing and expecting are different from working with products that are limited by computational resources, expertise, and informed understanding of key factors.
Humans, particularly those who need money to pay the mortgage, ignore reality. The objective is to close a deal. When it comes to information retrieval and content processing, today’s systems are marginally better than those available five or ten years ago. In some cases, today’s systems are less useful.
The Importance of Publishing Replication Studies in Academic Journals
September 1, 2014
The article titled “Why Psychologists’ Food Fight Matters” on Slate discusses the lack of replication studies published in academic journals. In most cases, journals are looking for new, exciting information that will draw in their readers. While that is only to be expected, it can also cause huge problems for the scientific method. Replication studies are important because science is built on laws. If a study cannot be replicated, then its findings should not be taken for granted. The article states,
“Since journal publications are valuable academic currency, researchers—especially those early in their careers—have strong incentives to conduct original work rather than to replicate the findings of others. Replication efforts that do happen but fail to find the expected effect are usually filed away rather than published. That makes the scientific record look more robust and complete than it is—a phenomenon known as the “file drawer problem.””
When scientists have an incentive to get positive results from a study, and little to no incentive to do replication studies, the results are obvious. Manipulation of data occurs, and few replication studies are completed. This also means that when the rare replication study is done, and refutes the positive finding, the scientist responsible for the false positive is a scapegoat for a much larger problem. The article suggests that academic journals encouraging more replication studies would assuage this problem.
Chelsea Kerwin, September 01, 2014
Sponsored by ArnoldIT.com, developer of Augmentext
The Abilities and Promise of Watson, IBM’s Reasoning Computer
September 1, 2014
A video on Snapzu.com titled “The Computer That’s Smarter Than YOU & I” offers an explanation of Watson, IBM’s supercomputer. It begins with the dawn of civilization and humankind’s constant innovation since then. With the creation of the microchip, modern technology really began to ramp up, and the video asks (somewhat rhetorically) what the next great technological innovation will be. The answer is: the reasoning computer. The video shows a demo of the supercomputer trying to understand the pros and cons of the sale of violent video games. Watson worked through the topic as follows,
“Scanned approximately 4 million Wikipedia articles. Returning ten most relevant articles. Scanned all three thousand sentences in top ten articles. Detected sentences which contain candidate claims. Identified borders of candidate claims. Assessed pro and con polarity of candidate claims. Constructed demo speech… the sale of violent video games should be banned.”
Watson went on to list his reasons for choosing this stance, such as “exposure to violent video games results in increased physiological arousal.” But he also offered a refutation: the link between the games and actual violent action has not been proven. The ability of the computer to reason on its own, without human aid, is touted as the truly exciting innovation. Meanwhile, we are still waiting for a publicly accessible demo.
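For readers curious about the shape of those narrated steps, here is a deliberately crude sketch of a claim-mining pipeline. It is emphatically not IBM’s method; the cue words, sentences, and scoring rule are invented. It only mirrors the sequence the demo describes: collect sentences, keep candidate claims, assess pro and con polarity, pick a stance.

```python
# Deliberately crude sketch of the pipeline shape the Watson demo narrates:
# collect sentences, keep candidate claims, score pro/con polarity, pick a stance.
# Cue word lists and sentences are invented; this is not IBM's method.
import re

PRO_CUES = {"banned", "increased", "harmful", "aggression"}
CON_CUES = {"unproven", "protected", "freedom", "disputed"}

def mine_claims(sentences):
    verdicts = []
    for sentence in sentences:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        pro = len(words & PRO_CUES)
        con = len(words & CON_CUES)
        if pro or con:                      # a "candidate claim" was detected
            verdicts.append("pro" if pro >= con else "con")
    return verdicts

sentences = [
    "Exposure to violent video games results in increased physiological arousal.",
    "The link between games and violent action remains unproven.",
    "Retailers enjoy the freedom to stock what customers want.",
]
votes = mine_claims(sentences)
stance = "pro-ban" if votes.count("pro") > votes.count("con") else "con-ban"
print(votes, "->", stance)
```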
Chelsea Kerwin, September 01, 2014
Sponsored by ArnoldIT.com, developer of Augmentext