February 9, 2016
I noted a blog post called “From Discovery to Selection: Announcing the Seattle Accelerator’s Third Batch.” The post lists companies which Microsoft wants to nurture. Here’s the list:
- Affinio: Audience insights
- Agolo: Summarization of text
- Clarify: Rich media search
- Defined Crowd: Natural language processing
- Knomos: Palantir style analysis
- Medwhat: Doctor made of soft software
- OneBridge: Middleware for Microsoft cloud
- Percolata: Retail staff monitoring
- Plexuss: Palantir style analysis
- Sim Machines: Similarity search and pattern recognition
Net net: Microsoft continues to hunt for solutions in search and analytics. There is a touch of “me too” in the niche plays too. Persistence is a virtue.
Stephen E Arnold, February 9, 2016
February 2, 2016
A friend recently told me how she can go months avoiding suspicious emails, spyware, and Web sites on her computer, but the moment she hands her laptop over to her father he downloads a virus within an hour. Despite the technology gap existing between generations, the story goes to show how easy it is to deceive and steal information these days. ExpertClick thinks that metadata might hold the future means for cyber security in “What Metadata And Data Analytics Mean For Data Security-And Beyond.”
The article uses a biological analogy to explain metadata’s importance: “One of my favorite analogies is that of data as proteins or molecules, coursing through the corporate body and sustaining its interrelated functions. This analogy has a special relevance to the topic of using metadata to detect data leakage and minimize information risk — but more about that in a minute.”
This plays into new companies like Ayasdi, which use data to reveal new correlations with methods other than the standard statistical ones. The article compares this to getting to the atomic level of data, where data scientists will be able to separate data into different elements and increase the complexity of the analysis.
“The truly exciting news is that this concept is ripe for being developed to enable an even deeper type of data analytics. By taking the ‘Shape of Data’ concept and applying to a single character of data, and then capturing that shape as metadata, one could gain the ability to analyze data at an atomic level, revealing a new and unexplored frontier. Doing so could bring advanced predictive analytics to cyber security, data valuation, and counter- and anti-terrorism efforts — but I see this area of data analytics as having enormous implications in other areas as well.”
There are more devices connected to the Internet than ever before, and 2016 could be the year we see a significant rise in cyber attacks. New approaches to interpreting data, built on predictive and proactive analytics, could offer fresh ways to fight security breaches.
December 16, 2015
We thought it was a problem if law enforcement officials did not know how the Internet and the Dark Web work or what eDiscovery tools can do, but a law firm that does not know how to work with data-mining tools, much less grasp the importance of technology, is losing credibility, profit, and evidence for cases. According to Information Week in “Data, Lawyers, And IT: How They’re Connected,” the modern law firm needs to understand how eDiscovery tools, predictive coding, and data science work and how they can benefit its cases.
It can be daunting trying to understand how new technology works, especially in a law firm. The article explains how the above tools and others work in four key segments: what role data plays before trial, how it is changing the courtroom, how new tools pave the way for unprecedented approaches to law practice, and how data is improving law firm operations.
Data in pretrial amounts to one word: evidence. People live their lives via their computers and create a digital trail without realizing it. With a few eDiscovery tools, lawyers can assemble all necessary information within hours. Data tools in the courtroom make practicing law seem like a scenario out of a fantasy or science fiction novel. Lawyers are able to immediately pull up information to use as evidence for cross-examination or to validate facts. New eDiscovery tools are also useful because they allow lawyers to prepare their arguments based on the judge and jury pool. More data is now available on individual cases, not just the big-name ones.
“The legal industry has historically been a technology laggard, but it is evolving rapidly to meet the requirements of a data-intensive world.
‘Years ago, document review was done by hand. Metadata didn’t exist. You didn’t know when a document was created, who authored it, or who changed it. eDiscovery and computers have made dealing with massive amounts of data easier,’ said Robb Helt, director of trial technology at Suann Ingle Associates.”
Legal eDiscovery is one of the main branches of big data that have skyrocketed in the past decade. While the examples discussed here are employed by respected law firms, keep in mind that eDiscovery technology is still new. Ambulance chasers and other law firms probably do not have a full IT squad on staff, so when choosing a lawyer, ask about the firm’s eDiscovery capabilities.
November 3, 2015
The latest version of the TemaTres vocabulary server is now available, we learn from the company’s blog post, “TemaTres 2.0 Released.” Released under the GNU General Public License version 2.0, the web application helps manage taxonomies, thesauri, and multilingual vocabularies. The web application can be downloaded at SourceForge. Here’s what has changed since the last release:
*Export to Moodle your vocabulary: now you can export to Moodle Glossary XML format
*Metadata summary about each term and about your vocabulary (data about terms, relations, notes and total descendants terms, deep levels, etc)
*New report: reports about terms with mapping relations, terms by status, preferred terms, etc.
*New report: reports about terms without notes or specific type of notes
*Import the notes type defined by user (custom notes) using tagged file format
*Select massively free terms to assign to other term
*Improve utilities to take terminological recommendations from other vocabularies (more than 300: http://www.vocabularyserver.com/vocabularies/)
*Update Zthes schema to Zthes 1.0 (Thanks to Wilbert Kraan)
*Export the whole vocabulary to Metadata Authority Description Schema (MADS)
*Fixed bugs and improved several functional aspects.
*Uses Bootstrap v3.3.4
See the server’s SourceForge page, above, for the full list of features. Though as of this writing only 21 users had rated the product, all seemed very pleased with the results. The TemaTres website notes that running the server requires some other open source tools: PHP, MySQL, and an HTTP Web server. It also specifies that, to update from version 1.82, users should keep the db.tematres.php file but replace the code. To update from TemaTres 1.6 or earlier, first go in as an administrator and update to version 1.7 through Menu -> Administration -> Database Maintenance.
Cynthia Murrell, November 3, 2015
October 26, 2015
An apt metaphor to explain big data is the act of braiding. Braiding requires a person to take three or more locks of hair and weave them together in alternation. The end result is a clean, pretty hairstyle that keeps a person’s hair in place and off the face. Big data is like braiding because specially tailored software takes an unruly mess of data, including the combed and uncombed strands, and organizes it into a legible format. Perhaps this is why TopQuadrant named its popular big data software TopBraid. Read more about its software upgrade in “TopQuadrant Launches TopBraid 5.0.”
TopBraid Suite is an enterprise Web-based solution set that simplifies the development and management of standards-based, model driven solutions focused on taxonomy, ontology, metadata management, reference data governance, and data virtualization. The newest upgrade for TopBraid builds on the current enterprise information management solutions and adds new options:
“‘It continues to be our goal to improve ways for users to harness the full potential of their data,’ said Irene Polikoff, CEO and co-founder of TopQuadrant. ‘This latest release of 5.0 includes an exciting new feature, AutoClassifier. While our TopBraid Enterprise Vocabulary Net (EVN) Tagger has let users manually tag content with concepts from their vocabularies for several years, AutoClassifier completely automates that process.’”
The AutoClassifier makes it easier to add and edit tags before making them a part of the production tag set. Other new features are for TopBraid Enterprise Vocabulary Net (TopBraid EVN), TopBraid Reference Data Manager (RDM), TopBraid Insight, and the TopBraid platform, including improvements in internationalization and a new component for increasing system availability in enterprise environments, TopBraid DataCache.
TopBraid might be the solution an enterprise system needs to braid its data into style.
Whitney Grace, October 26, 2015
September 23, 2015
Here’s an interesting project: we received an announcement about funding for Pop Up Archive: Search Your Sound. A joint effort of the WGBH Educational Foundation and the American Archive of Public Broadcasting, the venture’s goal is nothing less than to make almost 40,000 hours of Public Broadcasting media content easily accessible. The American Archive, now under the care of WGBH and the Library of Congress, has digitized that wealth of sound and video. Now, the details are in the metadata. The announcement reveals:
“As we’ve written before, metadata creation for media at scale benefits from both machine analysis and human correction. Pop Up Archive and WGBH are combining forces to do just that. Innovative features of the project include:
*Speech-to-text and audio analysis tools to transcribe and analyze almost 40,000 hours of digital audio from the American Archive of Public Broadcasting
*Open source web-based tools to improve transcripts and descriptive data by engaging the public in a crowdsourced, participatory cataloging project
*Creating and distributing data sets to provide a public database of audiovisual metadata for use by other projects.
“In addition to Pop Up Archive’s machine transcripts and automatic entity extraction (tagging), we’ll be conducting research in partnership with the HiPSTAS center at University of Texas at Austin to identify characteristics in audio beyond the words themselves. That could include emotional reactions like laughter and crying, speaker identities, and transitions between moods or segments.”
The project just received almost $900,000 in funding from the Institute of Museum and Library Services. This loot is on top of the grant received in 2013, from the Corporation for Public Broadcasting, that got the project started. But will it be enough money to develop a system that delivers on-point results? If not, we may be stuck with something clunky, something that resembles the old Autonomy Virage, Blinkx, Exalead video search, or Google YouTube search. Let us hope this worthy endeavor continues to attract funding so that, someday, anyone can reliably (and intuitively) find valuable Public Broadcasting content.
Cynthia Murrell, September 23, 2015
June 18, 2015
When it comes to enterprise technology these days, it is all about making software compliant for a variety of platforms and needs. Compliance is the name of the game for Basho, says Diginomica’s article, “Basho Aims For Enterprise Operational Simplicity With New Data Platform.” Basho’s upgrade to its Riak Data Platform aims to integrate more closely with related tools and to make complex operational environments simpler. Data management and automation tools are another big seller for NoSQL enterprise databases, and Basho added those to the Riak upgrade as well. Basho is not the only company trying to improve NoSQL enterprise platforms; others include MongoDB and DataStax. Basho’s advantage is delivering a solution using the Riak data platform.
Basho’s data platform already offers, in nearly automated form, a variety of functions that people struggle to get working with a NoSQL database: Riak Search with Apache Solr, orchestration services, Apache Spark Connector, integrated caching with Redis, and simplified development through data replication and synchronization.
“CEO Adam Wray released some canned comment along with the announcement, which indicates that this is a big leap for Basho, but also is just the start of further broadening of the platform. He said:
‘This is a true turning point for the database industry, consolidating a variety of critical but previously disparate services to greatly simplify the operational requirements for IT teams working to scale applications with active workloads. The impact it will have on our users, and on the use of integrated data services more broadly, will be significant. We look forward to working closely with our community and the broader industry to further develop the Basho Data Platform.’”
The article explains that the NoSQL market continues to grow and that enterprises need management as well as automation to handle the growing number of tasks databases are used for. While a complete solution for all NoSQL needs has yet to be developed, Basho comes fairly close.
Whitney Grace, June 18, 2015
April 8, 2015
Anyone interested in the mechanics behind image search should check out the description of PicSeer: Search Into Images from YangSky. The product write-up goes into surprising detail about what sets their “cognitive & semantic image search engine” apart, complete with comparative illustrations. The page’s translation seems to have been done either quickly or by machine, but don’t let the awkward wording in places put you off; there’s good information here. The text describes the competition’s approach:
“Today, the image searching experiences of all major commercial image search engines are embarrassing. This is because these image search engines are
- Using non-image correlations such as the image file names and the texts in the vicinity of the images to guess what are the images all about;
- Using low-level features, such as colors, textures and primary shapes, of image to make content-based indexing/retrievals.”
With the first approach, they note, trying to narrow the search terms is inefficient because the software is looking at metadata instead of inspecting the actual image; any narrowed search excludes many relevant entries. The second approach above simply does not consider enough information about images to return the most relevant, and only most relevant, results. The write-up goes on to explain what makes their product different, using for their example an endearing image of a smiling young boy:
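To make the first complaint concrete, here is a minimal Python sketch of a search that looks only at file names and nearby text. Everything in it is hypothetical: the `ImageRecord` class, the `metadata_search` function, and the sample records are invented for illustration, and none of it reflects how PicSeer or any commercial engine actually works. It simply shows how narrowing a metadata-only query can drop images that are perfectly relevant.

```python
from dataclasses import dataclass


@dataclass
class ImageRecord:
    filename: str
    nearby_text: str
    # What a person would actually see in the picture. A metadata-only
    # engine never inspects this field; it is here only for comparison.
    actual_content: str


def metadata_search(records, query):
    """Return records whose file name or nearby text contains every query term."""
    terms = query.lower().split()
    hits = []
    for record in records:
        haystack = f"{record.filename} {record.nearby_text}".lower()
        if all(term in haystack for term in terms):
            hits.append(record)
    return hits


# Hypothetical sample data: all three pictures show a smiling boy.
RECORDS = [
    ImageRecord("boy_smiling.jpg", "A boy smiles at the camera", "smiling boy"),
    ImageRecord("IMG_0042.jpg", "Photos from our trip", "smiling boy"),
    ImageRecord("boy_portrait.jpg", "School picture day", "smiling boy"),
]

broad = metadata_search(RECORDS, "boy")
narrow = metadata_search(RECORDS, "boy smiles")

print(len(broad), "hits for 'boy'")
print(len(narrow), "hits for 'boy smiles'")
```

In this toy data set the broad query already misses the camera-named file, and the narrowed query keeps only one of three equally relevant pictures, which is the inefficiency the write-up describes.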
“How can PicSeer have this kind of understanding towards images? The Physical Linguistic Vision Technologies have can represent cognitive features into nouns and verbs called computational nouns and computational verbs, respectively. In this case, the image of the boy is represented as a computational noun ‘boy’ and the facial expression of the boy is represented by a computational verb ‘smile’. All these steps are done by the computer itself automatically.”
See the write-up for many more details, including examples of how Google handles the “boy smiles” query. (Be warned: there’s a very brief section about porn filtering that includes a couple of censored screenshots and adult keyword examples.) It looks like image search technology is progressing apace.
Cynthia Murrell, April 08, 2015
Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com
January 13, 2015
Germany’s foreign intelligence arm (BND) refuses to be outdone by our NSA. The World Socialist Web Site reports, “German Foreign Intelligence Service Plans Real-Time Surveillance of Social Networks.” The agency plans to invest €300 million by 2020 to catch up to the (Snowden-revealed) capabilities of U.S. and U.K. agencies. The stated goal is to thwart terrorism, of course, but reporter Sven Heymann is certain the initiative has more to do with tracking political dissidents who oppose the austerity policies of recent years.
Whatever the motivation, the BND has turned its attention to the wealth of information to be found in metadata. Smart spies. Heymann writes:
“While previously, there was mass surveillance of emails, telephone calls and faxes, now the intelligence agency intends to focus on the analysis of so-called metadata. This means the recording of details on the sender, receiver, subject line, and date and time of millions of messages, without reading their content.
“As the Süddeutsche Zeitung reported, BND representatives are apparently cynically attempting to present this to parliamentary deputies as the strengthening of citizens’ rights and freedoms in order to sell the proposal to the public.”
“In fact, the analysis of metadata makes it possible to identify details about a target person’s contacts. The BND is to be put in a position to know who is communicating with whom, when, and by what means. As is already known, the US sometimes conducts its lethal and illegal drone attacks purely on the basis of metadata.”
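For readers wondering how much sender, receiver, and time fields can reveal on their own, the following Python sketch may help. It is an illustration only: the message records, addresses, and function names are made up, and it makes no claim about the tools any intelligence agency uses. It tallies who communicates with whom, and at what hours, without touching message content.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical metadata records: no message bodies, only envelope details.
MESSAGES = [
    {"sender": "alice@example.org", "receiver": "bob@example.org",
     "subject": "Meeting", "sent": "2015-01-05T09:15:00"},
    {"sender": "bob@example.org", "receiver": "alice@example.org",
     "subject": "Re: Meeting", "sent": "2015-01-05T09:40:00"},
    {"sender": "alice@example.org", "receiver": "carol@example.org",
     "subject": "Dinner", "sent": "2015-01-05T21:05:00"},
    {"sender": "alice@example.org", "receiver": "bob@example.org",
     "subject": "Agenda", "sent": "2015-01-06T09:55:00"},
    {"sender": "dave@example.org", "receiver": "carol@example.org",
     "subject": "Invoice", "sent": "2015-01-06T14:00:00"},
]


def build_contact_graph(messages):
    """Count messages exchanged between each pair, in both directions."""
    graph = defaultdict(Counter)
    for msg in messages:
        graph[msg["sender"]][msg["receiver"]] += 1
        graph[msg["receiver"]][msg["sender"]] += 1
    return graph


def contacts_of(graph, person):
    """List a person's contacts, most frequent first."""
    return graph.get(person, Counter()).most_common()


def active_hours(messages, person):
    """Count the hours of the day in which a person sends or receives."""
    hours = Counter()
    for msg in messages:
        if person in (msg["sender"], msg["receiver"]):
            hours[datetime.fromisoformat(msg["sent"]).hour] += 1
    return hours


graph = build_contact_graph(MESSAGES)
alice_contacts = contacts_of(graph, "alice@example.org")
alice_hours = active_hours(MESSAGES, "alice@example.org")
print(alice_contacts)
print(dict(alice_hours))
```

Even with five invented records, the output shows a primary contact and a daily rhythm for one person. Scale that to millions of messages and the point of the passage above becomes clear.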
The article tells us the BND is also looking into the exploitation of newly revealed security weaknesses in common software, as well as tools to falsify biometric-security images (like fingerprints or iris scans). Though Germany’s intelligence agents are prohibited by law from spying on their own people, Heymann has little confidence that rule will be upheld. After all, so is the NSA.
Cynthia Murrell, January 13, 2015
November 28, 2014
As the Internet grows and evolves, the features users expect from search and content management systems are changing. SearchContentManagement addresses the shift in “Semantic Technologies Fuel the Web Experience Wave.” As the title suggests, writer Geoffrey Bock sees this shift as opening a new area with a new set of demands: “web experience management” (WEM) goes beyond “web content management” (WCM).
The inclusion of metadata and contextual information makes all the difference. For example, the information displayed by an airline’s site should, he posits, be different for a user working at their PC, who may want general information, and someone using their phone in the airport parking lot, where they probably need to check their gate number or see whether their flight has been delayed. (Bock is disappointed that none of the airlines’ sites yet work this way.)
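The airline scenario can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not anything Bock or an airline has published: the `RequestContext` class, the `select_content` function, the content labels, the 5 km radius, and the airport coordinates are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

# Illustrative airport coordinates (latitude, longitude); an assumption.
AIRPORT = (47.4502, -122.3088)


@dataclass
class RequestContext:
    device: str                               # "desktop" or "phone"
    location: Optional[Tuple[float, float]]   # geo-code from the device, if shared


def distance_km(a, b):
    """Great-circle distance between two (lat, lon) points, haversine formula."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def select_content(ctx, airport=AIRPORT, radius_km=5.0):
    """Pick content snippets from device type and proximity to the airport."""
    near_airport = (ctx.location is not None
                    and distance_km(ctx.location, airport) <= radius_km)
    if ctx.device == "phone" and near_airport:
        return ["gate_number", "flight_status"]
    return ["general_information"]


at_parking_lot = select_content(RequestContext("phone", (47.4500, -122.3100)))
at_desk = select_content(RequestContext("desktop", None))
phone_far_away = select_content(RequestContext("phone", (40.0, -100.0)))
print(at_parking_lot, at_desk, phone_far_away)
```

The design choice worth noting is that the decision uses only contextual metadata, namely device type and geo-code, which is the kind of added intelligence the article argues for.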
The article continues:
“Not surprisingly, to make contextually aware Web content work correctly, a lot of intelligence needs to be added to the underlying information sources, including metadata that describes the snippets, as well as location-specific geo-codes coming from the devices themselves. There is more to content than just publishing and displaying it correctly across multiple channels. It is important to pay attention to the underlying meaning and how content is used — the ‘semantics’ associated with it.
“Another aspect of managing Web experiences is to know when you are successful. It’s essential to integrate tracking and monitoring capabilities into the underlying platform, and to link business metrics to content delivery. Counting page views, search terms and site visitors is only the beginning. It’s important for business users to be able to tailor metrics and reporting to the key performance indicators that drive business decisions.”
Bock supplies an example of one company, specialty-plumbing supplier Uponor, that is making good use of such “WEM” possibilities. See the article for more details on his strategy for leveraging the growing potential of semantic technology.
Cynthia Murrell, November 28, 2014