SAS Text Miner Promises Unstructured Insight
July 10, 2015
Big data tools help organizations analyze more than their old, legacy data. While legacy data does help an organization study how its processes have changed, that data is old and does not reflect immediate, real-time trends. SAS offers a product that bridges old data with new, and unstructured data with structured data.
The SAS Text Miner is built from Teragram technology. It features document theme discovery, a function that finds relationships between document collections; automatic Boolean rule generation; high-performance text mining that quickly evaluates large document collections; term profiling and trending, which evaluates term relevance in a collection and how terms are used over time; multiple language support; visual interrogation of results; easy text import; flexible entity options; and a user-friendly interface.
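Term profiling and trending of this sort is conceptually simple. As a rough illustration (this is not SAS code, and the toy document collections are invented), here is what a minimal term-trending pass over two time slices of a collection might look like:

```python
import re
from collections import Counter

def term_counts(docs):
    """Count lowercase word tokens across a list of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(re.findall(r"[a-z']+", doc.lower()))
    return counts

def trending(old_docs, new_docs, top_n=3):
    """Terms whose raw frequency grew the most between two time slices."""
    old, new = term_counts(old_docs), term_counts(new_docs)
    growth = {term: new[term] - old.get(term, 0) for term in new}
    return [term for term, _ in Counter(growth).most_common(top_n)]

# Toy collections: older notes vs. newer notes.
older = ["shipping delay reported", "invoice delay noted"]
newer = ["outage reported in region", "outage caused delay", "outage ticket filed"]
print(trending(older, newer))  # 'outage' ranks first
```

A real product would weight terms (e.g., by relevance rather than raw counts) and handle phrases, but the before/after comparison is the core of trending.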
The SAS Text Miner is specifically programmed to discover relationships in data, automate activities, and determine keywords and phrases. The software uses predictive models to analyze data and discover new insights:
“Predictive models use situational knowledge to describe future scenarios. Yet important circumstances and events described in comment fields, notes, reports, inquiries, web commentaries, etc., aren’t captured in structured fields that can be analyzed easily. Now you can add insights gleaned from text-based sources to your predictive models for more powerful predictions.”
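The idea of folding text-based signals into a predictive model can be pictured with a toy example. This is a speculative sketch, not SAS's method: the lexicon, weights, and `churn_score` formula below are all invented for illustration.

```python
# Assumed negative-sentiment lexicon for the comment field; purely illustrative.
NEGATIVE_TERMS = {"complaint", "cancel", "refund", "broken"}

def text_signal(comment):
    """1.0 if the free-text comment contains any negative term, else 0.0."""
    words = set(comment.lower().split())
    return 1.0 if words & NEGATIVE_TERMS else 0.0

def churn_score(months_inactive, comment):
    """Toy predictive score: one structured feature plus one text-derived feature."""
    return 0.1 * months_inactive + 0.5 * text_signal(comment)

print(churn_score(3, "customer filed a complaint and wants a refund"))  # roughly 0.8
```

The point is simply that the comment field, once mined, becomes an ordinary numeric feature sitting alongside the structured ones.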
Text mining software surfaces insights that span old and new data, making it one of the basic components of big data.
Whitney Grace, July 10, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
How to Search Craigslist
June 21, 2015
Short honk. Looking for an item on Craigslist.org? The main Craigslist.org site wants you to look in your area and then manually grind through listings for other areas region by region. I read “How to Search All Craigslist at Once.” The article does a good job of explaining how to use Google and Ad Huntr. The write-up lists some other Craigslist search tools as well. A happy quack for Karar Halder, who assembled the article.
Stephen E Arnold, June 21, 2015
Content Grooming: An Opportunity for Tamr
June 20, 2015
Think back. Vivisimo asserted that it deduplicated and presented federated search results. There are folks at Oracle who have pointed to Outside In and other file conversion products available from the database company as a way to deal with different types of data. There are specialist vendors, which I will not name, who are today touting their software’s ability to turn a basket of data types into well-behaved rows and columns complete with metatags.
Well, not so fast.
Unifying structured and unstructured information is a time-consuming, expensive process. That is the reason for the obese exception files where objects which cannot be processed go to live out their short, brutish lives.
I read “Tamr Snaps Up $25.2 Million to Unify Enterprise Data.” The stakeholders know, as do I, that unifying disparate types of data is an elephant in any indexing or content analytics conference room. Only the naive believe that software whips heterogeneous data into Napoleonic War parade formations. Today’s software processing tools cannot get undercover police officers to look ship shape for the mayor.
Ergo, an outfit with an aversion to the vowel “e” plans to capture the flag on top of the money pile available for data normalization and information polishing. The write up states:
Tamr can create a central catalogue of all these data sources (and spreadsheets and logs) spread out across the company and give greater visibility into what exactly a company has. This has value on so many levels, but especially on a security level in light of all the recent high-profile breaches. If you do lose something, at least you have a sense of what you lost (unlike with so many breaches).
Tamr is correct. Organizations don’t know what data they have. I could mention a US government agency which does not know what data reside on the server next to another server managed by the same system administrator. But I shall not. The problem is common and it is not confined to bureaucratic blenders in government entities.
Tamr, despite the odd ball spelling, has Michael Stonebraker, a true wizard, on the task. The write up mentions as a customer an outfit that might politely be described as having a “database challenge.” If Thomson Reuters cannot figure out data after decades of effort and millions upon millions in investment, believe me when I point out that Tamr may be on to something.
Stephen E Arnold, June 20, 2015
Chris McNulty at SharePoint Fest Seattle
June 18, 2015
For SharePoint managers and users, continued education and training is essential. There are lots of opportunities for virtual and face-to-face instruction. Benzinga gives some attention to one training option, the upcoming SharePoint Fest Seattle, in their recent article, “Chris McNulty to Lead 2 Sessions and a Workshop at SharePoint Fest Seattle.”
The article begins:
“Chris McNulty will preside over a full day workshop at SharePoint Fest Seattle on August 18th, 2015, as well as conduct two technical training sessions on the 19th and 20th. Both the workshops and sessions are to be held at the Washington State Convention Center in downtown Seattle.”
In addition to all of the great training opportunities at conferences and other face-to-face sessions, staying on top of the latest SharePoint news and online training opportunities is also essential. For a one-stop shop for all the latest SharePoint news, stay tuned to Stephen E. Arnold’s Web site, ArnoldIT.com, and his dedicated SharePoint feed. He has turned his longtime career in search into a helpful Web service for those who need to stay on top of the latest SharePoint happenings.
Emily Rae Aldridge, June 18, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
How Do You Use Your Email?
April 28, 2015
Email is still a relatively new concept in the grander scheme of technology, having only been in wide use since the 1990s. As with any human activity, researchers want to learn more about the trends and habits people have with email. Popular Science has an article called “Here’s What Scientists Learned In The Largest Systematic Study Of Email Habits” with a self-explanatory title. Even though email has been around for over twenty years, no one is quite sure how people use it.
So someone decided to study email usage:
“…researchers from Yahoo Labs looked at emails of two million participants who sent more than 16 billion messages over the course of several months–by far the largest email study ever conducted. They tracked the identities of the senders and the recipients, the subject lines, when the emails were sent, the lengths of the emails, and the number of attachments. They also looked at the ages of the participants and the devices from which the emails were sent or checked.”
The results were said to be so predictable that an algorithm could have predicted them. Usage correlates strongly with age group and gender. The young write short, quick responses, and men are also brief in their emails. People also respond more quickly during work hours, and the more email they receive, the less likely they are to reply. People may already be familiar with these trends, but the data is brand new to data scientists. The article predicts that developers will take the data and design better email platforms.
How about creating an email platform that merges a to-do list with email, so people don’t have to run their schedules and tasks from the inbox?
Whitney Grace, April 28, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Juvenile Journal Behavior
April 28, 2015
Ah, more publisher excitement. Neuroskeptic, a blogger at Discover, weighs in on a spat between scientific journals in, “Academic Journals in Glass Houses….” The write-up begins by printing a charge lobbed at Frontiers in Psychology by the Journal of Nervous and Mental Disease (JNMD), in which the latter accuses the former of essentially bribing peer reviewers. It goes on to explain the back story, and why the blogger feels the claim against Frontiers is baseless. See the article for those details, if you’re curious.
Here’s the part that struck me: Neuroskeptic supplies the example hinted at in his or her headline:
“For the JNMD to question the standards of Frontiers peer review process is a bit of a ‘in glass houses / throwing stones’ moment. Neuroskeptic readers may remember that it was JNMD who one year ago published a paper about a mysterious device called the ‘quantum resonance spectrometer’ (QRS). This paper claimed that QRS can detect a ‘special biological wave… released by the brain’ and thus accurately diagnose schizophrenia and other mental disorders – via a sensor held in the patient’s hand. The article provided virtually no details of what the ‘QRS’ device is, or how it works, or what the ‘special wave’ it is supposed to measure is. Since then, I’ve done some more research and as far as I can establish, ‘QRS’ is an entirely bogus technology. If JNMD are going to level accusations at another journal, they ought to make sure that their own house is in order first.”
This is more support for the conclusion that many of today’s “academic” journals cannot be trusted. Perhaps the profit-driven situation will be overhauled someday, but in the meantime, let the reader beware.
Cynthia Murrell, April 28, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Attensity’s Semantic Annotation Tool “Understands” Emoticons
April 27, 2015
The article on PCWorld titled “For Attensity’s BI Parsing Tool, Emoticons Are No Problem” explains recent attempts at fine-tuning how the conversations about a particular organization or enterprise are monitored and relayed. The amount of data that must be waded through is massive, and it is littered with non-traditional grammar, language, and symbols. Luminoso, with its Compass tool, is another company interested in aiding companies here, in addition to Attensity. The article says,
“Attensity’s Semantic Annotation natural-language processing tool… Rather than relying on traditional keyword-based approaches to assessing sentiment and deriving meaning… takes a more flexible natural-language approach. By combining and analyzing the linguistic structure of words and the relationship between a sentence’s subject, action and object, it’s designed to decipher and surface the sentiment and themes underlying many kinds of common language—even when there are variations in grammatical or linguistic expression, emoticons, synonyms and polysemies.”
The article does not explain how exactly Attensity’s product works, only that it can somehow “understand” emoticons. “Understand” seems like an odd term, though, and most likely refers to looking an emoticon up in a list rather than actually “reading” it. At any rate, Attensity promises that its tool will save hundreds of human work hours.
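The “list lookup” we suspect is easy to picture. As a speculative sketch with no relation to Attensity’s undisclosed implementation (the lexicon and scores below are invented), emoticons can simply be mapped to sentiment values before any linguistic analysis:

```python
# Hypothetical emoticon lexicon; real systems would cover far more variants.
EMOTICON_SENTIMENT = {
    ":)": 1, ":-)": 1, ":D": 2,
    ":(": -1, ":-(": -1, ":'(": -2,
}

def emoticon_score(text):
    """Sum the sentiment values of any known emoticons found in the text."""
    return sum(score for emo, score in EMOTICON_SENTIMENT.items() if emo in text)

print(emoticon_score("love the new release :) :D"))  # 3
print(emoticon_score("support never answered :("))   # -1
```

If that is roughly what happens under the hood, “understanding” is a generous word for a dictionary lookup.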
Chelsea Kerwin, April 27, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Four Visualization Tools to Choose From
February 12, 2015
MakeUseOf offers us a list of graphic-making options in its “4 Data Visualization Tools for Captivating Data Journalism.” Writer Brad Jones describes four options, ranging from the quick and easy to more complex solutions. The first entry, Tableau Public, may be the best place for new users to start. The write-up tells us:
“Data visualization can be a very complex process, and as such the programs and tools used to achieve good results can be similarly complex. Tableau Public, at first glance, is not — it’s a very accommodating, intuitive piece of software to start using. Simply import your data as a text file, an Excel spreadsheet or an Access database, and you’re up and running.
“You can create a chart simply by dragging and dropping various dimensions and measures into your workspace. Figuring out exactly how to produce the sort of visualizations you’re looking for might take some experimentation, but there’s no great challenge in creating simple charts and graphs.
“That said, if you’re looking to go further, Tableau Public can cater to you. It’ll take some time on your part to really understand the breadth of what’s on offer, but it’s a matter of learning a skill rather than the program itself being difficult to use.”
The next entry is Google Fusion Tables, which helpfully links to other Google services, and much of its process is automated. The strengths of Infoactive are its ability to combine datasets and a wealth of options for creating cohesive longer content. Rounding out the list is R, which Jones warns is “obtuse and far from user friendly”; making the most of its capabilities requires a working knowledge of its own programming language. However, he says there is simply nothing better for producing exactly what one needs.
Cynthia Murrell, February 12, 2015
Sponsored by ArnoldIT.com, developer of Augmentext
Fujitsu Creates its Own Hadoop Tool
January 19, 2015
Fujitsu has joined many other companies by taking Hadoop and creating its own software from it to leverage big data. IT Web Open Source’s article, “Fujitsu Makes It Easy For Customers To Reap The Benefits Of Big Data With PRIMEFLEX For Hadoop” divulges the details about the new software.
The new Hadoop application is part of Fujitsu’s PRIMEFLEX line of workload-specific integrated systems. Its purpose is similar to that of much other big data software on the market: harness big data and make use of actionable analytics. Fujitsu describes it as a wonder software:
“Fujitsu has developed PRIMEFLEX for Hadoop to simplify and tame big data. The powerful, dedicated all-in-one hardware cluster is designed to integrate with existing hardware infrastructures, introducing distributed parallel processing based on Cloudera Enterprise Hadoop. This is an open-source software framework which gathers, processes and analyses data from various sources, then puts together and presents the big picture on how to act on the information gathered.”
Fujitsu is a recognized and respected brand, but the big data market is saturated with other companies that offer comparable software. Other companies also started with a Hadoop-based application as part of their software line-ups. Fujitsu is entering the Hadoop analytics market a little late.
Whitney Grace, January 19, 2015
Sponsored by ArnoldIT.com, developer of Augmentext
Organizing Content is a Manual or Automated Pain
January 16, 2015
Organizing uploaded content is a pain in the rear. In order to catalog the content, users either have to add tags manually or use an automated system that requires several tedious fields to be filled out. CMS Wire explains the difficulties with document organization in “Stop Pulling Teeth: A Better Way To Classify Documents.” Manual tagging is the longer of the two processes, and if no one has created a set of tagging standards, tags will be raining down from the cloud in a content mess. Automated fields are not bad to work with if you have one or two documents to upload, but if you have many files to process, you are more prone to enter the wrong information just to finish the job.
Apparently there is a happy medium:
“Encourage users to work with documents the way they normally do and use a third party tool such as an auto classification tool to extract text based content, products, subjects and terms out of the document. This will create good, standardized metadata to use for search refinement. It can even be used to flag sensitive information or report content detected with code names, personally identifiable information such as credit card numbers, social security numbers or phone numbers.”
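The sensitive-information flagging the quote describes can be approximated with simple pattern matching. A minimal sketch follows; the patterns are deliberately simplified and would miss many real-world formats, so treat this as an illustration of the idea rather than production-grade detection:

```python
import re

# Simplified, illustrative patterns; real detectors handle many more formats
# and validate matches (e.g., credit card checksums).
PII_PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def flag_pii(text):
    """Return the names of any PII patterns detected in the document text."""
    return sorted(name for name, pat in PII_PATTERNS.items() if pat.search(text))

doc = "Call 555-867-5309 re: card 4111 1111 1111 1111."
print(flag_pii(doc))  # ['credit_card', 'phone']
```

A classifier built this way can tag a document at upload time with no manual field-filling at all, which is exactly the pitch the article makes.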
While the suggestion is sound, we thought that auto-classification tools were normally built into collaborative content platforms like SharePoint. Apparently not. Third-party software to improve enterprise platforms once more saves the day for the digital paper pusher.
Whitney Grace, January 16, 2015
Sponsored by ArnoldIT.com, developer of Augmentext