Need a 1.3 Gb Corpus with a Million Text Objects?

July 12, 2015

Short honk: If you have a search and content processing system, you might want to navigate to this link. You can access the Hacker News data dump. My thought would be for the Watson team to process this information and then put up a demo of the Watson system using the Hacker News content. Any other search and content processing vendors game? The content is interesting, and the corpus is beefy enough to provide interesting results.

Stephen E Arnold, July 12, 2015

Dealing with Company and Product Identity: Terbium Labs Nails It

July 11, 2015

Navigate to and read about the company.


Nifty name. Very nifty name indeed. Now, a bit of branding commentary.

I used to work at Halliburton Nuclear. Ah, the good old days of nuclear engineers poking fun at civil engineers and mathematicians not understanding any joke made by the computer engineers.

The problem of naming companies in high technology disciplines is a very big one. Before Halliburton gobbled up the Nuclear Utility Services outfit, the company with more than 400 nuclear engineers on staff struggled with its name. Nuclear Utility Services was abbreviated to NUS. A pretty sharp copywriter named Richard Harrington of the dearly loved Ketchum, McLeod and Gove ad agency came up with this catchy line:

After the EPA, call NUS.

The important point is that Mr. Harrington, a whiz person, wanted to have people read each letter: E-P-A, not “eepa,” and N-U-S, not “noose.” In Japanese, the sound “nus” has a negative meaning usually applied to pressurized body odor emissions. Not good.

Search and content processing vendors struggle with names. I have written about outfits which have fumbled the branding ball. Examples include Thunderstone, whose name has been usurped by a gaming company; Brainware, whose name has been snagged and used for interesting videos; and Smartlogic, whose name has been appropriated by a smaller outfit doing marketing/design stuff. There are names which are impossible to find; for example, i2, AMI, and ChaCha, to name a few among many.

I want to call attention to a quite useful product naming which I learned about recently. Navigate to Consider the word Terbium. Look for the word “Matchlight.”

I find Terbium a darned good word because terbium is an element, which my old (and I mean old) chemistry professor pronounced “ter-beem.” The element has a number of useful applications. Think solid state devices and use as a magic ingredient in some rocket fuels and (okay, okay) some explosives.

But as good as “terbium” is for a company I absolutely delight in this product name:


Now what’s Matchlight and why should anyone care? My hunch is that the technology, which allows a next generation approach to content identification and other functions, works to

  • light a match in the wilderness
  • illuminate a dark space
  • start a camp fire so I can cook a goose

You can and should learn more about Terbium Labs and its technology. The names will help you remember.

Important company; important technology. Great name Matchlight. (Hear that search and content processing vendors with dud names?)

Stephen E Arnold, July 11, 2015

SAS Text Miner Promises Unstructured Insight

July 10, 2015

Big data tools help organizations analyze more than their old, legacy data. While legacy data does help an organization study how its processes have changed, the data is old and does not reflect immediate, real time trends. SAS offers a product that bridges old data with the new, as well as unstructured with structured data.

The SAS Text Miner is built from Teragram technology. It features document theme discovery, a function that finds relations between document collections; automatic Boolean rule generation; high performance text mining that quickly evaluates large document collections; term profiling and trending, which evaluates term relevance in a collection and how terms are used; multiple language support; visual interrogation of results; easy text import; flexible entity options; and a user friendly interface.

The SAS Text Miner is specifically programmed to discover data relationships, automate activities, and determine keywords and phrases. The software uses predictive models to analyze data and discover new insights:

“Predictive models use situational knowledge to describe future scenarios. Yet important circumstances and events described in comment fields, notes, reports, inquiries, web commentaries, etc., aren’t captured in structured fields that can be analyzed easily. Now you can add insights gleaned from text-based sources to your predictive models for more powerful predictions.”

Text mining software reveals insights between old and new data, making it one of the basic components of big data.

Whitney Grace, July 10, 2015

Sponsored by, publisher of the CyberOSINT monograph

Lexmark: Brainware, ISYS, and Kofax May Not Be Enough

July 5, 2015

Here I am. Sitting in the misty morn contemplating layoffs in the Louisville-Lexington region. At a Fourth of July party, the founder of a large Kentucky-based business reassured his listeners that there would be almost no layoffs as a result of the Aetna-Humana deal. I yawned.

My mind was not attending to the woes of Humana’s soon to be unemployed thousands. I was considering the news item I had just read on my trusty Blackberry Classic (right, no iPhone for me, gentle reader).

The short item was “Insider Selling: Lexmark International CFO David Reeder Sells 7,283 Shares of Stock (LXK).” Who was doing the selling? The person was David Reeder, the Lexmark chief financial officer. Perhaps Mr. Reeder has to send a child to school or must replace a cracking concrete driveway?

Lexmark beat some analyst estimates in its April 2015 quarterly statement. What’s the big deal?

The write up reports:

Several analysts have recently commented on the stock. Analysts at Goldman Sachs initiated coverage on shares of Lexmark International in a research note on Wednesday, June 17th. They set a “sell” rating and a $34.00 price target on the stock. Analysts at Zacks downgraded shares of Lexmark International from a “hold” rating to a “sell” rating in a research note on Wednesday, June 3rd. Analysts at Cross Research upgraded shares of Lexmark International from a “sell” rating to a “hold” rating and raised their price target for the stock from $36.00 to $43.00 in a research note on Thursday, May 14th. Analysts at Brean Capital reiterated a “hold” rating on shares of Lexmark International in a research note on Thursday, April 30th. Finally, analysts at TheStreet upgraded shares of Lexmark International from a “hold” rating to a “buy” rating in a research note on Tuesday, April 28th. Five analysts have rated the stock with a sell rating, four have issued a hold rating and two have assigned a buy rating to the company. The company currently has an average rating of “Hold” and an average target price of $39.29.

My question is, “Will revenues from the content processing acquisitions ignite Lexmark’s revenues and pump up the profits?” My research suggests that Lexmark may find that making big money from content centric software is no picnic on a warm sunny day.

I am rooting for the printer company, but I am a realist. Some Lexmarkians may want to keep their résumés sparkling and bright. When a CFO sells shares, I pay attention.

Stephen E Arnold, July 5, 2015

Silobreaker Takes Gold and Silver in Online Decathlon

July 4, 2015

Short honk: I have been a fan of the Silobreaker system, which is available for commercial and governmental content processing. This morning I read the Network Products Guide “New Products and Service: Winners 10th Annual 2015 IT Awards” recommended solutions league table. Silobreaker was founded by a couple of wizards with military and commercial experience. According to the league table, the Silobreaker content processing and information access system is the top dog for applications centering on Europe, the Middle East and Asia. This means that the system’s multi-lingual capabilities were the best, according to the Network Products Guide’s editors. The company also nailed a silver medal for US focused solutions. You can get more information about Silobreaker at Sign up. Join the thousands of users who want to work with a winner.

Stephen E Arnold, July 4, 2015

CSC Attracts Buyer And Fraud Penalties

July 1, 2015

According to the Reuters article “Exclusive: CACI, Booz Allen, Leidos Eyes CSC’s Government Unit-Sources,” CACI International, Leidos Holdings, and Booz Allen Hamilton Holdings have expressed interest in Computer Sciences Corp’s public sector division. There are not a lot of details about the possible transaction as it is still in the early stages, so everything is still hush-hush.

The possible acquisition came after the news that CSC will split into two divisions: one that serves US public sector clients and the other dedicated to global commercial and non-government clients. CSC has an estimated $4.1 billion in revenues and is worth $9.6 billion, but CACI International, Leidos Holdings, and Booz Allen Hamilton might reconsider the purchase or push for a lower price after hearing this news: “Computer Sciences (CSC) To Pay $190M Penalty; SEC Charges Company And Former Executives With Accounting Fraud” from Street Insider. The Securities and Exchange Commission has charged CSC and former executives with accounting fraud, and CSC is to pay a $190 million penalty, for hiding financial information and problems resulting from the contract the company had with its biggest client. CSC and the executives, of course, are contesting the charges.

“The SEC alleges that CSC’s accounting and disclosure fraud began after the company learned it would lose money on the NHS contract because it was unable to meet certain deadlines. To avoid the large hit to its earnings that CSC was required to record, Sutcliffe allegedly added items to CSC’s accounting models that artificially increased its profits but had no basis in reality. CSC, with Laphen’s approval, then continued to avoid the financial impact of its delays by basing its models on contract amendments it was proposing to the NHS rather than the actual contract. In reality, NHS officials repeatedly rejected CSC’s requests that the NHS pay the company higher prices for less work. By basing its models on the flailing proposals, CSC artificially avoided recording significant reductions in its earnings in 2010 and 2011.”

Oh boy!  Is it a wise decision to buy a company that has a history of stealing money and hiding information?  If the company’s root products and services are decent, the buyers might get it for a cheap price and recondition the company.  Or it could lead to another disaster like HP and Autonomy.

Whitney Grace, July 1, 2015

Sprylogics Repositioned to Mobile Search

June 20, 2015

I learned about in a briefing in a gray building in a gray room with gray carpeting. The person yapping explained how i2 Ltd.-type relationship analysis was influencing certain intelligence-centric software. I jotted down some urls the speaker mentioned.

When I returned to my office, I checked out the urls. I found the service interesting. The system allowed me to run a query and review results with inline extracts and relationship visualizations among entities. In that 2007 version of’s system, I found the presentation, the inclusion of emails, phone numbers, and parent child relationships quite useful. The demonstration used queries passed against Web indexes. Technically, belonged to the category of search systems which I call “metasearch” engines. The Googles and Yahoos index the Web; added value. Nifty.

I chased down Alex Zivkovic, the individual then identified as the chief technical professional at Sprylogics. You can read my 2008 interview with Zivkovic in my Search Wizards Speak collection. The system originated with a former military professional’s vision for information analysis. According to Zivkovic, the prime mover for was Avi Shachar. At the time of the interview, the company focused on enterprise customers.

Zivkovic told me in 2008:

We have clustering. We have entity extraction. We have a relationship analysis in a graph format. I want to point out that for enterprise applications, the functions are significantly more rich. For example, a query can be run across internal content and external content. The user sees that the internal information is useful but not exactly on point. Our graph technology makes it easy for the user to spot useful information from an external source such as the Web in conjunction with the internal information. With a single click, the user can be looking into those information objects. We think we have come up with a very useful way to allow an organization to give its professionals an efficient way to search for content that is behind the firewall and on the Web. The main point, however, is that user does not have to be trained. Our graphical interface makes it obvious what information is available from which source. Instead of formulating complex queries, the person doing the search can scan, click, and browse. Trips back to the search box are options, not mandatory.

I visited the Web site the other day and learned that the technology has been repackaged as a mobile search solution and real time sports application.

There is a very good explanation of the company’s use of its technology in a more consumer friendly presentation. You can find that presentation at this link, but the material can be removed at any time, so don’t blame me if the link is dead when you try to review the explanation of the 2015 version of Sprylogics.

From my point of view, Sprylogics’ repositioning is an excellent example of how technology designed for intelligence professionals can be packaged into a consumer application. The firm has more than a dozen patents, which some search and content processing companies cannot match. The semantic functions and the system’s ability to process Web content in near real time make the firm’s Poynt product interesting to me.

Sprylogics’ approach, in my opinion, is a far more innovative way to leverage advanced content processing capabilities than the approaches taken by most search vendors. It is easier to slap a customer relationship management, customer support, or business intelligence label on what is essentially search and retrieval software than to create a consumer facing app.

Kudos to Sprylogics. The ArnoldIT team hopes their stock, which is listed on the Toronto Stock Exchange, takes wing.

Stephen E Arnold, June 20, 2015

Content Grooming: An Opportunity for Tamr

June 20, 2015

Think back. Vivisimo asserted that it deduplicated and presented federated search results. There are folks at Oracle who have pointed to Outside In and other file conversion products available from the database company as a way to deal with different types of data. There are specialist vendors, which I will not name, who are today touting their software’s ability to turn a basket of data types into well-behaved rows and columns complete with metatags.

Well, not so fast.

Unifying structured and unstructured information is a time consuming, expensive process. That is the reason for the obese exception files, where objects which cannot be processed go to live out their short, brutish lives.

I read “Tamr Snaps Up $25.2 Million to Unify Enterprise Data.” The stakeholders know, as do I, that unifying disparate types of data is an elephant in any indexing or content analytics conference room. Only the naive believe that software whips heterogeneous data into Napoleonic War parade formations. Today’s software processing tools cannot get undercover police officers to look ship shape for the mayor.

Ergo, an outfit with an aversion to the vowel “e” plans to capture the flag on top of the money pile available for data normalization and information polishing. The write up states:

Tamr can create a central catalogue of all these data sources (and spreadsheets and logs) spread out across the company and give greater visibility into what exactly a company has. This has value on so many levels, but especially on a security level in light of all the recent high-profile breaches. If you do lose something, at least you have a sense of what you lost (unlike with so many breaches).

Tamr is correct. Organizations don’t know what data they have. I could mention a US government agency which does not know what data reside on the server next to another server managed by the same system administrator. But I shall not. The problem is common and it is not confined to bureaucratic blenders in government entities.

Tamr, despite the oddball spelling, has Michael Stonebraker, a true wizard, on the task. The write up mentions as a customer an outfit which might be politely described as a “database challenge.” If Thomson Reuters cannot figure out data after decades of effort and millions upon millions of investment, believe me when I point out that Tamr may be on to something.

Stephen E Arnold, June 20, 2015

Search Cheerleader Seeks Text Analytics Unicorns

June 12, 2015

The Venture Beat article, whimsically titled “Where Are the Text Analytics Unicorns,” is yet another cheerleader for search. The article uses Aileen Lee’s “unicorn” concept: a company begun since 2003 and valued at over a billion dollars. (“Super unicorns” are companies valued at over a hundred billion dollars, like Facebook.) The article asks why no text analytics companies have joined this exclusive club. Candidates include Clarabridge, NetBase and Medallia.

“In the end, the answer is a very basic one. Contrast the text analytics sector with unicorns that include Uber — Travis Kalanick’s company — and Airbnb, Evernote, Flipkart, Square, Pinterest, and their ilk. They play to mass markets — they’re a magic mix of revenue, data, platform, and pizazz — in ways that text analytics doesn’t. The tech companies on the unicorn list — Cloudera, MongoDB, Pivotal — provide or support essential infrastructure that covers a broad set of needs.”

Before coming to this conclusion, the article posits other possible reasons as well, such as the sheer number of companies competing in the field, or even competition from massive companies like IBM and Google. But these are dismissed for the more optimistic end note that essentially suggests we give the text analytics unicorns a year. Caution advised.

Chelsea Kerwin, June 12, 2015

Lexalytics: GUI and Wizard

June 12, 2015

What is one way to improve a user’s software navigational experience? One of the best ways is to add a graphical user interface (GUI). Software Development @ IT Business Net shares a press release, “Lexalytics Unveils Industry’s First Wizard For Text Mining And Sentiment Analysis.” Lexalytics is one of the leading companies providing sentiment and analytics solutions, and, as the article’s title explains, it has achieved an industry first by releasing a GUI and wizard for its Semantria SaaS platform and Excel plug-in. The wizard and GUI (SWIZ) are part of the Semantria Online Configurator, SWEB 1.3, which also included functionality updates and layout changes.

“‘In order to get the most value out of text and sentiment analysis technologies, customers need to be able to tune the service to match their content and business needs,’ said Jeff Catlin, CEO, Lexalytics. ‘Just like Apple changed the game for consumers with its first Macintosh in 1984, making personal computing easy and fun through an innovative GUI, we want to improve the job of data analysts by making it just as fun, easy and intuitive with SWIZ.’”

Lexalytics is dedicated to helping its clients enjoy an easier experience when it comes to data analytics. The company wants its clients to get the answers they need by providing the tools to find them without having to overthink the retrieval process. While Lexalytics already provides robust and flexible solutions, the SWIZ release continues to prove it has the most tunable and configurable text mining technology.

Whitney Grace, June 12, 2015
