Real Time: Maybe, Maybe Not

March 1, 2016

Years ago an outfit in Europe wanted me to look at claims made by search and content processing vendors about real time functions.

The goslings and I rounded up the systems, pumped our test corpus through, and tried to figure out what was real time.

The general buzzy Teddy Bear notion of real time is that when new data are available to the system, the system processes the data and makes them available to other software processes and users.

The Teddy Bear view is:

  1. Zero latency
  2. Works reliably
  3. No big deal for modern infrastructure
  4. No engineering required
  5. Any user connected to the system has immediate access to reports including the new or changed data.

Well, guess what, Pilgrim?

We learned quickly that real time, like love and truth, is a darned slippery concept. Here’s one view of what we learned:

[Image: Types of Real Time Operations. © Stephen E Arnold, 2009]

The main point of the chart is that there are six types of real time search and content processing. When someone says, “Real time,” there are a number of questions to ask. The major finding of the study was that for near real time processing for a financial trading outfit, the cost soars into seven figures and may keep on rising as the volume of data to be processed goes up. The other big finding was that every real time system introduces latency. Seconds, minutes, hours, days, and weeks may pass before the update actually becomes available to other subsystems or to users. If you think you are looking at real time info, you may want to shoot us an email. We can help you figure out which type of “real time” your real time system is delivering. Write benkent2020 @ yahoo dot com and put Real Time in the subject line, gentle reader.
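For the curious, here is a rough Python sketch of how one might put a number on that latency. It assumes you can capture two timestamps for the same update: when the data reached the system and when a user could first see it. The function names and the sample timestamps are made up for illustration; they do not come from the study.

```python
from datetime import datetime, timedelta


def update_latency(arrived: datetime, visible: datetime) -> timedelta:
    """Time between data reaching the system and users being able to see it."""
    if visible < arrived:
        raise ValueError("visible time precedes arrival time")
    return visible - arrived


def describe(latency: timedelta) -> str:
    """Report the latency in the largest unit that fits, from weeks down to seconds."""
    seconds = latency.total_seconds()
    units = (("weeks", 604800), ("days", 86400), ("hours", 3600), ("minutes", 60))
    for name, size in units:
        if seconds >= size:
            return f"{seconds / size:.1f} {name}"
    return f"{seconds:.1f} seconds"


# Hypothetical timestamps for one update (illustration only).
arrived = datetime(2016, 3, 1, 9, 0, 0)
visible = datetime(2016, 3, 1, 11, 30, 0)
print(describe(update_latency(arrived, visible)))  # 2.5 hours
```

Run this against a sample of real updates and the gap between "real time" in the brochure and real time in the system tends to show up quickly.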

I thought about this research project when I read “Why the Search Console Reporting Is not real time: Explains Google!” As you work through the write up, you will see that the latency in the system is essentially part of the woodwork. The data one accesses is stale. Figuring out how stale is a fairly big job. The Alphabet Google thing is dealing with budgets, infrastructure costs, and a new chief financial officer.

Real time. Not now and not unless something magic happens to eliminate latencies, marketing baloney, and user misunderstanding of real time.

Excitement in non real time.

Stephen E Arnold, March 1, 2016

Natural Language Processing App Gains Increased Vector Precision

March 1, 2016

For us, concepts have meaning in relationship to other concepts, but it’s easy for computers to define concepts in terms of usage statistics. The post “Sense2vec with spaCy and Gensim” on spaCy’s blog offers a well-written outline of how natural language processing works and highlights the new Sense2vec app. The application is an upgraded version of word2vec that works with more context-sensitive word vectors. The article describes more precisely how Sense2vec works:

“The idea behind sense2vec is super simple. If the problem is that duck as in waterfowl and duck as in crouch are different concepts, the straight-forward solution is to just have two entries, duckN and duckV. We’ve wanted to try this for some time. So when Trask et al (2015) published a nice set of experiments showing that the idea worked well, we were easy to convince.

We follow Trask et al in adding part-of-speech tags and named entity labels to the tokens. Additionally, we merge named entities and base noun phrases into single tokens, so that they receive a single vector.”
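The idea in the quote can be sketched in a few lines of Python. This is a toy illustration, not spaCy’s code: the part-of-speech tags are supplied by hand rather than by a tagger, and the helper name sense_key and the “|” separator are assumptions made for the example.

```python
def sense_key(text: str, tag: str) -> str:
    """Build one vocabulary entry from a token (or merged phrase) and its tag."""
    return text.lower().replace(" ", "_") + "|" + tag


# Hand-tagged sentence: "The duck saw me duck" (tags are assumed, not predicted).
tagged = [("The", "DET"), ("duck", "NOUN"), ("saw", "VERB"),
          ("me", "PRON"), ("duck", "VERB")]

keys = [sense_key(text, tag) for text, tag in tagged]
print(keys)

# A multi-word noun phrase is merged into a single token, so it gets one entry.
print(sense_key("natural language processing", "NOUN"))
```

Because the two uses of “duck” now have different keys, a word2vec style trainer fed these keys would learn a separate vector for each.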

Curious about the meta definition of natural language processing from spaCy, we queried “natural language processing” using Sense2vec. Its neural network is based on every word posted on Reddit in 2015. While it is a feat for NLP to learn from a dataset on one platform, such as Reddit, what about processing that scours multiple data sources?

Megan Feil, March 1, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph