Update from Lucene

May 10, 2016

It has been a while since we heard about our old friend Apache Lucene, but the open source search engine has something new, says Open Source Connections in the article, “BM25 The Next Generation Of Lucene Relevance.” Lucene has added BM25 to its search software, and it just might improve search results.

“BM25 improves upon TF*IDF. BM25 stands for “Best Match 25”. Released in 1994, it’s the 25th iteration of tweaking the relevance computation. BM25 has its roots in probabilistic information retrieval. Probabilistic information retrieval is a fascinating field unto itself. Basically, it casts relevance as a probability problem. A relevance score, according to probabilistic information retrieval, ought to reflect the probability a user will consider the result relevant.”

Apache Lucene formerly relied on TF*IDF, a way to score how relevant a text match is to a user. It rests on two factors: term frequency, how often a term appears in a document, and inverse document frequency (IDF), how many documents the term appears in, which determines how “special” it is. BM25 improves on the old TF*IDF formula, but its classic IDF can produce negative scores for terms with very high document frequency. Lucene’s BM25 implementation solves this problem by adding a 1 inside the logarithm, making it impossible to deliver a negative value.
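A minimal sketch of the scoring idea (not Lucene’s actual Java implementation) may make the point concrete: the +1 inside the IDF logarithm keeps even very common terms non-negative, and the k1 and b parameters saturate term frequency instead of letting it grow without bound. The parameter defaults and toy numbers below are illustrative.

```python
import math

def bm25_idf(doc_freq, num_docs):
    """Lucene-style BM25 IDF: the +1 inside the log keeps the value
    non-negative even for terms found in most documents."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_term_score(tf, doc_freq, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Score one term in one document; term frequency saturates as tf grows,
    unlike plain TF*IDF where the contribution keeps climbing."""
    idf = bm25_idf(doc_freq, num_docs)
    tf_norm = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_norm

# A term appearing in 9,000 of 10,000 documents still gets a small positive IDF.
print(bm25_idf(doc_freq=9000, num_docs=10000))                      # ~0.105
print(bm25_term_score(tf=3, doc_freq=50, num_docs=10000,
                      doc_len=120, avg_doc_len=150))
```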

BM25 will have a big impact on Solr and Elasticsearch, improving search results and relevance accuracy through term frequency saturation.

 

Whitney Grace, May 10, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Machine Learning: 10 Numerical Recipes

April 8, 2016

The chatter about smart software is loud. I cannot hear the mixes on my Creamfields 2014 CD. Mozart, you are a goner.

If you want to cook up some smart algorithms to pick music or drive your autonomous vehicle without crashing into a passenger-carrying bus, navigate to “Top 10 Machine Learning Algorithms.”

The write up points out that just like pop music, there is a top 10 list. More important in my opinion is the concomitant observation that smart software may be based on a limited number of procedures. Hey, this stuff is taught in many universities. Go with what you know maybe?

What are the top 10? The write up asserts:

  1. Linear regression
  2. Logistic regression
  3. Linear discriminant analysis
  4. Classification and regression trees
  5. Naive Bayes
  6. K nearest neighbors
  7. Learning vector quantization
  8. Support vector machines
  9. Bagged decision trees and random forest
  10. Boosting and AdaBoost.

The article tosses in a bonus too: Gradient descent.
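The bonus is the workhorse that trains several of the ten, so a minimal sketch may help; the toy data, learning rate, and iteration count below are invented for illustration, not taken from the write up.

```python
# Batch gradient descent fitting y ≈ w*x + b by minimizing mean squared error.
# Toy data, learning rate, and iteration count are invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 8.1, 9.9]   # roughly y = 2x

w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w                 # step downhill along each gradient
    b -= lr * grad_b

print(round(w, 2), round(b, 2))      # ~1.97 and ~0.15, the least-squares fit
```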

What is interesting is that there is considerable overlap with the list I developed for my lecture on manipulating content processing using shaped or weaponized text strings. How’s that, Ms. Null?

The point is that when systems use the same basic methods, are those systems sufficiently different? If so, in what ways? How are systems using standard procedures configured? What if those configurations or “settings” are incorrect?

Exciting.

Stephen E Arnold, April 8, 2016

Google Maps: Accuracy Is Relative

April 1, 2016

Editorial comment: Not an April Fools’ bit of spoofery.

I read “Demolition Company Says a Google Maps Error Led Them to Tear Down the Wrong House.” Pesky humans. Google’s autonomous automobiles do not have accidents; when accidents do occur, a human is at fault. Bus drivers, grrrr.

The write up suggests that a Google Maps error led a human to instruct demolition workers to level a house at 7601 Cousteau Drive. The human allegedly pointed a finger at Google Maps, a geospatial system with some big fans in various governmental outfits.

Here’s what the story asserts:

Google Maps has declined to make a statement, but it did fix the map to pin the correct address.

Yep, what does a “real” journalist expect when asking about one of Google’s algorithmic services?

Life would be simpler if humans were not getting in the way of Google efficiency, solutions, and services.

Thought: Perhaps one should not use an Android phone and Google Maps to navigate to the edge of the Grand Canyon.

Stephen E Arnold, April 1, 2016

Netflix Algorithm Defaults To “White” Content, Sweeps Diversity Under the Rug

April 1, 2016

The Marie Claire article titled Blackflix: How Netflix’s Algorithm Exposes Technology’s Racial Bias delves into the racial ramifications of Netflix’s much-lauded content recommendation algorithm. Many users may have had strange realizations about themselves or their preferences after colliding with a system the article calls “uncannily spot-on.” To sum it up: Netflix is really good at showing us what we want to watch, but only based on what we have already watched. When it comes to race, sexuality, even feminism (how many movies have I watched in the category “Movies With a Strong Female Lead”?), Netflix stays on course by only showing you films as diverse, or not, as the ones you have already selected. The article states,

“Or perhaps I could see the underlying problem, not in what we’re being shown, but in what we’re not being shown. I could see the fact that it’s not until you express specific interest in “black” content that you see how much of it Netflix has to offer. I could see the fact that to the new viewer, whose preferences aren’t yet logged and tracked by Netflix’s algorithm, “black” movies and shows are, for the most part, hidden from view.”

This sort of “default” suggests quite a lot about what Netflix has decided to put forward as normal or inoffensive content. To be fair, the service does stress the importance of logging preferences from the initial sign-up, but there is something annoying about the idea that some viewers can live in a bubble of straight, white (or black-and-white) content. Among them are people who might really enjoy and appreciate a powerful and relevant film like Fruitvale Station. If it wants to stay current, Netflix needs to show more appreciation of, or even awareness of, its technical bias.
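Netflix’s actual recommender is proprietary, but the feedback loop the article describes can be illustrated with a toy item-similarity sketch: unwatched titles are ranked purely by how much they resemble a viewer’s history, so nothing outside that neighborhood surfaces on its own. The catalog, tags, and similarity measure below are invented for illustration.

```python
# Toy item-based recommendation: suggest titles similar to what a viewer has
# already watched. Titles, tags, and scoring are made up for illustration.
catalog = {
    "Title A": {"drama", "biopic"},
    "Title B": {"drama", "strong-female-lead"},
    "Title C": {"comedy", "romance"},
    "Title D": {"documentary", "music"},
}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(watched, top_n=2):
    """Rank unwatched titles by average similarity to the watch history.
    Nothing outside the neighborhood of past choices ever surfaces."""
    scores = {}
    for title, tags in catalog.items():
        if title in watched:
            continue
        sims = [jaccard(tags, catalog[w]) for w in watched]
        scores[title] = sum(sims) / len(sims)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend({"Title B"}))   # drama-adjacent titles rank first
```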

Chelsea Kerwin, April 1, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

 

Stanford Offers Course Overviewing Roots of the Google Algorithm

March 23, 2016

The course syllabus for Stanford’s computer science class CS 349: Data Mining, Search, and the World Wide Web, posted on Stanford.edu, provides an overview of some of the technologies and advances that led to Google search. The syllabus states,

“There has been a close collaboration between the Data Mining Group (MIDAS) and the Digital Libraries Group at Stanford in the area of Web research. It has culminated in the WebBase project whose aims are to maintain a local copy of the World Wide Web (or at least a substantial portion thereof) and to use it as a research tool for information retrieval, data mining, and other applications. This has led to the development of the PageRank algorithm, the Google search engine…”

The syllabus alone offers some extremely useful insights that could help students and laypeople understand the roots of Google search. Key inclusions are the Digital Equipment Corporation (DEC) and PageRank, the algorithm named for Larry Page that enabled Google to become Google. The algorithm ranks web pages based on how many other pages link to them, and on the rank of those linking pages. Jon Kleinberg also played a key role by recognizing that pages which link out to many resources (like a search engine or directory) should also be treated as important. The larger context of the course is data mining and information retrieval.
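For readers who want to see the core idea in miniature, here is an illustrative power-iteration sketch of PageRank; the four-page link graph, damping factor, and iteration count are textbook-style assumptions, not material from the Stanford syllabus.

```python
# Minimal PageRank by power iteration; the link graph is made up for illustration.
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the set of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                      # dangling page: spread evenly
                share = damping * rank[page] / len(pages)
                for p in pages:
                    new_rank[p] += share
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

links = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}, "D": {"C"}}
print(sorted(pagerank(links).items(), key=lambda kv: -kv[1]))
# C, the most linked-to page, comes out on top
```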

 

Chelsea Kerwin, March 23, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

 

A Dead Startup Tally Sheet

March 17, 2016

Startup is the buzzword for a tech company that is just getting off the ground, usually with an innovative idea that garners several million dollars in investments.  Some startups are successful, others plod along, and many simply fail.  CB Insights makes an interesting (and valid) comparison between today’s tech startups and the dot-com bust, when companies fizzled out quicker than a faulty firecracker.

While most startups appear to be run by competent teams, sometimes they fizzle out or are acquired by a larger company.  Many of them will not make it as headlining companies.  As a result, CB Insights created “The Downround Tracker: Which Companies Are Not Living Up To The Expectations?”

CB Insights named this tech boom the “unicorn era,” probably for the rare and mythical status of some of these companies.  The Downround Tracker follows unicorn-era startups that have folded or been purchased.  Since 2015, fifty-six companies have made the Downround Tracker list, including LiveScribe, Fab.com, Yodle, Escrow.com, eMusic, Adesto Technologies, and others.

Browse through the list and some of the names will be familiar, while others will make you wonder what the companies did in the first place.  Companies come and go more quickly than in any previous generation.  At least it shows that human ingenuity is still at work; cue Kansas’s “Dust in the Wind.”

 

Whitney Grace, March 17, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

 

Bitcoin Textbook to Become Available from Princeton

March 16, 2016

Bitcoin is all over the media, but this form of currency may not be thoroughly understood by many, including researchers and scholars. A post on this topic, The Princeton Bitcoin textbook is now freely available, was recently published on Freedom to Tinker, a blog hosted by Princeton’s Center for Information Technology Policy. The post announces the first complete draft of a Princeton Bitcoin textbook. At 300 pages, the manuscript is geared to those who hope to gain a technical understanding of how Bitcoin works and is appropriate for readers with a basic grasp of computer science and programming. According to the write-up,

“Researchers and advanced students will find the book useful as well — starting around Chapter 5, most chapters have novel intellectual contributions. Princeton University Press is publishing the official, peer-reviewed, polished, and professionally done version of this book. It will be out this summer. If you’d like to be notified when it comes out, you should sign up here. Several courses have already used an earlier draft of the book in their classes, including Stanford’s CS 251. If you’re an instructor looking to use the book in your class, we welcome you to contact us, and we’d be happy to share additional teaching materials with you.”

As Bitcoin educational resources catch fire in academia, it is only a matter of time before other Bitcoin experts begin creating resources to help other audiences understand the currency of the Dark Web. Additionally, it will be interesting to see whether research emerges on the connections between Bitcoin, the Dark Web, and the mainstream Internet.

 

Megan Feil, March 16, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

 

Google Now Has Dowsing Ability

March 16, 2016

People who claim to be psychic are fakes.  There is no way to predict the future, instantly locate a lost person or item, or read someone’s aura, and no scientific study has shown otherwise.  One of the abilities psychics purport to have is “dowsing,” the power to sense where water, precious stones or metals, and even people are hidden.  Instead of relying on a suspended crystal or a forked stick, Google now claims it can identify the location of almost any image, says Technology Review in the article, “Google Unveils Neural Network With ‘Superhuman’ Ability To Determine The Location Of Almost Any Image.”

Using computer algorithms rather than magic powers, Tobias Weyand and a team of tech-savvy colleagues developed a way for a Google deep-learning machine to identify where a picture was taken.  Weyand and his team built the tool, called PlaNet, by dividing the world into a grid of roughly 26,000 cells (oceans and the poles excluded), with cell sizes varying according to how densely populated an area is.

“Next, the team created a database of geolocated images from the Web and used the location data to determine the grid square in which each image was taken. This data set is huge, consisting of 126 million images along with their accompanying Exif location data.

Weyand and co used 91 million of these images to teach a powerful neural network to work out the grid location using only the image itself. Their idea is to input an image into this neural net and get as the output a particular grid location or a set of likely candidates.”

With the remaining 34 million images in the data set, the team tested PlaNet’s accuracy.  PlaNet correctly places 3.6 percent of images at street level, 10.1 percent at city level, 28.4 percent at country level, and 48 percent at continent level.  These results are impressive compared to the limited geographic knowledge any one human carries around.
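The heart of the approach is recasting geolocation as classification: each geotagged training photo is labeled with the grid cell containing its latitude and longitude, and the network learns to output a probability distribution over cells. The sketch below uses a crude fixed ten-degree grid as a stand-in for PlaNet’s adaptive grid of roughly 26,000 cells; the cell size and sample coordinates are assumptions for illustration.

```python
# Illustration of PlaNet's framing: turn geolocation into classification by
# mapping a latitude/longitude onto a discrete grid cell. The real system uses
# an adaptive grid of ~26,000 cells; this fixed 10-degree grid is a stand-in.
CELL_DEG = 10  # cell size in degrees; made up for illustration

def cell_id(lat, lon):
    """Return the grid-cell label a training image at (lat, lon) would get."""
    row = int((lat + 90) // CELL_DEG)
    col = int((lon + 180) // CELL_DEG)
    return row * (360 // CELL_DEG) + col

# Each geotagged photo becomes (image, cell_id); the network then predicts a
# probability distribution over cells for an unseen image.
print(cell_id(48.8566, 2.3522))    # Paris
print(cell_id(40.7128, -74.0060))  # New York lands in a different cell
```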

Weyand believes PlaNet can determine location because it has learned to recognize subtle patterns about places that humans cannot distinguish, having arguably “seen” more of the world than any human.  What is even more remarkable is how little memory PlaNet uses: only 377 MB.

When will PlaNet become available as a GPS app?

 

Whitney Grace, March 16, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

 

Natural Language Processing App Gains Increased Vector Precision

March 1, 2016

For us, concepts have meaning in relationship to other concepts, but it is easy for computers to define concepts only in terms of usage statistics. The post Sense2vec with spaCy and Gensim on spaCy’s blog offers a well-written outline of how this kind of natural language processing works, highlighting the new sense2vec app. The application is an upgraded version of word2vec that works with more context-sensitive word vectors. The article describes how sense2vec works more precisely,

“The idea behind sense2vec is super simple. If the problem is that duck as in waterfowl and duck as in crouch are different concepts, the straight-forward solution is to just have two entries, duckN and duckV. We’ve wanted to try this for some time. So when Trask et al (2015) published a nice set of experiments showing that the idea worked well, we were easy to convince.

We follow Trask et al in adding part-of-speech tags and named entity labels to the tokens. Additionally, we merge named entities and base noun phrases into single tokens, so that they receive a single vector.”

Curious about the meta definition of natural language processing from spaCy, we queried “natural language processing” using sense2vec. Its neural network is based on every word posted on Reddit in 2015. While it is a feat for NLP to learn from a dataset drawn from one platform, such as Reddit, what about processing that scours multiple data sources?
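A rough sketch of the tagging trick described in the quote above, not the spaCy team’s actual pipeline: append each token’s part-of-speech label before training word2vec so that, for example, the noun and verb senses of “duck” get separate vectors. The tiny corpus, the model parameters, and the en_core_web_sm model name are assumptions for illustration (gensim 4.x parameter names).

```python
# Sketch of the sense2vec idea: tag tokens with part of speech before training
# word2vec so each sense gets its own vector. Corpus and parameters are toy
# values for illustration; requires spaCy's en_core_web_sm model and gensim 4.x.
import spacy
from gensim.models import Word2Vec

nlp = spacy.load("en_core_web_sm")

def tagged_tokens(text):
    """Turn 'The duck swam' into tokens like 'the|DET', 'duck|NOUN', 'swam|VERB'."""
    return [f"{tok.lower_}|{tok.pos_}" for tok in nlp(text) if not tok.is_space]

corpus = [
    "The duck swam across the pond.",
    "You should duck before the branch hits you.",
]
sentences = [tagged_tokens(text) for text in corpus]

# Tiny vectors and window because the corpus is only two sentences.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1)
print(sorted(model.wv.key_to_index))  # expect separate entries such as
                                      # 'duck|NOUN' and 'duck|VERB'
```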

 

Megan Feil, March 1, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

 

Squiz and Verint Team up to Save the Government from Itself

February 9, 2016

The article titled Verint and Squiz Announce Partnership to Further Enable Digital Transformation for Government on BusinessWire conveys the global ambitions of the two companies. The article positions Verint, an intel-centric company, and Squiz, an Australian content management company, as the last hope for the world’s governments (at the local, regional, and national levels). While things may not be quite so dire, the partnership is aimed at improving governmental organization, digital management, and customer engagement. The article explains,

“Today, national, regional and local governments across the world are implementing digital transformation strategies, reflecting the need to proactively help deliver citizen services and develop smarter cities. A key focus of such strategies is to help make government services accessible and provide support to their citizens and businesses when needed. This shift to digital is more responsive to citizen and community needs, typically reducing phone or contact center call volumes, and helps government organizations identify monetary savings.”

It will come as no surprise to learn that government bureaucracy is causing obstacles when it comes to updating IT processes. Together, Squiz and Verint hope to aid officials in implementing streamlined, modernized procedures and IT systems while focusing on customer-facing features and ensuring intuitive, user-friendly interfaces. Verint in particular emphasizes superior engagement practices through its Verint Engagement Management service.

 

Chelsea Kerwin, February 9, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

 
