Borchardt Projects: March 2008

Friday, March 21, 2008

eTech - Game Plan

Slides to a great talk about energy use: http://wattzon.org/GamePlan_v1.0_slides.pdf

[updated link]
GamePlan_v1.0

Thursday, March 6, 2008

eTech - 2008 Presentation Slides

O’Reilly Emerging Technologies 2008 Presentation Slides can be found at: http://en.oreilly.com/et2008/public/schedule/proceedings/

-80/20 principle:
First, try to identify the 20% of the things you do during the week that take 80% of your time. For the web, you can use tools like rescueTime.com
From this list, make a NOT to do list, and attempt to not do those things.
During the day, once an hour or two, ask yourself, "Am I busy? Am I being bothered? Am I creating work to prevent me from doing what I should be doing?"
Second, try to identify the 20% of measurements that track 80% of your success.
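
As a rough sketch of the first step, the snippet below takes a hypothetical weekly time log and lists the activities that together account for about 80% of the time. The activity names, the hours, and the function name `top_time_sinks` are made up for illustration; they are not from the talk.

```python
def top_time_sinks(hours_by_activity, share=0.8):
    """Return the largest activities that together cover `share` of total time."""
    total = sum(hours_by_activity.values())
    picked, running = [], 0.0
    # Walk activities from most to least time-consuming.
    ranked = sorted(hours_by_activity.items(), key=lambda kv: kv[1], reverse=True)
    for name, hours in ranked:
        if running >= share * total:
            break
        picked.append(name)
        running += hours
    return picked


# Hypothetical week of logged hours.
week = {"email": 15, "meetings": 12, "coding": 8, "news sites": 3, "planning": 2}
print(top_time_sinks(week))
```

The returned list is the raw material for the NOT-to-do list: anything on it that does not move you forward is a candidate to drop.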

-Attention dependency on time:
Constant partial attention can actually cause ADHD. Don't pay attention to more than one thing at a time. Set up a firewall so you can focus on a single task. Don't think about things that you cannot affect today. Don't read work email while on vacation. When trying to work, think about work; when trying to relax, think about things that you like.

-Error based performance:
We don’t like to fail at anything, but it’s ok.
What is the one thing that you could do that would change everything for the better?
What are the small tasks that seem like they have to be done right now?
Would it be ok to fail any of the small tasks to get closer to the one big thing?
It’s ok to fail the little stuff, as long as you move toward the big stuff.

More can be found at: fourhourworkweek.com

eTech - Ensemble Learning, Better predictions through diversity

Presentation by Todd Holloway

Ensemble learning is the process of using multiple supervised learning models to make a prediction. The talk argues that combining multiple types of predictive models produces statistically better results.

Relating movies based on user recommendations does not work well for relatedness because we don't have enough data on all movies.

Netflix prize: 17000 sample movies, millions of sample ratings. One million dollar prize for a ten percent improvement on current Netflix model.

Using multiple models decreases error as long as they are independent decision makers.
To get independence and diversity, we use different relatedness measures for each model.
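
To see why independence matters, here is a small sketch that computes how often a majority vote is right when each model is correct with the same probability and the errors are fully independent. Both of those conditions are simplifying assumptions (real models are only partly independent), and the function name is mine, not the speaker's.

```python
from math import comb


def majority_accuracy(n_models, p_correct):
    """Probability that a strict majority of n independent models is right.

    Assumes every model is correct with the same probability and that
    their errors are independent, which real models only approximate.
    """
    needed = n_models // 2 + 1  # votes required for a strict majority
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


for n in (1, 5, 25):
    print(n, round(majority_accuracy(n, 0.6), 3))
```

With models that are each only a little better than chance, the vote gets more reliable as models are added. If the models make the same mistakes, the gain disappears.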

This adds complexity but gives better results, which is a violation of Ockham's Razor.

AdaBoost is the process of training a classifier, testing it, taking the incorrect results, and using them to train a new classifier. Unfortunately, this emphasizes noise.
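
Below is a minimal sketch of the idea on one-dimensional data, using simple threshold classifiers ("stumps"). Note that in the standard formulation the misclassified examples are given more weight in the next round rather than being split off into a separate training set. The data, function names, and the choice of stumps are assumptions for illustration, not something shown in the talk.

```python
import math


def best_stump(xs, ys, weights):
    """Find the threshold/polarity stump with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, rounds=3):
    """Train weighted stumps; labels in ys must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # vote strength of this stump
        model.append((alpha, threshold, polarity))
        # Raise the weight of misclassified points, lower the rest.
        for i, x in enumerate(xs):
            pred = polarity if x >= threshold else -polarity
            weights[i] *= math.exp(-alpha * ys[i] * pred)
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, x):
    """Weighted vote of all stumps."""
    score = sum(a * (p if x >= t else -p) for a, t, p in model)
    return 1 if score >= 0 else -1


# No single threshold separates these labels, but three stumps together do.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = adaboost(xs, ys, rounds=3)
print([predict(model, x) for x in xs])
```

The noise problem is visible in the weight update: a mislabeled point keeps being misclassified, so its weight keeps growing and later stumps bend to fit it.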

www.abeautifulwww.com for slides and more info.

eTech - CouchDB from 1000ft

Presentation by Damien Katz

Website

CouchDB is an EASY database. A simple way to store data.

When designing a relational database, you are designing a large data structure. With CouchDB, you are just storing data.

Documents are complete units of data that are not broken up. Example: a business card. This means that documents may be out of date. In a relational db, this is the worst thing that can happen, but in the real world, we deal with it all the time.

CouchDB is in JSON, similar to XML, but easy to read and write.
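
To make the "document" idea concrete, here is a sketch of what a business card might look like as a single JSON document. The field names and values are invented for illustration; only the general shape (one self-contained unit, no joins) comes from the talk.

```python
import json

# Hypothetical business-card document; field names are illustrative only.
card = {
    "_id": "card-001",
    "type": "business_card",
    "name": "Jane Example",
    "company": "Example Corp",
    "phones": [{"kind": "work", "number": "555-0100"}],
    "tags": ["conference", "etech"],
}

text = json.dumps(card, indent=2)   # the text form that would be stored or sent
restored = json.loads(text)         # parsing gives back the same structure
print(text)
```

Everything about the card lives in one place, which is why a copy of it can go stale without breaking anything else.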

CouchDB is supported by all major languages and does not require a data access layer.
Communication is over an HTTP API.

Indexes are built incrementally with map reduce over the tags.
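
The following is a plain Python illustration of the map/reduce idea, not CouchDB's own view syntax. The map step emits a pair per tag, the reduce step sums them, and an incremental update only has to map the document that changed. The document shape and function names are assumptions.

```python
from collections import Counter


def map_doc(doc):
    """Emit one (tag, 1) pair per tag in a document."""
    return [(tag, 1) for tag in doc.get("tags", [])]


def reduce_counts(pairs):
    """Sum the emitted values for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


docs = [
    {"_id": "a", "tags": ["etech", "db"]},
    {"_id": "b", "tags": ["db"]},
]
index = reduce_counts(pair for doc in docs for pair in map_doc(doc))

# Incremental update: only the new document is mapped, then merged in.
new_doc = {"_id": "c", "tags": ["etech", "erlang"]}
index.update(reduce_counts(map_doc(new_doc)))
print(dict(index))
```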

Data is replicated across machines via peer based replication.
Conflicts are taken care of by the db: a winning and a losing change are chosen consistently on all machines.
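
One way to get the same winner on every machine without any coordination is to pick it with a pure function of the conflicting revisions. The sketch below does this with an edit count and a content hash as tiebreak. This is an illustration of the principle under my own assumptions (the `edits` and `body` fields are hypothetical), not a description of CouchDB's actual rule.

```python
import hashlib
import json


def revision_key(doc):
    """Deterministic sort key: edit count first, then a content hash as tiebreak."""
    body = json.dumps(doc["body"], sort_keys=True).encode("utf-8")
    return (doc["edits"], hashlib.md5(body).hexdigest())


def pick_winner(conflicting):
    """Every replica calling this on the same revisions gets the same winner."""
    ordered = sorted(conflicting, key=revision_key, reverse=True)
    return ordered[0], ordered[1:]


rev_a = {"edits": 3, "body": {"phone": "555-0100"}}
rev_b = {"edits": 3, "body": {"phone": "555-0199"}}

winner_1, losers_1 = pick_winner([rev_a, rev_b])   # replica 1 saw a first
winner_2, losers_2 = pick_winner([rev_b, rev_a])   # replica 2 saw b first
print(winner_1 == winner_2)
```

The losing revision is returned rather than discarded, so an application could still inspect it later.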

If you are interested in distributed programming, look into 'Erlang'. CouchDB is written in Erlang because it makes distributed programming a snap.

Reads and writes can happen at the same time; the reader will get the older value.

20,000 concurrent users running on a laptop. This works because Erlang does not use OS processes; it uses its own lightweight processes.

Comes with Lucene integration for full-text search. But any search tech could be used.

To access the db, you write javascript/ajax.

Largest replication so far: 5 GB, 400,000 documents.

Not yet ready for use in production. But people ARE using it.

If a relational db model is what you need, then use that...

Wednesday, March 5, 2008

eTech - Visualizations beyond RSS and LavaLamps (Tripledex)

You NEED to have a question to have a good visualization.
Even a Tarot reader will demand a question.

You can make visualizations that show everything, but these usually end up being used esthetically. This contradicts a speaker earlier in the conference, so we will need to think about the context in which each was speaking.

Get answers to these questions before making visualizations:
Do you want broad or deep, trends or targeting?
How much data do you want/need?
Static or dynamic? Alerts? About what?
Speed of data. Pipette or firehose?
Lumpiness of data
Dimensionality of data
Do you need to combine relationships?

The tools you use to show data or filter data are not interchangeable.
Just like tools around the house, you would not use a lawnmower to cut bread, you would use a knife.

They are showing a technology called Tripledex that can manage around 100,000 relationships.
Email jnhq@yahoo.com to get access to the demo

eTech - How to Kick Ass

Passionate users are passionate about what they are good at.

If companies provide the ability for users to be good at our program, they will be passionate about it.

Neurogenesis...your brain can keep changing.
Dull cubicle kills the brain.
World class ability is not about talent, it is about putting in the time.
You need a 'rage to master'

6 Expertise Hacks

-1 Exploit your telepathy ... mirror neurons
Our brain can learn just from watching actions
We jump out of the way when we see someone else get hurt
We can feel emotion from seeing a face
We can simulate another person's brain when we watch them
This is more effective if we have DONE what we are watching
This is the science behind visualizing to get better
It works better if you visualize what they would actually see, rather than 3rd person

-2 Reduce interference
Don't think about what we have to do
Tell dumber part of the brain to shut up
Doing with Images makes symbols

-3 Manage your fight/flight
Get a 'StressEraser' (StressEraser.com, Amazon)

-4 Learn about your brain
Legacy brain is trying to stop you
It says ruby is not important to life/death

-5 Exercise your
Brain age is ok, but REAL exercise is more important

-6 Find the time
The twitter curve is messing us up.
We need to find time to practice what we want to be good at, don’t waste time with stuff you don’t think is important.

eTech - Elephant programs

Elephant is a glimpse at how programming languages will behave in the future. The more we know about where we are going, the better prepared for it we will be.

Elephant programs are faithful 100 percent.

They never forget
"Passenger has a reservation - compiler makes the db or array"
"Does the passenger have a reservation?"

They interact with other persons
"You have a reservation on flight UA 522 today at 7:35 pm"
This speech act, if authorized, creates an obligation

Features-
Communication inputs and outputs are meaningful speech acts.
A promise will be expressed by a string of symbols, but its meaning (its semantics) is a promise, not a string.
Correctness of a program is partially defined in terms of performance of speech acts.

We can look at programs to have beliefs. A thermostat can believe it is too cold, too hot, or OK. It does not have a consciousness, but it does have beliefs.

Programs can be represented as sentences of logic.

They interact non-trivially with the outside world. They have input/output AND accomplishment specifications.

A program can have internal promises.

ABRUPT END
The speaker was cut off, we will need to look into Elephant on our own.

Tuesday, March 4, 2008

eTech - Predicting Markets and the Flow of Information (Google)

Prediction market presentation by Bo Cowgill from Google

One way for management to get data about the company is to ask its employees their opinion. Problem is, when they ask, we tend to tell them what they want to hear.
If you let people predict anonymously, they will tell the truth, because there is no down side, and they might win something. (A tee-shirt or their name on a leader board)

Google has prediction markets rolling with over 80,000 trades currently.
They have found that optimism bias among new employees makes betting against Google the safer bet.
Most people bet in the middle of five options, and the middle often lost. It would have been safer to bet high or low.

Information flows around a company.
Knowing data about employees, and their trading habits can tell us who was talking to whom.
They found that the biggest factor was where employees sit.

This is ironic, because Google is trying to make non-local communication easier.

Software like NewsFutures can be used for creating the prediction market.

eTech - Next generation of online gaming (Sun Microsystems DARKSTAR)

One of the issues for companies that deal with many users and lots of data is that the DB and DB access machines need to be built out for max users.

Blizzard went from a game company to a service company.
Webkinz is a toy company becoming a social game company.
Other examples: Club Penguin, Habbo Hotel

Scaling is the major issue. It’s hard.

Currently games break up the world to simplify the problem. This is bad because it makes for a constrained game design. Further, this causes a waste of server time and energy. Not all groups are playing at peak. Further, it is complexity the user does not understand. They just want to play with their friends.

If you shard incorrectly, you have empty servers or overloaded servers.

$30-$90 million to make an online game, over 3-4 years.

Sun wants to make architecture to help small companies make great games without the need to shard with 'Project Darkstar'. More info http://www.projectdarkstar.com/

This is an application server for games. OS agnostic, game agnostic. But it would work for non game applications that require lots of data transactions, and might give us a glimpse into how data access will be in the future.
The technology is under the GPL, but Sun will make commercial licenses if asked.

Games are multiplayer, but characters are actually agnostic. Let’s distribute their actions without the developers needing to know about it.

Games are event driven, and have small tasks per action. What is hard is if two characters have contention for the same resource. All tasks are transactional. Use this to find conflicts and allow one character to get the resource while the other does not. They are moving competing characters to the same server to optimize transaction scheduling.

All data and communications will go through Darkstar, abstracting the developer and client away from the data access.

The program can think of this as a single thread and a single machine.

If a box goes down, play continues. The players are not connected to a box; they are connected to a channel.

This also allows you to use the same machines for MULTIPLE concurrent games.

Go here for more info.

eTech - Practice Makes Perfect

Presentation from Peter Norvig from Google
How billions of examples lead to better models of images and text

How things are traditionally figured out.
Look at world, think about data, and figure out a model to express the world.
Problem is that this is hard and the model will be wrong.

Instead, let the data do the work.
Computing power is making it possible to make more complex algorithms because we can easily test bad algorithms to find the good ones. Example- image resizing

More data is also allowing this. Example- scene completion

For finding similar images, do a search based on keyword, see the images, and use an algorithm to find similarities in the photos. Use Eigenface and SIFT features to find commonalities in images. Then rank the found images by what links to what, not on how often they are linked.

For text, grep the data to find words that are in proximity, or look in structured data, and use probabilistic models to guess the most probable answer. Example- Google Sets

Engineers later dropped the probabilistic model in favor of a linear model. They have moved away from something they can prove, and into something they can observe working.
They have optimized for translating news.

Bayesian: want argmax_c P(c|w), but model argmax_c P(w|c) P(c)
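
The two forms pick the same c because, by Bayes' rule, P(c|w) = P(w|c) P(c) / P(w), and P(w) is the same for every candidate c. A tiny sketch of that argmax follows, reading w as an observed (possibly misspelled) word and c as a candidate correction, as in the spell checker example. The probabilities are invented toy numbers and the function name is an assumption.

```python
def best_correction(word, candidates, prior, likelihood):
    """Return the candidate c that maximizes P(word | c) * P(c)."""
    return max(candidates, key=lambda c: likelihood[(word, c)] * prior[c])


# Toy numbers, invented for illustration only.
prior = {"the": 0.05, "thaw": 0.0001}                          # P(c)
likelihood = {("thew", "the"): 0.01, ("thew", "thaw"): 0.02}   # P(w | c)

print(best_correction("thew", ["the", "thaw"], prior, likelihood))
```

Here "thaw" is the more likely source of the typo on its own, but "the" is so much more common that it wins once the prior is included.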

see: How to build a spell checker

Sun SPOT Java Development Kit

This is neat...

Sun SPOT

It is a small computer with:
- a radio transmitter
- 2 accelerometers
- 6 inputs
- 5 outputs
- a battery power pack
- a java micro-controller
- a light sensor
- a temp sensor
- an 8-LED display

You can use it to do almost anything. Here are a few obvious examples:
- Control a Lego car wirelessly
- Control an on screen avatar wirelessly
- Track your day to day movements
- Turn on your house lights/heat/music by voice or time
- Alter a web-cam via the web

If you have a good open source project, they will give you some for free!
If you are a student you get 2 for $300, else they are 2 for $700.

Monday, March 3, 2008

Prediction theory

If a boss asks a manager when a product will be ready to release, the manager will say 2 days... because s/he does not want to look bad.

If, instead, the boss says, hey, whoever predicts the day closest to the day the product is ready to release will win $500.00 (or whatever), the manager will guess 2 weeks.

In the latter example, the boss gets the more realistic answer because the manager wants to win and does not feel that there is an expected answer.

eTech - Debugging Hacks 2

Good habits for debugging:
- Keep a log of everything you observe/change
- When you get an error message, Google it
- Look on forums that Google does not index
- Graph data over time, make a truth table / chart to visualize the problem
- Log the server AND client time in web logs
- Look in bug db for similar bugs
- After having a repro case, if you are still having issues fixing the bug, try to find ANOTHER repro case
- Reread and update the bug every day you work on it
- Take baby steps, "If you cannot see land, can you see birds?"
- The worse the bug, the more logging you need
- Get on a mailing list / news group
- Recheck assumptions
- Go back to code that works, and start taking diffs
- If needed, use binary search debugging (Spolsky)
- Explain the problem to someone else
- Get more eyes for the problem, present to a group
- Check whether there has been a hardware change, or a simple change to the environment
- Go home if you are done for the day (death march != fixed bugs)
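
The binary search item in the list above can be sketched as follows. It assumes you have an ordered list of revisions, that the oldest one is known good and the newest known bad, and that the bug stays present once it appears. The revision names and the `is_bad` check are hypothetical stand-ins for "check out this revision and run the repro case".

```python
def first_bad(revisions, is_bad):
    """Binary search for the first revision where is_bad(rev) is True.

    Assumes revisions[0] is good, revisions[-1] is bad, and that once the
    bug appears it stays in every later revision.
    """
    lo, hi = 0, len(revisions) - 1   # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid
        else:
            lo = mid
    return revisions[hi]


# Hypothetical history: the bug was introduced in r106.
history = [f"r{n}" for n in range(100, 111)]
print(first_bad(history, lambda rev: int(rev[1:]) >= 106))
```

Each test halves the remaining range, so even a thousand revisions need only about ten repro runs.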

eTech - Debugging Hacks

This seems obvious, but I'll write it down...

For really bad bugs:
0- Try to fix the bug quickly
1- Revert any changes you made trying to fix the bug quickly
2- Collect data from each component, logs, etc...
3- Reproduce the bug and automate it
4- Simplify the bug conditions when possible
5- Look for connections and coincidences in the data
6- Brainstorm theories and test them
7- When you fix the bug, verify against the report
8- Make sure fix does not break other code

Bug tracking notes:
Break data for a bug into three categories, and log them correctly.
1- Facts
2- Questions
3- Theories that turned out to be wrong

eTech - Live Vast and Deep WebNative Visualizations

Processing is a subset of Java found at processing.org.
It can be used to quickly make data visualizations. It is not as powerful as Java, but it has been whittled down to the parts of Java most useful for visualization, to make prototyping faster and more dynamic. To find lots of source code, search for ‘built with processing’.

Design starts on paper; this is obvious to my designer friends. Well, visualization starts in the eye, or the mind’s eye. We need to begin data visualization by imagining what we dare to see.

The first step in web visualization is to show everything. Don’t waste time trying to figure out the best thing to show, or how to show it. If you can map it to the screen, do it. Get the data to the user (maybe YOU) quickly, then iterate on the visualization until you find what you are looking for.

Displaying data on the screen is cheap and easy. Analyzing the data is expensive and hard. Begin by showing everything, and then use humans to analyze. Humans are better at this when it comes to quick pattern recognition.

Showing everything also limits bias that is naturally going to creep in as you filter down.

Good examples of showing everything are: zip decoder, cab spotting, and many of the digg visualizations.

After you have everything showing, filter down the data, and look for meaningful relationships. Too often developers leave the visualization in the initial form, which makes it hard to use or understand for the user. Simple is often better.

Other interesting visualizations: Oakland crime spotting, digg arc, mappr, trulia.

When displaying large amounts of data, think about representing the data as multiple slices in pixel format. Literally make an image that depicts the data ‘under’ that pixel. These masks can then be used in conjunction with each other to form rich data visualizations. Example: construct a black and white heat map of population over the area you are interested in, also construct a black and white heat map of income over the same region. With these two images, and a map, you can programmatically decide how to display the data at a pixel by looking at each image and querying for the gray value at the point. This has vast possibilities.
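
A minimal sketch of the per-pixel lookup, using nested lists as stand-ins for two grayscale images of the same size. The gray values, the thresholds, and the function name are all made up; a real version would read the images with an imaging library.

```python
def combine_masks(population, income, pop_min=128, income_min=128):
    """Mark pixels where both gray values (0-255) meet their thresholds."""
    return [
        [1 if p >= pop_min and i >= income_min else 0 for p, i in zip(prow, irow)]
        for prow, irow in zip(population, income)
    ]


# Tiny 2x3 grayscale "images"; values are made up.
population = [[200, 50, 130], [10, 255, 128]]
income = [[180, 220, 90], [240, 129, 128]]
print(combine_masks(population, income))
```

The result is itself a mask, so it can be layered over the map or combined with a third slice in the same way.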

If you can count something, you can color it. Duh. However, this is helpful in deciding how to color objects in your map. Tools like Color Brewer help users make good palettes for complex data scenes.

If you want to count text so you can color it, one easy way is to md5 hash the text. This gives you one color for that snippet of text wherever it is used. Two similar greens do NOT mean the text is similar, but two identical reds DO mean the text is the same.
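
A short sketch of the hashing trick: take the MD5 digest of the text and use its first three bytes as an RGB hex color. Using the first six hex digits is my own choice; any fixed slice of the digest would work.

```python
import hashlib


def text_to_color(text):
    """Map text to a stable hex color using the first 3 bytes of its MD5 hash."""
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return "#" + digest[:6]


print(text_to_color("seattle"), text_to_color("Seattle"))
```

Because a hash scrambles its input, even a one-letter change gives an unrelated color, which is exactly why nearby colors say nothing about how similar the text is.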

Spark lines are neat. Old concept from Tufte, but not used enough.

Neat visualization: IBM History flow.

The second step in web visualization is to identify the objects of interest. People, places, events, locations, costs, weights, etc… Display the objects, draw crude relationships between them, and look for other relationships. Look at the scene in terms of one object, then look at the objects that are related to that object, then keep going.

Examples: graphVis, touch graphs.

The third step is interaction.
Sliders are easy. Easy is good. Pick a data value or relationship, and allow the user to see what happens when they alter it via a slider.

What is better than a slider? A Scented Map. A scented map is a slider that displays data itself. A slider with a chart IN it.

Example: Measure Map.

What is better than a Scented map? A play button that allows the slider to animate over its length. You now have animation…

The fourth step is to provide links to the visualization, and allow scripting. If you embed data on how to generate the exact visualization a user is seeing inside the URL, you can use tools like Paparazzi to get screen shots, and animations.
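
Here is one way the URL idea could look: the full view state is written into the query string, so the same URL always reproduces the same picture. The base URL and the parameter names are hypothetical.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def view_to_url(base, state):
    """Encode the visualization state as query parameters."""
    return base + "?" + urlencode(sorted(state.items()))


def url_to_view(url):
    """Recover the state dictionary from a URL built by view_to_url."""
    query = parse_qs(urlparse(url).query)
    return {key: values[0] for key, values in query.items()}


# Hypothetical base URL and parameter names.
state = {"dataset": "crime", "zoom": "12", "year": "2007"}
url = view_to_url("http://example.com/viz", state)
print(url)
```

Sorting the parameters keeps the URL identical for identical states, which helps with caching and with scripted screenshots that step one parameter at a time.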

Provide an API to your visualization. Then use it. Make sure your API is robust. Provide multiple return values so users can choose how to use it with performance in mind. Never make the user make extra calls if you don’t have to.

Google quote: Trying things is cheaper than deciding to do them or not.

A note on complexity and technology: If you have a few hundred items, just use HTML. If you have a few thousand, try Flash. If you have tens of thousands of items, use a Java applet. If you have more, use a thick client. Each of these technologies increases your download time, so decide wisely.
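
The rule of thumb can be written down as a small lookup. The exact cutoffs below are my assumptions, since the notes only give rough sizes ("a few hundred", "a few thousand", "tens of thousands").

```python
def suggest_client(item_count):
    """Map an item count to the technology tier from the notes.

    The exact cutoffs are assumptions; the notes only give rough sizes.
    """
    if item_count < 1_000:
        return "HTML"
    if item_count < 10_000:
        return "Flash"
    if item_count < 100_000:
        return "Java applet"
    return "thick client"


for count in (300, 4_000, 50_000, 2_000_000):
    print(count, suggest_client(count))
```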

One thing you can do to decrease download time is to not give all the data right away, either feed it to them slowly, or have it start gathering real time data only after starting.
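
Feeding the data slowly can be as simple as handing it over in batches. The sketch below is a generic chunking generator; the batch size and the list of integers standing in for data points are placeholders.

```python
def in_chunks(items, size):
    """Yield successive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


points = list(range(10))          # stand-in for real data points
for batch in in_chunks(points, 4):
    print(batch)                  # a real client would draw each batch as it arrives
```

The user sees something on screen after the first batch instead of waiting for the whole download.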

Getting data, web scraping.
A company called Every Block is looking for data on the web that can be used to tell you about your block. They are gathering similar data as Onvia, but may not know how valuable it really is.

There is a lot of free data out there; you can go get it and visualize it for free.

Tufte tells of the idea of getting rid of chart junk. His ideas are great for static images, but with dynamic visualizations they are harder to use. Getting rid of junk still works. Don’t show crap on your visualization that is not needed. Make each chart unique to show the data it is designed for. You should rarely use general purpose visualizations.

One last thing: build visualizations to answer a question, but also build them to ask a new question. If it does not make the user play, they will not learn, and insight will not be found. If the visualization presents an answer, and does not ask another question, the user will not play.

Stuff for me to do more research on:
Technology, Flare is like flex...
Microforms, mofo…
Atom feed standard…
Visualcomplexity.com…
