Getting Ruby-WordNet working on Mac OS X

April 5th, 2009
This took me longer than it should have, so I thought that I would share.  As always, YMMV.

Install BDB

Download BDB: http://www.oracle.com/technology/software/products/berkeley-db/index.html

cd db-4.*
cd build_unix
../dist/configure
make
sudo make install

Download the Ruby BDB bindings: http://moulon.inra.fr/ruby/bdb.html

cd bdb-0.6.*
ruby extconf.rb --with-db-lib=/usr/local/BerkeleyDB.4.7/lib/ --with-db-include=/usr/local/BerkeleyDB.4.7/include
make
sudo make install

Install WordNet and Ruby-WordNet

Download and install WordNet: http://wordnet.princeton.edu/obtain

Download and install the Ruby-WordNet gem: http://www.deveiate.org/projects/Ruby-WordNet

Go to the location of the installed gem (because I use MacPorts, mine was /opt/local/lib/ruby/gems/1.8/gems/wordnet-0.0.5) and run:

sudo ruby convertdb.rb
sudo mv ruby-wordnet /opt/local/share/
sudo chmod 777 /opt/local/share/ruby-wordnet/*

I don't love that last chmod, but I think the Ruby gem needs to be able to add to the database files.

For WordNet documentation, run 'gem server' and go to http://localhost:8808 in your browser.

Hope this helps!

SelectorGadget Released

February 27th, 2009

I released SelectorGadget yesterday and was tickled to see it go to the top of Hacker News and get a bunch of tweets.  I'm very pleased that people are finding it exciting, and I look forward to seeing how it gets used!

Free inline JavaScript JSON editor

February 17th, 2009

I'm pleased to release a simple inline JSON editor that Kyle Maxwell and I wrote a little while ago. It features a structure editor, undo and redo, and a plain text view. Here is a live example:

Read the rest of this entry »

How to Set Up Simple Automated Database Backups

January 8th, 2009

So, your website is in a personal SVN repository or on GitHub. Deployment is as simple as checking out your website on the server with Capistrano or by hand. But what about your database? How do you back that up? Maybe your VPS or shared server has nightly remote backups. That's great for a restore, but you could still lose a day or more of data from your database.

Here's a super easy off-site backup option.

Read the rest of this entry »

Adding and removing remote git tags

December 4th, 2008

Just as a cheat sheet since this seems to be harder than it should be, here is how you add and remove remote tags in git. If I'm missing something, please comment!

Adding a tag:

  • git tag tag_name
  • git tag -l
         Should show your new tag.
  • git push origin --tags or git push origin tag_name
         Because git push doesn't push tags by default.

Removing a tag:

  • git tag -d tag_name
  • git tag -l
         Should no longer show your tag.
  • git push origin :refs/tags/tag_name
         Because git push --tags doesn't push deleted tags.

Thanks 知易行难!

Geek tourism wiki or website?

December 1st, 2008

Now that I'm doing independent development and consulting, I'm more free to travel than I was previously.  I am in Philadelphia at the moment and so earlier today I posted a message on Hacker News inquiring about geeky activities to be had around the area.  I only got one reply, but it was a good one, and I enjoyed geeking out and seeing the remains of ENIAC.

All this has got me thinking about geek travel in general.  When I was in Seattle a few months ago I made a similar online post and received some suggestions, but I suspect that only a few people saw these posts and fewer actually knew the areas.  What I'd really like is a comprehensive geek travel site or wiki.  Does anyone know about such a thing?  Do you think I should make one?

Edit: From the comments, NuNomad looks very cool!  Thanks Clayton!

Orders of complexity: me, iPhone, ENIAC

December 1st, 2008

Orders of complexity: me, iPhone, ENIAC.

I explored the University City area of Philadelphia today and saw a surviving part of ENIAC in the Moore Building at UPenn.

CSSEvolve: guided stylesheet evolution

November 26th, 2008

Saw an interesting post today by John Resig about JavaScript-based genetic A/B testing. Very cool prospect. I hope that Greg Dingle releases it as open source.

This reminded me of a project that I did a while ago and never really released. It's called CSSEvolve and it uses a traditional blind watchmaker / user-driven genetic algorithm to drive CSS changes on a site of the user's choosing. Basically, a set of mutated CSS variants is produced, the user selects the changes that he or she likes, the algorithm randomly combines those changes through crossover and mutation, and the process repeats. It works best on heavily CSS-based sites, and isn't perfect, but it's a fun example of actual "intelligent design" on the part of the user to guide an evolutionary process.
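
The loop described above can be sketched in a few lines of JavaScript. This is only an illustration, not CSSEvolve's actual code: the GENE_POOL values, the function names (crossover, mutate, nextGeneration, toCss), and the idea of representing a variant as a plain property-to-value object are all assumptions made for the example.

```javascript
// Hypothetical sketch of a user-driven genetic algorithm over CSS variants.
// A "variant" is a plain object mapping CSS property names to values.

// Candidate values per property; these pools are made up for illustration.
const GENE_POOL = {
  "color": ["#222", "#036", "#600"],
  "font-size": ["14px", "16px", "18px"],
  "line-height": ["1.2", "1.5", "1.8"],
};

// Pick an element of an array using the supplied random source,
// which must return numbers in [0, 1) like Math.random.
function pick(items, rng) {
  return items[Math.floor(rng() * items.length)];
}

// Uniform crossover: each property is inherited from one parent at random.
function crossover(a, b, rng = Math.random) {
  const child = {};
  for (const prop of Object.keys(GENE_POOL)) {
    child[prop] = rng() < 0.5 ? a[prop] : b[prop];
  }
  return child;
}

// Mutation: with probability `rate`, swap a property's value for a
// random one from its pool.
function mutate(variant, rate, rng = Math.random) {
  const out = { ...variant };
  for (const prop of Object.keys(GENE_POOL)) {
    if (rng() < rate) out[prop] = pick(GENE_POOL[prop], rng);
  }
  return out;
}

// Build the next generation from the variants the user liked.
function nextGeneration(liked, size, rate = 0.1, rng = Math.random) {
  if (liked.length === 0) throw new Error("select at least one variant");
  const next = [];
  while (next.length < size) {
    const child = crossover(pick(liked, rng), pick(liked, rng), rng);
    next.push(mutate(child, rate, rng));
  }
  return next;
}

// Render a variant as a CSS rule for the given selector.
function toCss(selector, variant) {
  const body = Object.entries(variant)
    .map(([prop, value]) => `${prop}: ${value};`)
    .join(" ");
  return `${selector} { ${body} }`;
}
```

In this sketch the user plays the role of the fitness function: each round, the variants they pick are passed to nextGeneration, and the resulting objects are turned into rules with toCss and shown for the next round of selection. The rng parameter exists only so the random source can be swapped out for repeatable runs.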

I would enjoy people's thoughts on how this could be an actual tool instead of an experimental toy.

Replacement for script onload in IE

November 23rd, 2008

Firefox and Safari support an onload event for SCRIPT elements.  That is, you can dynamically add a new script to a page, set its onload event to fire a callback, and know when the script has been successfully loaded.  You would want to do this because when you include a bunch of new SCRIPT tags in a page, there is no guarantee about the order in which the browser will evaluate them, which makes dependencies among the scripts difficult to resolve.  Using onload to chain the script additions is one solution to this; however, Internet Explorer doesn't seem to support onload on SCRIPT elements.  Here is a workaround:

Read the rest of this entry »
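
The workaround itself is in the full entry. As a rough sketch of one commonly used approach, which may differ from the one in the full post: Internet Explorer fires onreadystatechange on SCRIPT elements, and the script can be treated as loaded once its readyState is "loaded" or "complete". The function names loadScript and loadInOrder, and the optional doc argument, are invented here for illustration. The code sticks to old-style syntax because the browsers in question predate modern JavaScript.

```javascript
// Load a script and call `done` exactly once, whichever event the browser fires.
// `doc` is optional and defaults to the page's document; it is a parameter
// only so the function can be exercised outside a browser.
function loadScript(src, done, doc) {
  doc = doc || document;
  var script = doc.createElement("script");
  var fired = false;

  function finish() {
    if (fired) return;           // guard against both events firing
    fired = true;
    script.onload = script.onreadystatechange = null;
    done(script);
  }

  // Firefox, Safari, and other browsers that support onload on SCRIPT.
  script.onload = finish;

  // Internet Explorer: watch readyState instead.
  script.onreadystatechange = function () {
    if (script.readyState === "loaded" || script.readyState === "complete") {
      finish();
    }
  };

  script.src = src;
  doc.getElementsByTagName("head")[0].appendChild(script);
  return script;
}

// Chain the additions so each script is added only after the previous
// one has loaded, which keeps dependencies in order.
function loadInOrder(sources, allDone, doc) {
  if (sources.length === 0) {
    allDone();
    return;
  }
  loadScript(sources[0], function () {
    loadInOrder(sources.slice(1), allDone, doc);
  }, doc);
}
```

With this, something like loadInOrder(["library.js", "plugin.js"], callback) (file names hypothetical) would add plugin.js only after library.js has reported that it loaded.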

Is there a Google penalty for mashups?

November 22nd, 2008

Edit: Since writing this, my traffic has started to rise again.  I'm not sure if this is due to my contacting Google, or just a natural progression of events.

Since leaving my previous job I've been working on a bunch of different projects.  The first one to be released was RecreationParks.net, a data mashup site combining naming information from the USGS with geocoded Wikipedia articles (thanks geonames.org!), Flickr photos, web searches, weather, maps, and user-submitted park activity information and links.  The idea was to get some search-based traffic from the long tail of regional web searches by making a parks page for every public park in the United States (around 60k in the USGS data set).  Most of these parks are tiny, but I figured someone might still care to search for them, and I tried to bring in as much smarts and data to mash up as I could, given my development timeframe of around a week and a half.

The initial traffic was nothing to write home about, but the derivative was positive, and so was I.  I had posted a link on Hacker News and a few other sites, written a Rails sitemap controller, and submitted the sitemap to Google.  I even outsourced some modest work to my online virtual assistant, who did a very good job building up link exchanges.  (More on my experiences with an online VA in a future post.)

After the HN traffic fell off, natural search traffic took over and grew slowly. Over the time span of the above graph, search sent 3,655 total visits via 3,385 keywords, for parks all over the country. That's pretty decent breadth of exposure. Google started to index through the sitemaps (around 5,000 pages so far) and traffic was small but growing, with pages on RecreationParks.net showing in the top 10 results for a large range of very niche keywords. Then, probably due to a ranking adjustment, traffic suddenly fell at Peak A. Soon after it started to rise again with a nice trend line, only to fall again precipitously at Peak B to practically nothing. Given that the vast majority of the traffic was coming via Google, these changes should be explained by changes at those times in Google's view of the site.

I think that the fall after Peak A was the result of the "newness" of the site wearing off, causing it to decrease in the Google rankings. What I find strange is that this happened twice. I'd expect the Peak A adjustment, but I was surprised (and disappointed) by the fall after Peak B. It seemed like users were getting something out of the site: some comments were left, and quite a few parks received user feedback. Then, thunk: the site fell out of almost every Google listing in which it had previously been in the top 10. Is Google penalizing mashup-based sites that re-mix existing content? Did the site get (falsely, I'd claim) classified as duplicate content or spam? What do you think? Why would I see such a drastic reduction in traffic?

Since this was just a side project, I'm content to leave it up and wait and see if it slowly makes its way back into the listings. Granted, this site doesn't have much original content, but I still think it's useful in providing results on very niche and under-served searches. It'll be interesting to see how the ranking changes in the future and to try to infer from these changes an understanding of how Google does their ranking. This much is clear: one can't rely on Google for all of their traffic, and getting into search listings is harder than I thought. Come on, big G, give me some love!

Greetings, take two.

October 20th, 2008

Well, I'm back after being away from blogging for quite a while.  I recently left my job to work on my side projects full time.  This was a tough decision, but also something that I have wanted to do for a long time.  I'm going to use this space to chronicle what I learn while attempting to make a living on the Internet, while starting my own business, and while taking the time to explore what excites me.  I live in San Francisco.  Please feel free to contact me if you're interested in the same things.

8 Technologies That You Must Know About When Going Into A (Technical) Startup Interview

January 28th, 2008

Recently, I was advising a friend about startup interviews and came up with this list of technologies that you simply must know about when going into a technical startup interview. You don't need to be an expert in every one of these, but I think you should know that they exist and understand what each one does at a high level.

So, in no particular order...

  1. Most web applications are database driven, so know about database scaling and performance. Also: Memcached (distributed memory cache), caching proxies like Squid, and caching techniques in general.
  2. Machine learning and data mining techniques -- at least an understanding of their potential. There is way too much to go into here, but play with the open source package called Weka. Also check out the excellent introductory, hands-on book Collective Intelligence by Toby Segaran.
  3. Lucene, an open source search engine.
  4. If you're applying to (or working at) an interesting startup, you'll probably have large quantities of data to process, so know about Hadoop, the open source answer to Google's MapReduce paradigm. Consider in combination with Amazon's EC2 (see below).
  5. The Google File System (GFS) and Google's BigTable. These projects represent the current cutting edge in data storage for scalable web applications. But if you want to use them, you'll have to join Google or use one of these projects that offer some (but not all) of their features: Amazon's S3 and SimpleDB (see below), MogileFS, Global File System, and Hadoop's HDFS file system with HBase acting as BigTable (though HBase may not be ready for prime time just yet). There are, of course, other solutions of varying complexity as well.
  6. Amazon web services: S3 (backups, reliable data store; data archive; file serving), EC2 (virtualized, scalable utility computing; file processing; server environments -- and you should know something about machine virtualization in general as well), and to a possibly lesser (and unproven) extent SimpleDB (scalable database replacement for some types of applications -- you should have experience with MySQL and the SQL language too).
  7. Ruby on Rails -- even if you're not working in Ruby or deploying a Rails app, Rails is a powerful environment for rapid prototyping and experimentation, plus a very marketable skill in the current climate.
  8. Almost without saying: the obvious frontend interface technologies of HTML, CSS, JavaScript, and Flash/Flex.

All of the above are fairly language agnostic. You should know a couple of programming languages quite well, preferably one scripting language (probably one of Ruby, Perl, Python, or PHP), and one 'harder' language such as Java or C++. Be prepared to write code and answer questions in your chosen languages.

When interviewing at a startup, or any place really, make sure to a) explain your thoughts when solving problems (don't just think to yourself for 5 minutes), b) talk about what excites you (technologically and otherwise) and your awesome side-projects, c) be willing to talk about the flaws as well as the strengths of technologies, d) know something about the technological area of the startup, and e) actually know the subjects you claim to understand on your resume.

Everyone is hiring right now (including my employer)! So read up, do some side projects, and good luck!

What do you think about this list? Please suggest technologies and links that I missed!

The Best Color Manipulation Tools

June 25th, 2006

Here are some of my favorite online tools for exploring and manipulating color. What tools do you use?

Color Blender: a very convenient tool for blending any two colors with a varying number of midpoints.

A list of colors by name and color code: for those of us who are color blind, this is very helpful for looking up colors by name.

Color Code Chooser: a tool that helps you manipulate colors in a number of very helpful ways.

Human Computation

April 11th, 2006

A few days ago I attended a talk by Luis von Ahn from CMU. Luis von Ahn is one of the creators of the ESP Game and Peekaboom, both interactive, multiplayer online games that harness human computing power while also being entertaining. These games get people to help label images, generating data that will ultimately be used to make better image search engines and better computer vision image analysis and segmentation algorithms. Basically, Amazon Mechanical Turk got it wrong: fun is a better motivation than money.

Luis von Ahn opened his talk by saying that people spend many millions of human hours on solitaire each year, and that it would be useful if even a small fraction of that time could be harnessed to get people to play games that are also useful. If this is his goal, he has succeeded — some people spend over 40 hours a week on the ESP Game, which has been very successful.

Luis also presented a general approach for turning computationally hard pattern recognition problems into two player games and suggested that many problems can be solved in this way. He is currently thinking about such things as language translation and common-sense knowledge collection.

While it is cool that the ESP Game, if adopted by a major gaming site like Yahoo! Games, could label most of the web’s images in just a few months, the most exciting thing for me is the wealth of training data that this would generate for researchers to make better computer vision algorithms. This would also be true for things like language translation and common-sense knowledge collection — these would empower new algorithms.

Luis ended by pointing out that The Matrix got it all wrong: we’re useless as batteries, but we make great pattern recognition subroutines. That’s why the computers will need to keep us around, at least for the time being. They keep us entertained, and we compute for them. Oddly enough.

Spore

March 16th, 2006

If you haven’t seen this, you absolutely have to check out this video (or the more complete version) about Will Wright’s new game, Spore. In Spore, you start life as a microorganism and evolve to the point of terraforming whole galaxies and creating interstellar civilization. The scope of this game is immense, and this game is exciting on so many levels, not least of which is its obvious giant leap forward for game design, artificial life, and procedurally textured landscapes, environments, and worlds. I’m blown away.

I think one of Spore’s largest advancements is its leverage of other players’ content to bootstrap your world. Instead of requiring the game developers to come up with varied designs for thousands of species of life forms or civilizations, Spore finds content created by other players that will work well with your needs, and brings it to you in the form of tools or buildings to be purchased, other life forms, and alien races to encounter. This is brilliant. The game can only get more advanced and varied as players use it.

Spore’s worlds and creatures are procedurally generated. I previously wrote about procedurally generated environments, but I’ve never seen anything like this before. You can create creatures by combining many different parts, each of which has functionality. You can also reshape and mold parts as if they were clay. Then, the system analyzes the morphological structure of your newly created creature and figures out how it might move — how it should walk, fight, eat, mate, and more. The generated movements are plausible and visually pleasing. No motion capture or hand-animation required.

Spore tackles an incredible scale. I watched the video, and at every stage I thought, “wow, that’s a great game!”, then I found out that what I had seen was just the tutorial/prerequisite for the next, even larger stage of game play.