<?xml version="1.0" encoding="iso-8859-1"?>
<!-- generator="FeedCreator 1.7.2" -->
<rss version="2.0">
    <channel>
        <title>MedWorm Tags: big data</title>
        <description>MedWorm provides a medical RSS filtering service. Over 6000 RSS medical sources are combined and output via different filters. This feed contains the latest medical blog items that have been tagged with 'big data'.</description>
        <link><![CDATA[http://www.medworm.com/rss/search.php?qu=%22big+data%22&t=%22big+data%22&r=Exact&o=d&f=tag]]></link>
        <lastBuildDate>Sat, 03 Sep 2011 02:54:19 +0100</lastBuildDate>
        <item>
            <title>If you have too much data, then “good enough” is good enough</title>
            <link>http://www.medworm.com/index.php?rid=4902616&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FUPd_4dJzp1g%2F</link>
            <description>I would suggest that all my friends in the world of bioinformatics read this fabulous article by Pat Helland. Pat&amp;#8217;s one of the leading experts in distributed transactions and knows more about databases than most of us put together. His ACM article goes into some of the tradeoffs and changes in mindset that need to be made when working with data that changes, comes from different sources, and all too often has ambiguity associated with it. It also tells you a little bit about the differences between SQL and NoSQL systems when it comes to transaction semantics, in a way that makes complete sense. 
	Perhaps the most interesting part of the article was the section on &amp;#8220;Mulligan stew&amp;#8221; where he also provides the example of building a heterogeneous catalog. A product catalog...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=4902616</comments>
            <pubDate>Mon, 06 Jun 2011 19:16:36 +0100</pubDate>
            <guid isPermaLink="false">4902616</guid>        </item>
        <item>
            <title>Data, software, and money</title>
            <link>http://www.medworm.com/index.php?rid=4876473&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FNumKmYkZXMY%2F</link>
            <description>Steve O&amp;#8217;Grady has written a blog post about a recent talk he gave at OSBC. In the post he welcomes the Age of Data. The talk covers two topics of great interest, software and data. In the context of the life sciences I have worked on both the &amp;#8220;data as a product&amp;#8221; side and on the packaged software side. He notes that none of the top &amp;#8220;software&amp;#8221; companies in the world are of recent vintage. These are companies making money from selling software (a really difficult business in the sciences). He argues that data-driven products are where the market is. The success of Google and others is a testament to this, but in the sciences the entire model of data as product has never worked. I would argue that this is partly because we&amp;#8217;ve always sold the data itself ...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=4876473</comments>
            <pubDate>Sun, 29 May 2011 00:38:14 +0100</pubDate>
            <guid isPermaLink="false">4876473</guid>        </item>
        <item>
            <title>The data is the question</title>
            <link>http://www.medworm.com/index.php?rid=4684638&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FrSNtgf1kFnE%2F</link>
            <description>I have long channeled Jeff Jonas and his ideas around data finds data. His recent blog post on the data being the query extends some of those thoughts. I find this trend fascinating, although I favor the just-in-time data approach, since not all information needs to be acted upon instantly, but the broader point holds. I had a similar discussion with Richard Durbin recently around data-first science, where we discussed collecting data and then querying it to generate hypotheses and to see how the new data impacted existing knowledge.
	It&amp;#8217;s going to be interesting to see how today&amp;#8217;s life science data systems evolve. The data-driven approach which I talk about a lot is one that is essential for modern biological research (saw a great talk on this by Joel Dudley recently); usi...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=4684638</comments>
            <pubDate>Tue, 05 Apr 2011 13:54:59 +0100</pubDate>
            <guid isPermaLink="false">4684638</guid>        </item>
        <item>
            <title>Something to ponder</title>
            <link>http://www.medworm.com/index.php?rid=4653491&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FX92wTUvET3w%2F</link>
            <description>The scale of modern life science research, where scale is not just about data volume, but also about rate of change, number of users, geographic scale, etc., means that resources have to look at how they provide services differently and, more importantly, funding agencies and philanthropists have to decide where to draw the line. Is this an opportunity for commercial efforts? Is the market ready to do this, or are they willing to live with overall inefficiencies and limitations? Is there a tiered model that would be acceptable?
	Recent discussions and observations of what various companies and orgs are doing lead me to believe that we need to really think hard about overall efficiencies and consider the value of time. More later (Source: business|bytes|genes|molecules)</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=4653491</comments>
            <pubDate>Tue, 29 Mar 2011 07:03:03 +0100</pubDate>
            <guid isPermaLink="false">4653491</guid>        </item>
        <item>
            <title>Practical machine learning and scaling data platforms</title>
            <link>http://www.medworm.com/index.php?rid=4552124&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2F3JkrNEoA2Ks%2F</link>
            <description>A couple of great posts on the Metamarkets blog recently that might be of relevance to the bioinformatics crowd. The first one, by Mike Driscoll, talks about lessons for building a petabyte data platform. Their four guiding principles:
	
	Experiment often, fail fast
	Keep things simple to scale well
	Keep things modular to accommodate change
	Avoid undifferentiated heavy lifting
	
	I still feel that the data systems we have in the life science domain aren&amp;#8217;t doing enough to learn good lessons from the web world, which is embracing change, complexity and scale. Even small teams, like the one at Metamarkets, are able to do a lot with less, thanks to the kinds of principles mentioned in the post. One of the problems I see in informatics is a lack of appreciation for some of the skill...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=4552124</comments>
            <pubDate>Sat, 05 Mar 2011 19:38:59 +0100</pubDate>
            <guid isPermaLink="false">4552124</guid>        </item>
        <item>
            <title>Data and a product mindset</title>
            <link>http://www.medworm.com/index.php?rid=4477981&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2F61xj8hYcsnw%2F</link>
            <description>Lots of interesting discussion around the web on the rise of data-driven startups and product teams. Russell Jurney&amp;#8217;s blog post on Analytic Product Teams has picked up a lot of press, and in general that is a topic that the LinkedIn SNA team talks about quite a bit. Bradford Cross has eloquently covered Research-driven startups and, more recently, the topic came up in a Dataspora article on mining big data.
	What strikes me about this, especially in light of Neil&amp;#8217;s recent post on data scientists and my own past, is that in some ways the social science space is going through a fascinating discovery about the value of data-driven products, something that some of the web giants have been doing for a long time. The difference now is that (a) there is an abundance of data, data sources ...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=4477981</comments>
            <pubDate>Mon, 14 Feb 2011 14:32:43 +0100</pubDate>
            <guid isPermaLink="false">4477981</guid>        </item>
        <item>
            <title>Algorithms running day and night</title>
            <link>http://www.medworm.com/index.php?rid=4455410&amp;cid=t_200769_132_f&amp;fid=35006&amp;url=http%3A%2F%2Fnsaunders.wordpress.com%2F2011%2F02%2F09%2Falgorithms-running-day-and-night%2F</link>
            <description>Warning: contains murky, somewhat unstructured thoughts on large-scale biological data analysis
Picture this. It&amp;#8217;s based on a true story: names and details altered.
Alice, a biomedical researcher, performs an experiment to determine how gene expression in cells from a particular tissue is altered when the cells are exposed to an organic compound, substance Y. She collates a list of the most differentially-expressed genes and notes, in passing, that the expression of Gene X is much lower in the presence of substance Y.
Bob, a bioinformatician in the same organisation but in a different city to Alice, is analysing a public dataset. This experiment looks at gene expression in the same tissue but under different conditions: normal compared with a disease state, Z Syndrome. He also notes ...</description>
            <author>What You're Doing Is Rather Desperate</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=4455410</comments>
            <pubDate>Wed, 09 Feb 2011 03:41:21 +0100</pubDate>
            <guid isPermaLink="false">4455410</guid>        </item>
        <item>
            <title>Jeff Hammerbacher on evolving analytical platforms</title>
            <link>http://www.medworm.com/index.php?rid=4442078&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FfIV7JtZnFas%2F</link>
            <description>This talk from Jeff Hammerbacher is worth a listen. It gives you a good history of enterprise data challenges, some of the reasons why Hadoop became a big deal so quickly, and a good sense of the evolving Hadoop ecosystem.
	
Jeff Hammerbacher on Evolving a New Analytical Platform &amp;#8211; Orbitz IDEAS from Orbitz IDEAS on Vimeo (Source: business|bytes|genes|molecules)</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=4442078</comments>
            <pubDate>Sun, 06 Feb 2011 05:35:47 +0100</pubDate>
            <guid isPermaLink="false">4442078</guid>        </item>
        <item>
            <title>Data, networks and society</title>
            <link>http://www.medworm.com/index.php?rid=4327003&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FqMAAzjXOv1A%2F</link>
            <description>I am fascinated by the entire field of social network analysis. Given all the data, however incomplete and noisy, at our disposal today, we can learn a lot about our behavior and habits. While some companies and organizations leverage these data to sell more targeted advertising and improve the relevance of search results, there is a whole slew of socio-economic and other metrics/trends that we can evaluate as well. 
	For example, Jake Hofman does some fabulous work on how to analyze social network dynamics. There are many others doing much the same. An example of some interesting work in this area comes to us from Marcel Salathé and co-workers at Penn State. Their paper on &amp;#8220;A High-Resolution Human Contact Network for Infectious Disease Transmission&amp;#8221; looks into the num...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=4327003</comments>
            <pubDate>Sun, 09 Jan 2011 17:39:37 +0100</pubDate>
            <guid isPermaLink="false">4327003</guid>        </item>
        <item>
            <title>Hans Rosling’s “Joy of Stats”</title>
            <link>http://www.medworm.com/index.php?rid=4302236&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2Fpb9DlJ7Snls%2F</link>
            <description>Hans Rosling is a great presenter on the value of statistics and he now has a BBC documentary to go with it (Source: business|bytes|genes|molecules)</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=4302236</comments>
            <pubDate>Fri, 31 Dec 2010 21:51:33 +0100</pubDate>
            <guid isPermaLink="false">4302236</guid>        </item>
        <item>
            <title>A few good links</title>
            <link>http://www.medworm.com/index.php?rid=4225532&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FEQGwfrXIYX4%2F</link>
            <description>There&amp;#8217;s a lot of interesting stuff out there which I don&amp;#8217;t have time to blog about, so here are some links that I might end up blogging about later:
	
	NASA, a little bit of hyperbole, but some cool biochemistry. You&amp;#8217;ve all seen the news. Here is the paper (in Science). A few interesting blog posts by Steve Betz, PZ Myers and Derek Lowe
	Science and gameplay. Phylo is to comparative genomics what Foldit is to structure prediction
	LinkedIn, the place for data scientists. Or so it seems, as they add Daniel Tunkelang to an excellent team of data geeks (Source: business|bytes|genes|molecules)</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=4225532</comments>
            <pubDate>Fri, 03 Dec 2010 06:56:31 +0100</pubDate>
            <guid isPermaLink="false">4225532</guid>        </item>
        <item>
            <title>Learning the hard way</title>
            <link>http://www.medworm.com/index.php?rid=4142924&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FQ4Z3IdCJ2QM%2F</link>
            <description>Ben Black has a great blog post on GigaOm on scale-driven database architecture. There are two key messages there that I would like to reiterate in the context of modern biology. The first comes right at the beginning of the post:
	Scale breaks everything. Scale even breaks your assumptions about how best to store and query data. Scale does not care about your personal engineering preferences, or about SQL vs. NoSQL. The demands of rapid growth and ever-higher expectations for availability, performance, and cost efficiency force you to re-evaluate and re-imagine what you need, what is possible, and how to best achieve your business/scientific goals.
	The second message comes in right at the end:
	Lost in all the debates about SQL vs. NoSQL, ACID vs. BASE, CAP, and all the rest is simply ...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=4142924</comments>
            <pubDate>Sat, 06 Nov 2010 18:41:39 +0100</pubDate>
            <guid isPermaLink="false">4142924</guid>        </item>
        <item>
            <title>Lessons from Swivel</title>
            <link>http://www.medworm.com/index.php?rid=4119476&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FPvDVvw4jMlA%2F</link>
            <description>I&amp;#8217;ve written about Swivel in the past, but I never really got around to using it. Well, Swivel is no more. Robert Kosara interviewed the co-founders about the rise and fall of Swivel (interestingly, both had left Swivel prior to this news). Read the entire interview; it reminded me that in business, what might seem obvious with one thing doesn&amp;#8217;t translate as well to others. But in the end it seems there were a lot of mistakes in execution. Perhaps Swivel was not the kind of business meant to be just its own business, but part of a larger operation. Perhaps they should have worked harder on the data sets they could get their hands on. I wasn&amp;#8217;t there, so I can&amp;#8217;t say for sure, and armchair quarterbacking is an easy exercise.
I also think that a part of the...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=4119476</comments>
            <pubDate>Thu, 28 Oct 2010 06:10:01 +0100</pubDate>
            <guid isPermaLink="false">4119476</guid>        </item>
        <item>
            <title>The data danger zone</title>
            <link>http://www.medworm.com/index.php?rid=4031424&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FYXvzgVBdqOg%2F</link>
            <description>Drew Conway has come up with a Data Science Venn Diagram. My favorite bit from the diagram is the &amp;#8220;danger zone&amp;#8221;. Drew positions the danger zone as follows:
Finally, a word on the hacking skills plus substantive expertise danger zone. This is where I place people who, “know enough to be dangerous,” and is the most problematic area of the diagram. In this area people who are perfectly capable of extracting and structuring data, likely related to a field they know quite a bit about, and probably even know enough R to run a linear regression and report the coefficients; but they lack any understanding of what those coefficients mean. It is from this part of the diagram that the phrase “lies, damned lies, and statistics” emanates, because either through ignorance or mali...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=4031424</comments>
            <pubDate>Tue, 05 Oct 2010 09:24:15 +0100</pubDate>
            <guid isPermaLink="false">4031424</guid>        </item>
        <item>
            <title>Data and the right people</title>
            <link>http://www.medworm.com/index.php?rid=4025735&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FAWCxEW662Yk%2F</link>
            <description>To answer the right questions you need the right people
That&amp;#8217;s the last line from a blog post by Steve O&amp;#8217;Grady. In Even with Big Data, it&amp;#8217;s hard to ask the right question, Steve points out that with large amounts of data, asking the right question is quite hard. His point, channeling Kevin Weil, is that with a lot of data, asking the right questions becomes critical. In the sciences this gets magnified, because the questions we ask are critical to developing new hypotheses and, as a former colleague of mine always said, it&amp;#8217;s always about our point of view. In other words, framing questions is critical and the results we get depend on the questions we ask and how we are asking them. As we generate more and more biological data, our biggest challenge ...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=4025735</comments>
            <pubDate>Sat, 02 Oct 2010 07:55:24 +0100</pubDate>
            <guid isPermaLink="false">4025735</guid>        </item>
        <item>
            <title>Ilya Grigorik on machine learning and Ruby</title>
            <link>http://www.medworm.com/index.php?rid=3999182&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2F5yRMHgkwQ4o%2F</link>
            <description>Ilya Grigorik has a great set of slides up on SlideShare from his recent talk at the 2010 Golden Gate Ruby conference. The talk, called Intelligent Ruby + Machine Learning, is the kind of presentation I absolutely love. He talks about the what, the why, the trends, and relevant tools.
Over the past few years, I&amp;#8217;ve become fascinated with machine learning. For the longest time, from my perspective, machine learning was something academics used to play around with models without significant real-world utility. The availability of data and computing has changed that, and today I am a convert to the power of machine learning, and wish we pushed the envelope more, at least in the life sciences. Some of this change in opinion is due to the adoption of machine learning in non-academic sett...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3999182</comments>
            <pubDate>Fri, 24 Sep 2010 07:54:22 +0100</pubDate>
            <guid isPermaLink="false">3999182</guid>        </item>
        <item>
            <title>Machine learning at scale at Google</title>
            <link>http://www.medworm.com/index.php?rid=3982087&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FCD6ILq6hA48%2F</link>
            <description>Greg Linden points us to a great paper (pdf) on machine learning by folks at Google that was presented at LADIS &amp;#8217;10 (I&amp;#8217;d love to go some day).
The presentation covers Sibyl, a &amp;#8220;system for large scale machine learning&amp;#8221;, and Parallel Boosting, an iterative approach that does well at predictions based on sparse data. The Boosting page says that the boosting approach is designed to work with semi-accurate rules of thumb (made me think of ligand pose scoring for some reason). As might be expected from a Google approach, it is embarrassingly parallel and uses the following approach


(image from the talk PDF)
They also talk about how they leverage RAM, lots of cores and GFS (column store). Greg does a great job of covering some of those aspects. This method allows the ...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3982087</comments>
            <pubDate>Sun, 19 Sep 2010 03:28:12 +0100</pubDate>
            <guid isPermaLink="false">3982087</guid>        </item>
        <item>
            <title>Data geeks and biology</title>
            <link>http://www.medworm.com/index.php?rid=3845239&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FShGFWOdSkgM%2F</link>
            <description>I&amp;#8217;ve had the luxury of working in some very interesting areas: large scale protein structure prediction, physics-based approaches to drug discovery, data management for all kinds of molecular profiling data, and high-scale distributed infrastructure. I also have had the fortune of meeting some of the brightest people in the world at their craft over the years. In particular, over the past couple of years, I&amp;#8217;ve met or observed some exceptionally bright people at the forefront of information retrieval and data mining. While there is a lot of naive, follow-the-latest-trend activity, there is also a lot of excitement. The web produces a lot of data, and many smart folks are trying to make sense of all that data. I am obviously biased, and can never really sto...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3845239</comments>
            <pubDate>Sun, 08 Aug 2010 18:29:47 +0100</pubDate>
            <guid isPermaLink="false">3845239</guid>        </item>
        <item>
            <title>Twenty queries</title>
            <link>http://www.medworm.com/index.php?rid=3787088&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FWYnan9rhGbA%2F</link>
            <description>I am reading a lot of Jim Gray these days, so a lot of his ideas are quite fresh in my head. Also had an interesting discussion with Nancy Parmalee on Twitter about software, informatics, bench scientists and small labs. One thing that jumped out, and is hardly a surprise, is that for the most part, there is a huge disconnect between data science and the scientists who need to make use of the work done by data scientists (often bench scientists). I&amp;#8217;ve long argued that we neglect &amp;#8220;infrastructure&amp;#8221; software like data management systems, tracking systems, query systems, etc., which all require well designed, scalable backends and should be treated like products, because they are, even if they are home grown, or derived from open source software.
Thi...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3787088</comments>
            <pubDate>Mon, 26 Jul 2010 06:02:10 +0100</pubDate>
            <guid isPermaLink="false">3787088</guid>        </item>
        <item>
            <title>Data science, roles, and barriers</title>
            <link>http://www.medworm.com/index.php?rid=3746907&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2Fjesvqq0466A%2F</link>
            <description>In keeping with my recent data science theme, Ben Lorica has a nice post up on how to nurture data scientists. While his post focusses on data scientists in commercial organizations, the post has some very relevant points for bioinformaticians.
After working in companies both large and small, it&amp;#8217;s clear to me that the strict separation of tasks is the major obstacle faced by data scientists. The most common manifestation is the separation between data analysis and data management. In many large companies, most analysts/statisticians have to wait for data from a designated data warehousing team, and in a lot of cases they wait for data from multiple owners of different data warehouses.
Neil pointed to another bit
To nurture data scientists, companies need to focus more on culture and ...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3746907</comments>
            <pubDate>Tue, 13 Jul 2010 07:01:54 +0100</pubDate>
            <guid isPermaLink="false">3746907</guid>        </item>
        <item>
            <title>Recommendation: Data-intensive text processing with MapReduce</title>
            <link>http://www.medworm.com/index.php?rid=3721904&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FJZKS2_mfuKI%2F</link>
            <description>Staying on my massive data processing theme, here is a more practical post. In the world of large scale distributed processing, the original MapReduce paper will probably hold the most important position. Hadoop remains the most well known of all the MapReduce implementations, and is now a proven, battle-tested commodity. Tom White&amp;#8217;s book is a great place to start if you have an interest in the framework itself, but the book I wanted to point out was Jimmy Lin&amp;#8217;s book on Data-Intensive Text Processing with MapReduce (there is a pre-production PDF of the book from the homepage), and it&amp;#8217;s a great dive into algorithm design. The book talks about general algorithm design, indexing, graphs and a fabulous section on expectation maximization that is a must read for bioinformaticians w...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3721904</comments>
            <pubDate>Sat, 03 Jul 2010 05:21:03 +0100</pubDate>
            <guid isPermaLink="false">3721904</guid>        </item>
        <item>
            <title>Oozie</title>
            <link>http://www.medworm.com/index.php?rid=3714365&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FCGWzRowuY50%2F</link>
            <description>I missed the talk, but at this week&amp;#8217;s Hadoop Summit, Yahoo talked about Oozie, their workflow engine for Hadoop. Oozie is open source, and allows you to manage jobs between HDFS, Pig, and MapReduce.
Oozie looks very interesting indeed. Workflows are arranged in a Directed Acyclic Graph, and you can make decisions, fork and join nodes, etc. It is the kind of workflow system that could make some bioinformatics pipelines very interesting to implement. The figure on the Oozie design page suggests one possible workflow. (Source: business|bytes|genes|molecules)</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3714365</comments>
            <pubDate>Thu, 01 Jul 2010 06:26:31 +0100</pubDate>
            <guid isPermaLink="false">3714365</guid>        </item>
        <item>
            <title>Massive data</title>
            <link>http://www.medworm.com/index.php?rid=3714366&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FAn11TA9gOs8%2F</link>
            <description>Facebook: 36 PB of uncompressed data; 2250 machines; 23,000 cores; 32 GB of RAM per machine; processing 80-90 TB/day
Yahoo: 70 PB of data in HDFS; 170 PB spread across the globe; 34000 servers; processing 3 PB per day; 120 TB flow through Hadoop every day
Twitter: 7 TB/day into HDFS
LinkedIn: 120 billion relationships; 82 Hadoop jobs daily (IIRC); 16 TB of intermediate data; 2 engineers
These are just some examples from Hadoop Summit. Many of these are production systems, others research systems. Also discussed were massive graphs (trillions of edges), insights from TBs of data ingested daily, etc. All held together by a common thread, the Hadoop ecosystem (Hadoop is a lot more now than just an implementation of MapReduce). The next time I hear life science people complain about data volumes, shared storage, etc, I ...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3714366</comments>
            <pubDate>Wed, 30 Jun 2010 14:29:23 +0100</pubDate>
            <guid isPermaLink="false">3714366</guid>        </item>
        <item>
            <title>The Biological Data Scientist</title>
            <link>http://www.medworm.com/index.php?rid=3687300&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FSzJ8NYXeI7E%2F</link>
            <description>Data has been in the news again lately. It&amp;#8217;s a data-centric world, and it seems we can&amp;#8217;t quite get enough, whether it&amp;#8217;s the Cornucopia of Corpora at The Infochimps, all the patent data that Google just unleashed, the Guardian Open Platform, or the 1000 Genomes Project (on Amazon S3). It&amp;#8217;s pretty clear that data is sexy, and to some degree overhyped (it&amp;#8217;s not quite as simple as Data &amp;#8211;&amp;gt; WIN!!!), but I, and others, clearly believe that data is important and, moreover, that easier access to data can only be a good thing.
Data is a constant theme on bbgm, but there&amp;#8217;s something I am beginning to realize more clearly. It&amp;#8217;s not about the specific implementations or technology choices we make. Those are important, but data science is ...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3687300</comments>
            <pubDate>Wed, 23 Jun 2010 06:12:44 +0100</pubDate>
            <guid isPermaLink="false">3687300</guid>        </item>
        <item>
            <title>A proposal for scientific data management</title>
            <link>http://www.medworm.com/index.php?rid=3573866&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FB5SkvTRKfu0%2F</link>
            <description>One of the best posts ever on data management in the sciences comes to us via Titus Brown. The data management plan he proposes is one that is battle tested in many laboratories across the world. 
Neil captures the actionable items from the above blog post 
I am still laughing. I should be crying.
Related articles by Zemanta

Summer Course: Analyzing Next-Generation Sequencing Data (softwarecarpentry.wordpress.com)
BEACON Funded! (softwarecarpentry.wordpress.com) (Source: business|bytes|genes|molecules)</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3573866</comments>
            <pubDate>Tue, 18 May 2010 04:18:58 +0100</pubDate>
            <guid isPermaLink="false">3573866</guid>        </item>
        <item>
            <title>Utopia?</title>
            <link>http://www.medworm.com/index.php?rid=3556282&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FOX2IN5IiNdM%2F</link>
            <description>I recently wondered if there would be a time that scientific data APIs would get the kind of developer excitement that the Twitter API gets. To some extent that is wishful thinking. After all, the type of data you get via Twitter and the type of data you get from next-gen sequencing are quite different and require a different level of expertise and understanding. But I do believe that smart developers, especially those with a data mining bent, can learn enough biology to really help and extend the field. 
I&amp;#8217;ve been in enough meetings recently where there has been quite a bit of contention over roles in the life sciences. Some of it is semantics, but a lot of it is real. We have algorithm developers, bioinformaticians, biologists, software developers and perhaps roles that I am forgetting,...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3556282</comments>
            <pubDate>Tue, 11 May 2010 22:33:50 +0100</pubDate>
            <guid isPermaLink="false">3556282</guid>        </item>
        <item>
            <title>Reiterating the need for a data commons</title>
            <link>http://www.medworm.com/index.php?rid=3545571&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FwvJSg_qPkWo%2F</link>
            <description>Just read something that really gels with why I believe strongly in open data in science. The following paragraph comes from Effect Measure
As an epidemiologist it can take me years of hard work to collect data. I want to use that data and reap its benefits, both for public health and for me personally and my students and post docs. That doesn&amp;#8217;t mean I get to hoard them. It means that I have to use them in a timely way. I have an advantage over everyone else because I know the data better than they do and I have it before they do. But I don&amp;#8217;t have any ownership rights over it. If someone else can use my work, that&amp;#8217;s what science is all about. Making it available and accessible should be part of the culture of my discipline. It isn&amp;#8217;t, sad to say. But what should also...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3545571</comments>
            <pubDate>Fri, 07 May 2010 18:24:24 +0100</pubDate>
            <guid isPermaLink="false">3545571</guid>        </item>
        <item>
            <title>Data-driven research products</title>
            <link>http://www.medworm.com/index.php?rid=3526896&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2Feh2LjAImkOU%2F</link>
            <description>Bradford Cross writes about datasets and data-driven startups. The entire post is full of nuggets, but one bit jumped out at me. Brad writes
Data preprocessing, transformation, and systems engineering are normally the bulk of the work for data and research driven problems &amp;#8211; all the more so when you are collecting data from disparate sources rather than using your own internal data
When I wrote about Atul Butte&amp;#8217;s talk at Sage Congress, this was where I was coming from. Bioinformaticians spend a lot of time dealing with data, and with the transformations, etc. needed to work with data coming from different sources. When you have a lot of publicly available data around, you have to be very good at the data handling and systems engineering. But once you overcome those barriers, you can star...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3526896</comments>
            <pubDate>Sun, 02 May 2010 23:39:06 +0100</pubDate>
            <guid isPermaLink="false">3526896</guid>        </item>
        <item>
            <title>Cassandra replication and consistency</title>
            <link>http://www.medworm.com/index.php?rid=3519624&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FAeGRxNfbGnM%2F</link>
            <description>The other day, I had the chance to see a great talk by Benjamin Black on how Cassandra handles replication and consistency. Slides by themselves do not do the talk justice, especially as there was a lot of great Q&amp;A as well, but I think you&amp;#8217;ll get a sense of how a good partition tolerant distributed system is set up.
Introduction to Cassandra: Replication and Consistency
View more presentations from benjaminblack.

Related articles by Zemanta

Scaling Twitter with Cassandra (slideshare.net)
Cassandra reading list (spyced.blogspot.com) (Source: business|bytes|genes|molecules)</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3519624</comments>
            <pubDate>Fri, 30 Apr 2010 13:31:22 +0100</pubDate>
            <guid isPermaLink="false">3519624</guid>        </item>
        <item>
            <title>Abstractions</title>
            <link>http://www.medworm.com/index.php?rid=3505071&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2F-qLG_UxCPrY%2F</link>
            <description>I can&amp;#8217;t quite put my finger on it, but something is amiss. On the other hand, something tells me that we are closer to an idea of a world with tools and components that can be assembled together by smart people in various ways. You could use something like GenePattern or Galaxy as a framework to embed these tools, or use Pipeline Pilot or Taverna. To build good science data platforms, we need to leverage abstractions. What is key is making sure that every layer of abstraction can successfully read and write from the one below and with other entities in the same layer. You have the algorithm developers, the platforms, the APIs and eventually the applications and analysis tools. You need a rich ecosystem of algorithm developers, data scientists (aka bioinformaticians) and software deve...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3505071</comments>
            <pubDate>Mon, 26 Apr 2010 13:00:52 +0100</pubDate>
            <guid isPermaLink="false">3505071</guid>        </item>
        <item>
            <title>We have the data</title>
            <link>http://www.medworm.com/index.php?rid=3502928&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FjqaFjwIK1Z0%2F</link>
            <description>At the Sage Congress, one of my favorite talks was one that Atul Butte gave on using publicly-available data. I have long thought that actually performing microarray gene expression experiments would go away, since there will be sufficient compendia and public data available that can be used for doing all kinds of useful science. Atul&amp;#8217;s talk drove that point home with some authority. His premise was that there is a lot of public data out there and while it may not always be perfect, smart people can use this data to do a lot of interesting things, such as identifying data-driven candidate genes. In other words, use the data to find candidates and then drill down into the science. His other example was work by Joel Dudley (who happens to be sitting next to me as I type this), creating a ...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3502928</comments>
            <pubDate>Sat, 24 Apr 2010 19:19:39 +0100</pubDate>
            <guid isPermaLink="false">3502928</guid>        </item>
        <item>
            <title>How not to build databases for biology</title>
            <link>http://www.medworm.com/index.php?rid=3467953&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2F3DvNX9dRz3w%2F</link>
            <description>Maria Hodges has a fantastic post about building (bad) biological databases, a must read. The only point I might have a little nit about is Tip #5, Totally trust your automated systems. Only a little one, because biological data does often need some curation due to the nature of the beast, but I would argue that some of the largest data systems in the world are completely, or near completely, automated, so it&amp;#8217;s possible. (Source: business|bytes|genes|molecules)</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3467953</comments>
            <pubDate>Wed, 14 Apr 2010 02:43:08 +0100</pubDate>
            <guid isPermaLink="false">3467953</guid>        </item>
        <item>
            <title>Moving on from what’s comfortable</title>
            <link>http://www.medworm.com/index.php?rid=3454100&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FzBI7kMzsSc0%2F</link>
            <description>I had a chance to hear Ed Lazowska talk at the Microsoft Cloud Futures 2010 event (I followed Ed, always a tough act). Ed talked about data driven science, i.e. science driven more by data than just by compute cycles. I come from a world of simulation-oriented science, and while I still love simulations, I&amp;#8217;ll admit that data-driven science is more broadly applicable and is going to drive science for the immediate future. That is why, even in 100% academic research environments, good software development and data management are going to be completely critical. Flat files and Excel are not going to cut it. I&amp;#8217;ve heard the argument that biologists are always going to use Excel and that came up at the NIH meeting I was at last week as well, but IMO that&amp;#8217;...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3454100</comments>
            <pubDate>Fri, 09 Apr 2010 05:32:11 +0100</pubDate>
            <guid isPermaLink="false">3454100</guid>        </item>
        <item>
            <title>Freebase Gridworks: The data curation tool</title>
            <link>http://www.medworm.com/index.php?rid=3412549&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FJ23P0DOEMpI%2F</link>
            <description>I found out about Freebase Gridworks through a post by Jon Udell. In the post, Jon refers to two screencasts on this yet unreleased product. In the Freebase blog post, they quote the announcement from the mailing list. The important bits
We at Metaweb strongly believe that Freebase can be helpful not only as a giant repository of heavily curated and interconnected data but also as a way to help people cleanup and integrate their own datasets by aligning their data with a shared substrate.
Jon adds to this after seeing the screencasts
As the open data juggernaut picks up steam, a lot of folks are going to discover what some of us have known all along. Much of the data that’s lying around is a mess. That’s partly because nobody has ever really looked at it. As a n...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3412549</comments>
            <pubDate>Sun, 28 Mar 2010 14:00:25 +0100</pubDate>
            <guid isPermaLink="false">3412549</guid>        </item>
        <item>
            <title>Jealous of Geo (no not gene expression)</title>
            <link>http://www.medworm.com/index.php?rid=3387000&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FbkcTTVT2m5U%2F</link>
            <description>In my day job, I get to see a lot of innovative geo-related software and services, and the O&amp;#8217;Reilly Radar does a great job of tracking innovations in this space. SimpleGeo, WeoGeo, ESRI, Loki, Cloudmade, Quantum GIS, GeoCommons, etc are just some examples of companies/organizations/open source projects doing very interesting things around geospatial data of all kinds. There are a number of good open source efforts around geo-data and visualization, and I am almost certain I am missing a ton. These toolkits allow people to do interesting things. 
So where am I going with this? Somehow there seems to be a lack of similar interesting things with scientific data. Admittedly that is a gross generalization, but outside of things like Rich Apodaca&amp;#8217;s many project...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3387000</comments>
            <pubDate>Sun, 21 Mar 2010 06:50:25 +0100</pubDate>
            <guid isPermaLink="false">3387000</guid>        </item>
        <item>
            <title>Data democratized</title>
            <link>http://www.medworm.com/index.php?rid=3374317&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2Fc1De80F8vSM%2F</link>
            <description>In a brilliant piece entitled Big Data Is Less About Size, And More About Freedom, Bradford Cross talks about the democratization of analyzing data at scale. As he so correctly points out, the data age has a lot to do with the cool things we can do with data today. Yes, data sizes are getting large, but large is relative. I heard numbers today that make the output from many genomics centers sound like a walk in the park, but for the average lab, the average startup, increasing amounts of data are still only in the range of terabytes, not petabytes as some of us (like yours truly) like to talk about.
Brad talks about trends in computing and software that have allowed data-driven companies like Flightcaster to get to market faster. He breaks down these trends into three chunks

Storing ...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3374317</comments>
            <pubDate>Wed, 17 Mar 2010 12:00:40 +0100</pubDate>
            <guid isPermaLink="false">3374317</guid>        </item>
        <item>
            <title>Don’t move that data</title>
            <link>http://www.medworm.com/index.php?rid=3370603&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FTkV3ISv4aFk%2F</link>
            <description>Times change. Last week I was at a local science event and the speaker talked about their data being in Seattle and their compute literally being diagonally across the country in Florida (something that sort of happened for various reasons). That is quite the distance for data to travel. It&amp;#8217;s even more for a lot of data to travel. When asked about solutions to that problem, my answer was &amp;#8220;don&amp;#8217;t move the data&amp;#8221;. Well, it&amp;#8217;s true. Even with companies out there that help you move large quantities of data, the only good solution for data at this scale is to keep the data in one place and move the compute around. Cheaper, more efficient, and a better use of the network.
IMO, the days of moving data sets over the wire are long gone. You can move slices a...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3370603</comments>
            <pubDate>Tue, 16 Mar 2010 12:00:35 +0100</pubDate>
            <guid isPermaLink="false">3370603</guid>        </item>
        <item>
            <title>The sequencing market is beginning to shape out</title>
            <link>http://www.medworm.com/index.php?rid=3363772&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2F2zc50UROtzQ%2F</link>
            <description>Dan Koboldt has a great post on the state of sequencing in 2010 (can we drop &amp;#8220;next-gen&amp;#8221; now?), and beyond I guess. It&amp;#8217;s certainly getting crowded out there, and it did look like the major players were essentially fighting for the same space and share of the market, but based on what Dan says, that seems to be changing. I should add that I am not in the trenches, and my interests lie on the data management, analysis and infrastructure side of things, so I can&amp;#8217;t comment on individual technologies per se. 
It&amp;#8217;s interesting to see how various players seem to be positioning themselves, although where folks end up and who survives will depend on all kinds of factors. The scientific market is fickle and quite honestly, the factors that define success are not alway...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3363772</comments>
            <pubDate>Sat, 13 Mar 2010 21:19:29 +0100</pubDate>
            <guid isPermaLink="false">3363772</guid>        </item>
        <item>
            <title>The distributed web of data – messaging included</title>
            <link>http://www.medworm.com/index.php?rid=3248663&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FTvnl_7QU6hU%2F</link>
            <description>I&amp;#8217;ve written about the distributed self and science data platforms. A lot of the former was around the notion of pubsub, and pushing data to various places. Now imagine a scenario where you are using data from a variety of scientific repositories and you&amp;#8217;ve built applications that use APIs to collect data. What if your data sources would update you every time there was a change, so that your systems could automatically fetch any updates, rebuild anything that needed to be rebuilt, and do any pre-computing that needed to be done? The model that Anil Dash talked about in his classic Push-Button Web post is relevant here as well.

We have the tools to do this today. Real time, asynchronous messaging is part of distributed computing, and the variety of data repositories out there sho...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3248663</comments>
            <pubDate>Sat, 06 Feb 2010 23:05:03 +0100</pubDate>
            <guid isPermaLink="false">3248663</guid>        </item>
        <item>
            <title>The new javascript Map/Reduce in Riak</title>
            <link>http://www.medworm.com/index.php?rid=3239750&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2F__bk0ve1EoE%2F</link>
            <description>An Introduction to JavaScript Map/Reduce in Riak from Basho Technologies on Vimeo.
Riak is a non-relational datastore with a cool API and nifty Map/Reduce features. The new features in version 0.8 are described here (Source: business|bytes|genes|molecules)</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3239750</comments>
            <pubDate>Thu, 04 Feb 2010 04:09:22 +0100</pubDate>
            <guid isPermaLink="false">3239750</guid>        </item>
        <item>
            <title>Video: Building a data intensive application with Hadoop and Hive</title>
            <link>http://www.medworm.com/index.php?rid=3163977&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FhPNIq8hQG5Q%2F</link>
            <description>I have written about TrendingTopics before. Pete Skomoroch gave a talk on how to build a data intensive web app using Hadoop, Hive and Amazon EC2 at Hadoopworld and the video is now available

Building Data Intensive Apps with Hadoop and EC2 from Cloudera on Vimeo.
Please see this disclaimer (Source: business|bytes|genes|molecules)</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3163977</comments>
            <pubDate>Tue, 12 Jan 2010 02:52:56 +0100</pubDate>
            <guid isPermaLink="false">3163977</guid>        </item>
        <item>
            <title>To handle lots of data, we need to think differently</title>
            <link>http://www.medworm.com/index.php?rid=3157623&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FYk77IaRONdw%2F</link>
            <description>In a recent editorial (sub might be required) talking about next-gen sequencing and cloud computing, Nature Biotech makes an all too familiar error.

	It remains unclear, however, whether the cost of routinely renting time on the cloud would be cost effective in the long term, particularly if a user intends to analyze billions of base pairs of genome sequence on a regular basis. What&amp;#8217;s more, if the wide uptake of sequence analysis on clouds depends on the availability of user-friendly, debugged software, bioinformaticians might not be willing to spend the time to familiarize themselves with hadoop, the open source program needed to process large data sets on a cloud—especially when their jobs focus on developing algorithms for their own local computer clusters.

The context for that...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3157623</comments>
            <pubDate>Sat, 09 Jan 2010 17:09:35 +0100</pubDate>
            <guid isPermaLink="false">3157623</guid>        </item>
        <item>
            <title>In science, data is nothing without purpose</title>
            <link>http://www.medworm.com/index.php?rid=3156614&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FqmRZhyipCJ0%2F</link>
            <description>In an article on TechFlash, a VC, talking about trends in 2010, had this to say while talking about increased IT needs in cleantech and biotech

	Both areas are generating terabytes of data and it is no longer just about science &amp;#8212; it is about digesting mountains of data.

For some reason that statement scared me. Digesting mountains of data is all about the science. If we forget that, we are in big trouble. Yes, from a pure technology perspective it is about digesting mountains of data, but (a) that has to be looked at in the context of science (sense-making?), and (b) the digesting is a necessary pre-requisite to getting to the science. You really don&amp;#8217;t have much of a choice, but if you forget about the science, you will end up with noise, a whole lot of it. 
My advice to all ...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3156614</comments>
            <pubDate>Sat, 09 Jan 2010 04:16:39 +0100</pubDate>
            <guid isPermaLink="false">3156614</guid>        </item>
        <item>
            <title>More musings on MapReduce and bioinformatics</title>
            <link>http://www.medworm.com/index.php?rid=3126747&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FK2eBlK77BAc%2F</link>
            <description>Jeff Dean and Sanjay Ghemawat have an updated MapReduce paper (doi) in the Communications of the ACM. The paper is a pretty strong rebuttal to some claims by Mike Stonebraker and others on the value of the MapReduce model. I am going to let you read the paper (as well as the original papers). What I wanted to talk about were some of the key aspects of the MapReduce model and how this way of thinking is relevant to the life sciences.
The first point that Dean and Ghemawat talk about is heterogeneous systems. The way I see it, the entire field of bioinformatics is full of heterogeneous systems. Even data we generate in internal systems needs to be combined with data from other systems. In fact, I am pretty sure that as we improve delivery models and APIs for life science data resources, we wil...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3126747</comments>
            <pubDate>Tue, 29 Dec 2009 06:15:08 +0100</pubDate>
            <guid isPermaLink="false">3126747</guid>        </item>
        <item>
            <title>Bioinformatics and mythology.  You still need to manage the data</title>
            <link>http://www.medworm.com/index.php?rid=3075709&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2F4w3p-VKmQbg%2F</link>
            <description>In a great blog post at Code for Life, Grant Jacobs writes
By contrast, early bioinformatics work was almost invariably founded on biological concepts from the onset. A biological issue was raised and then a technique to address that issue was presented. That is, theoretical biology was the foundation on which [early] bioinformatics was built. I fear this is being lost in the mass-data and technology-hype driven bioinformatics. It seems to me that unless companies and research groups are careful many will waste time and money “stamp collecting and cataloging”. Certainly the organized data is useful, but only if it is applied with biological principles
Grant writes this in the context of the early days of bioinformatics, a time when there was a lot of the...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3075709</comments>
            <pubDate>Thu, 10 Dec 2009 03:43:17 +0100</pubDate>
            <guid isPermaLink="false">3075709</guid>        </item>
        <item>
            <title>Data platforms for science – From data to work</title>
            <link>http://www.medworm.com/index.php?rid=3035994&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2F-Qsax7tnpv8%2F</link>
            <description>At SC09, in my Systems Biology talk, I spoke about platforms for data. The idea is hardly original, since I&amp;#8217;ve written about this before, and my ideas borrow heavily from Jeff Hammerbacher and Matt Wood among others. But I wanted to add some more meat to it in writing.
Today we live in a world where we generate data from instruments, various experiments or simulations. These data can be used to provide us insights, and we want to add these insights to our data, capture those insights in the context of the data they represent and then keep track of the data and metadata for future changes. We do this in a world where data is generated by different people, different people care about different pieces of the follow-on insights and information, and perhaps a third set try and put this all...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3035994</comments>
            <pubDate>Sat, 28 Nov 2009 00:12:58 +0100</pubDate>
            <guid isPermaLink="false">3035994</guid>        </item>
        <item>
            <title>Talks from SC09</title>
            <link>http://www.medworm.com/index.php?rid=3012564&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2Fq88HQWLeZ2I%2F</link>
            <description>Up on slideshare
Talk given at &amp;quot;Cloud Computing for Systems Biology&amp;quot; workshop
View more documents from Deepak Singh.

Masterworks talk on Big Data and the implications of petascale science
View more documents from Deepak Singh.

All talks can be found here (Source: business|bytes|genes|molecules)</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=3012564</comments>
            <pubDate>Thu, 19 Nov 2009 16:53:19 +0100</pubDate>
            <guid isPermaLink="false">3012564</guid>        </item>
        <item>
            <title>Matt’s manifesto for a science data platform</title>
            <link>http://www.medworm.com/index.php?rid=2939486&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2F0G7cq8m61qs%2F</link>
            <description>There are a select few people whose every word I try and absorb and chew on because I have great respect for their thinking and intelligence. Matt Wood is one of those people, and today he decided to tweet a manifesto. The whole series started with
I&amp;#8217;m starting a manifesto. There are no technical, political or funding reasons why an open data platform for science couldn&amp;#8217;t excel
He then followed that up with five tweets (Matt&amp;#8217;s Twitter stream). I don&amp;#8217;t know if that&amp;#8217;s the entire manifesto, but I reproduce those tweets below, a series entitled Towards a science data platform

Easy, flexible retrieval and reuse above all else
A laser sharp focus on scientific productivity and progress
Scalability and speed are not mutually exclusive
Well designed, high quality pro...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=2939486</comments>
            <pubDate>Thu, 29 Oct 2009 03:37:13 +0100</pubDate>
            <guid isPermaLink="false">2939486</guid>        </item>
        <item>
            <title>When HPC will not be the HPC you remember</title>
            <link>http://www.medworm.com/index.php?rid=2836304&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FkYErcJwifaI%2F</link>
            <description>Just read the transcript of what sounded like an excellent talk by Greg Pfister about the next 20 years of HPC. Here are some of the key points of his talk

Computing will become cheaper, but not necessarily much faster per processor
There will be democratization of at least some HPC. In other words with faster processors and accelerators, we might all have access to some sort of TeraFLOPS device
Computing will be done all over the place, with a lot being done in the cloud. I am not quite sure I get what Greg was aiming at with his section on garbage computing, but my guess is that the cycles we&amp;#8217;ll consume might not be the highest quality cycles but they&amp;#8217;ll get the job done
You will be billed by how much power and bandwidth your computation consumes, not ...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=2836304</comments>
            <pubDate>Sun, 27 Sep 2009 03:24:36 +0100</pubDate>
            <guid isPermaLink="false">2836304</guid>        </item>
        <item>
            <title>Invite codes for Infochimps.org</title>
            <link>http://www.medworm.com/index.php?rid=2824367&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FZ9-LJglzYaU%2F</link>
            <description>Today I got a chance to see a short pitch by Dhruv Bansal from Infochimps announcing the launch of &amp;#8220;the world&amp;#8217;s largest open platform for data&amp;#8221;. I have talked about the Infochimps here before, but this announcement launches them as a marketplace for data, and they have kindly given me 50 invite codes. So if you use the code &amp;#8220;bigbiodata&amp;#8221; you can sign up for a beta account as well.  Let&amp;#8217;s get our data out there.
Since some of the data sets on AWS Public Data Sets come from the good folk at the Infochimps, please read this disclaimer (Sou...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=2824367</comments>
            <pubDate>Wed, 23 Sep 2009 02:27:38 +0100</pubDate>
            <guid isPermaLink="false">2824367</guid>        </item>
        <item>
            <title>Modern computing paradigms and the life sciences</title>
            <link>http://www.medworm.com/index.php?rid=2809848&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FruD3vCSXd08%2F</link>
            <description>Over the next few months, I&amp;#8217;ll be giving a bunch of talks about large scale data. I will be talking at Hadoop World about &amp;#8220;Hadoop for Bioinformatics&amp;#8221;, at a Cloud Computing for Hedge Funds event and at Supercomputing. Thinking through what I want to cover at all these talks has my brain in overdrive these days. At the same time, discussions with various people facing data-related challenges provide a reality check and remind me that there is still a long way to go.
The one concept that people need to start wrapping their heads around is the relative location of compute and data. Many people still think along the &amp;#8220;move data to the compute&amp;#8221; paradigm. When our data sets were small, this was not a problem. As instruments provide more and more data, ever faster, that tends...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=2809848</comments>
            <pubDate>Sat, 19 Sep 2009 03:33:17 +0100</pubDate>
            <guid isPermaLink="false">2809848</guid>        </item>
        <item>
            <title>Tagging, context and … data</title>
            <link>http://www.medworm.com/index.php?rid=2768794&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FIqppyG9Vc3M%2F</link>
            <description>I subscribe to a mailing list of Foo Camp alumni, and there was a question around tagging: systems where tags are added by the creator vs. systems where the tags are created by the consumer. Joshua Schachter pointed out that delicious&amp;#8217; core paradigm is to have someone other than the content creator do the tagging. That got me thinking about intent and something a former colleague was always talking about: &amp;#8220;your point of view&amp;#8221;. The creator&amp;#8217;s intent and the consumer&amp;#8217;s interpretation or context may not be the same. So the tags you might use might be different from those the content creator, or some other consumer, chooses. Not just because of a different tag convention, but because you have a different context from the creator or other consumer.
You can exten...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=2768794</comments>
            <pubDate>Sat, 05 Sep 2009 03:55:53 +0100</pubDate>
            <guid isPermaLink="false">2768794</guid>        </item>
        <item>
            <title>Speculative Execution in Hadoop</title>
            <link>http://www.medworm.com/index.php?rid=2762089&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FLEO5gXFx4L0%2F</link>
            <description>Disclaimer: Newbie post ahead
One of the more fascinating aspects of Hadoop is speculative execution. In many bioinformatics setups, especially those using Sun Grid Engine, LSF, etc., some custom logic examines your available resources and the size of your input, chunks up your data appropriately, and makes each chunk available to the nodes that will compute on it. In most of the implementations that I am aware of, this is done using a shared filesystem, often an NFS server. More recently, cluster file systems have become more popular for their improved availability characteristics. But in most pipelines, job completion is analyzed post-job and you re-run any failed job. When you have a few hundred GB of data and a few hundred jobs that don...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=2762089</comments>
            <pubDate>Wed, 02 Sep 2009 22:34:49 +0100</pubDate>
            <guid isPermaLink="false">2762089</guid>        </item>
        <item>
            <title>Is XML bad for big data?</title>
            <link>http://www.medworm.com/index.php?rid=2727346&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FLEcPpN_E_AI%2F</link>
            <description>Mike Driscoll continues his attack against XML for Big Data. He points out three reasons why XML and Big Data are strange bedfellows.

XML spawns data bureaucracy, which is why JSON exists
Size matters and XML is not exactly concise
XML is complex and has a cost

One of my problems with XML, and this is from someone who loves markup, has always been that it is used in ways it was never intended to be, or at least I hope not. It is a representational format for documents, but ended up becoming the format for all kinds of data standards and worst of all, data transport. He proposes some rules

Don&amp;#8217;t invent new formats. I think Hari will wholeheartedly agree with this one. This, in particular, is the bane of science. We invent new formats all the time and sometime...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=2727346</comments>
            <pubDate>Sun, 23 Aug 2009 07:31:58 +0100</pubDate>
            <guid isPermaLink="false">2727346</guid>        </item>
        <item>
            <title>Data is not document centric</title>
            <link>http://www.medworm.com/index.php?rid=2725176&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FMtvX7WgdIFk%2F</link>
            <description>In recent days I have talked a lot about Big Data and data streams on the web, especially for science. The good news is that there is a lot of good material for inspiration out there these days. The latest source of inspiration is a blog post by Mike Driscoll at Dataspora. In The rise of the data web, Mike writes that through our various frameworks we have industrialized the creation of hypertext (as an aside, I am playing with Jekyll these days and it rocks) as well as the collection of data, making human data entry much less common. He also points out that while the web we see will be dominated by documents, it is the web we can&amp;#39;t see that is surging with data. This is the data web that fascinates me and many others, a web of data streams which we can orchestrate, slice and dice ...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=2725176</comments>
            <pubDate>Fri, 21 Aug 2009 06:25:05 +0100</pubDate>
            <guid isPermaLink="false">2725176</guid>        </item>
        <item>
            <title>Rajarshi Guha on Crunching Molecules and Numbers in R</title>
            <link>http://www.medworm.com/index.php?rid=2712295&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FTBdxzpwxRBU%2F</link>
            <description>http://www.slideshare.net/rguha/crunching-molecules-and-numbers-in-r
Rajarshi talks about how R can be a powerful tool for cheminformatics and discusses some of the pros and cons.
Is this the first Cheminformatics talk at an ACS National Meeting to bring up Hadoop? Probably not, but good to see it there (slide 35). Saptarshi Guha&amp;#39;s RHIPE also gets a mention.


(Source: business|bytes|genes|molecules)</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=2712295</comments>
            <pubDate>Wed, 19 Aug 2009 07:24:29 +0100</pubDate>
            <guid isPermaLink="false">2712295</guid>        </item>
        <item>
            <title>Wholesale data</title>
            <link>http://www.medworm.com/index.php?rid=2702462&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FDXP3R43gRww%2F</link>
            <description>There is a fascinating post on the Sunlight Labs blog arguing that the Government should be a wholesaler of data. They make some very compelling arguments, and I pretty much agree with them. If I had half their eloquence, I would have made a similar argument about scientific data some time ago. But I am going to shamelessly take the ideas in the Sunlight post.
Leaving aside some of the questions around what constitutes raw data, many of us have argued that raw scientific data should be made available so that different scientists can look at the data with their own lens and either verify claims, uncover new scientific findings, or provide interfaces that allow others to query the data in all kinds of ways. While we do have ftp servers for genomic data at organizations like the NCBI, it is c...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=2702462</comments>
            <pubDate>Sat, 15 Aug 2009 01:15:21 +0100</pubDate>
            <guid isPermaLink="false">2702462</guid>        </item>
        <item>
            <title>Making sense of all that data: Integrating and extracting information from dataspaces</title>
            <link>http://www.medworm.com/index.php?rid=2667623&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2Fhw9q25p_DmU%2F</link>
            <description>I have written previously about Trendingtopics.org as a reference site for data analytics using Hadoop and Hive.
Pete Skomoroch, who developed the site, has written a great follow-up article on the Cloudera blog that anyone in the scientific informatics space needs to read. Those same people need to read Chapter 5 in Beautiful Data. There Jeff Hammerbacher writes about the role of the Data Scientist at Facebook.
So why should people in the scientific community read those two resources? Big Data is now a fact of life, both in science and otherwise. Our systems, from both the infrastructure and computation standpoints, were written for data sets of different sizes. At the same time we are moving towards a world where the need to combine data resources is becoming even more necessary than in the...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=2667623</comments>
            <pubDate>Mon, 03 Aug 2009 23:28:57 +0100</pubDate>
            <guid isPermaLink="false">2667623</guid>        </item>
        <item>
            <title>Petaflops meet Petabytes</title>
            <link>http://www.medworm.com/index.php?rid=2657838&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FzxgAjC8LQCY%2F</link>
            <description>Dan Reed has written a lot of interesting posts/essays lately, many of which have been covered here. He comes from a world where compute horsepower is king, admittedly a world I cared about for a long time, and still do to this day. But many of us today live in a different world, where the focus on computing isn&amp;#8217;t high performance, but data intensive. So it&amp;#8217;s interesting when in a recent blog post, Dan writes
One of the major lessons from web search and cloud data centers is the power of truly massive scale, near real-time data analysis. When anyone with a cheap cell phone and a web browser can extract data and insights from a non-trivial fraction of the human knowledge base, behavior and culture are transformed. I would like to believe that we can bring ...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=2657838</comments>
            <pubDate>Fri, 31 Jul 2009 05:12:23 +0100</pubDate>
            <guid isPermaLink="false">2657838</guid>        </item>
        <item>
            <title>Lots of data and the network</title>
            <link>http://www.medworm.com/index.php?rid=2649206&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FdkywvyChkEo%2F</link>
            <description>Big Data and the networked future of science: not only is that the title of my talk at Ignite Seattle 7 next week, but it was also the working title of my talk at VA Tech. I strongly believe that the future of the life sciences will include not only large data sets, but the need to merge and co-analyze diverse data sources. The size and complexity of these projects and the skills required mean we need to think about new ways of addressing these data volumes. These challenges are being recognized across the board. In the abstract of a recent paper on NetSolve/D, the authors (including Jack Dongarra) write:
The persistent mood of exhilaration in the research community over exponential increases in the capacity of computational resources has been tempered recently by the realization that a torrential...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=2649206</comments>
            <pubDate>Wed, 29 Jul 2009 03:44:44 +0100</pubDate>
            <guid isPermaLink="false">2649206</guid>        </item>
        <item>
            <title>The new data engines</title>
            <link>http://www.medworm.com/index.php?rid=2630315&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FhFJFPSuH1HQ%2F</link>
            <description>Been thinking a lot about data, not least because I have to start thinking about my talk at Supercomputing. There have also been a number of recent meetings, podcasts, blog posts, etc. that have me thinking about data, and managing data, again. I won&amp;#8217;t talk about the specifics from the meetings I&amp;#8217;ve been in, but a lot of the discussion and thinking has been around large quantities of data, ranging from unstructured data to highly structured data, and how we can analyze them more efficiently.
The one thing I can talk about is this blog post describing the release of HadoopDB. HadoopDB is a new stack that combines PostgreSQL, Hadoop and Hive, essentially combining MapReduce with DBMS technologies, specifically targeted at the analytics crowd. I won&amp;#8217;t talk about pluses and...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=2630315</comments>
            <pubDate>Thu, 23 Jul 2009 06:28:10 +0100</pubDate>
            <guid isPermaLink="false">2630315</guid>        </item>
        <item>
            <title>All data are not the same</title>
            <link>http://www.medworm.com/index.php?rid=2602164&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FTrpbcbCi6sQ%2F</link>
            <description>Greg Linden (super smart fellow by all accounts) has a really nice article on the ACM blog. In the article, he writes about machine learning and data in the context of the Netflix prize. He concludes his article with the following:
There are a lot of lessons that can be taken from the Netflix contest, but a big one should be the importance of constant experimentation and learning. By competing algorithms against each other, by looking carefully at the data, by thinking about what people want and why they do what they do, and by continuous testing and experimentation, you can reap big gains.
Data is peculiar. Throwing off-the-shelf algorithms at data gets you only so far, and usually gives you reasonable results, but in the end you really need to understand your data and its peculia...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=2602164</comments>
            <pubDate>Wed, 15 Jul 2009 01:55:41 +0100</pubDate>
            <guid isPermaLink="false">2602164</guid>        </item>
        <item>
            <title>TrendingTopics.org: A reference site for data analytics in Hadoop and Hive</title>
            <link>http://www.medworm.com/index.php?rid=2469824&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FWCFA7hOvCok%2F</link>
            <description>In episode 21 of Coast to Coast Bio (not yet released) I talk about Hive. For those who may not know, Hive is a data warehouse infrastructure built on top of Hadoop.
One of the most recent Amazon Public Data Sets is a sample of Wikipedia page statistics by Peter Skomoroch. The full data set powers trendingtopics.org.
What is TrendingTopics?
This site was built by Data Wrangling to demonstrate how Hadoop can power a simple data driven website. The trend statistics and time series data that run the site are updated periodically by launching a temporary EC2 cluster running the Cloudera Hadoop Distribution. Our initial seed data includes the content of wikipedia and hourly article traffic logs from the wikipedia squid proxy collected by Domas Mituzas.
Why do I like this so much? Apart fr...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=2469824</comments>
            <pubDate>Wed, 10 Jun 2009 14:50:32 +0100</pubDate>
            <guid isPermaLink="false">2469824</guid>        </item>
        <item>
            <title>The future of big compute for big science</title>
            <link>http://www.medworm.com/index.php?rid=2349284&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FGS_LscZ1E5A%2F</link>
            <description>As readers of bbgm know, one of the subjects that interests me the most is computing. I am hardly a guru, but I&amp;#8217;ve been around long enough and close enough to the world of computing to notice and observe various trends in this space. Perhaps the most recent trend, and apparently a new phrase in the computing lexicon, is data intensive computing, something that life scientists should be very cognizant of today, as I talk about in Science Big, Science Connected. This interest in data intensive computing and recent exposure to data centers and large scale operations has led me to pay even more attention to some blogs I&amp;#8217;ve been following for a while. Specifically, two blogs that I read religiously are those of James Hamilton and Dan Reed.
James write...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=2349284</comments>
            <pubDate>Mon, 13 Apr 2009 04:49:27 +0100</pubDate>
            <guid isPermaLink="false">2349284</guid>        </item>
        <item>
            <title>Bursting on to a cloud</title>
            <link>http://www.medworm.com/index.php?rid=2323809&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FZpMYXbLqoYI%2F</link>
            <description>OK, cheesy title, but this one pleases me at multiple levels. It was work done on EC2. It is one of the first examples of a MapReduce implementation of something many people will find useful in the world of bioinformatics. And it is one of the sample apps for Elastic MapReduce.
Now, it&amp;#8217;s a peer-reviewed paper. What is CloudBurst?
CloudBurst is a new parallel read-mapping algorithm optimized for mapping next-generation sequence data to the human genome and other reference genomes, for use in a variety of biological analyses including SNP discovery, genotyping, and personal genomics. It is modeled after the short read mapping program RMAP, and reports either all alignments or the unambiguous best alignment for each read with any number of mismatches or differences.

Yep, CloudBurst is ...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=2323809</comments>
            <pubDate>Thu, 09 Apr 2009 04:58:41 +0100</pubDate>
            <guid isPermaLink="false">2323809</guid>        </item>
        <item>
            <title>Data produced, analyzed and consumed.  The impact of big science</title>
            <link>http://www.medworm.com/index.php?rid=2323817&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FkgXLtKnKAgg%2F</link>
            <description>When genome centers have to start thinking about large scale data center operations you know something is different. In Science Big, Science Connected, I talked about how the availability of high throughput instruments has fundamentally changed our approach to science. On Coast to Coast Bio, Hari and I often argue about whether this is for the better (I like big science, he isn&amp;#8217;t as fond of it). In the end, those differences boil down to funding priorities.

The fact remains that today we are moving towards a clear separation between data producers, data consumers and methods developers. There was a time that a small group of people could cover all that ground, but with the industrialization of data production (microarrays are already there, mass specs and sequencers not quite yet), ...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=2323817</comments>
            <pubDate>Wed, 01 Apr 2009 03:06:50 +0100</pubDate>
            <guid isPermaLink="false">2323817</guid>        </item>
        <item>
            <title>Crunch that data</title>
            <link>http://www.medworm.com/index.php?rid=2232814&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2Fjz9AxkXgJeU%2F</link>
            <description>Disclaimer: This relates to my day job, so please see the standard disclaimer
I love a good data set, love open data even more, and open data that can be computed upon most of all. Loads of new data sets on Public Data Sets on AWS today, including a huge amount of Genbank data. Also included is one of my favorite data sources, Freebase: both the Freebase dump and the WEX dump.
Personally, I believe that data only goes so far. In the end, like all the information streams available to us, the value comes from the applications and tools, tools like MachetEC2, which includes biopython, numpy, R, etc. Would like to see similar science specific packages up on Github, where we can contribute. Maybe an image which includes biopython, bioruby, bioperl, Ensembl tools, Blast, HMMer, R, etc. 
Related ar...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=2232814</comments>
            <pubDate>Wed, 25 Feb 2009 06:15:04 +0100</pubDate>
            <guid isPermaLink="false">2232814</guid>        </item>
        <item>
            <title>Data distribution and versioning</title>
            <link>http://www.medworm.com/index.php?rid=2132536&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FnSMEoamP2fA%2F</link>
            <description>Sharing your changes is a great post on some of the advantages of using Git (or any distributed version control system). Rich Apodaca has an even more interesting post on using GitHub for chemistry, particularly in the context of revision controlled datasets.
In general, we are getting increasingly interested in leveraging public data resources. Indeed, even in pharma there are people who have a great interest in combining internal data with public data to try and get more relevant results. But perhaps the biggest trend going forward is going to be the development of mechanisms that allow you to fork and remix data, much in the way we have done with code and media. The same paradigms apply, although the mechanisms might vary. The comment thread on Rich&amp;#8217;s...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=2132536</comments>
            <pubDate>Sun, 25 Jan 2009 20:59:57 +0100</pubDate>
            <guid isPermaLink="false">2132536</guid>        </item>
        <item>
            <title>Chunking up visualization</title>
            <link>http://www.medworm.com/index.php?rid=2121804&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FNRHoilOsbeE%2F</link>
            <description>Technology Review has an article about visualization software specifically designed for big data. In recent years we have seen trends towards algorithms and methods designed for dealing with large amounts of data using commodity hardware. We&amp;#8217;ve already seen the Map-Reduce algorithm being applied to any number of data driven problems, including the analysis of large molecular dynamics trajectories.
The software described in Technology Review, likely best described in Attila Gyulassy&amp;#8217;s PhD Thesis, takes the common approach of breaking down a problem into smaller chunks, easier said than done when dealing with visualization. Now, I am not exactly a visualization guru, so how good this work is in practice I cannot say. But it is part of a trend that I really l...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=2121804</comments>
            <pubDate>Thu, 22 Jan 2009 05:32:21 +0100</pubDate>
            <guid isPermaLink="false">2121804</guid>        </item>
        <item>
            <title>download, mirror, fork</title>
            <link>http://www.medworm.com/index.php?rid=2115636&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FR2_5BVD3vrI%2F</link>
            <description>One of my favorite sessions at ScienceOnline&amp;#8217;09 was the one on Semantic Web in Science moderated by John Wilbanks. In some ways this was the most traditional session at the event, but it worked, since John brings a level of credibility to the subject few can, with his background in science, technology and policy. There was a lot of Q&amp;A related to the Semantic Web in general and discussions around policy. But the meat for me was John&amp;#8217;s talk itself. I believe that John has probably done the best job of articulating the role of the public domain for scientific data, and he brought a new twist to it in this talk; at least it was the first time I&amp;#8217;ve heard someone talk about data like this. That was the concept of download, mirror, fork, which makes so much sense, that I am mad t...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=2115636</comments>
            <pubDate>Tue, 20 Jan 2009 03:56:54 +0100</pubDate>
            <guid isPermaLink="false">2115636</guid>        </item>
        <item>
            <title>Connected data and the tipping point</title>
            <link>http://www.medworm.com/index.php?rid=2094845&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2Fv_amPhziqik%2F</link>
            <description>How far are we from the tipping point of Big Data? When will the world’s icebergs of data melt into one sea? More importantly, when it happens, will we be ready to do something useful with it all?

Those are the questions Michael Driscoll asks in "Is big data at a tipping point". The post came to my attention via Paul Kedrosky and talks about a potential tipping point for Big Data, which occurs in a connected world. He goes on to talk about various data efforts, including one that I care about a lot :).
But let's return to the key part of his thesis. He believes that the transition from relatively unconnected to mostly connected occurs when we have about half as many edges as nodes. This also fits in somewhat with the jigsaw puzzle analogy that Jeff Jonas uses for perpetual analytics. The...</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=2094845</comments>
            <pubDate>Sat, 10 Jan 2009 20:40:00 +0100</pubDate>
            <guid isPermaLink="false">2094845</guid>        </item>
        <item>
            <title>Big data meet tech</title>
            <link>http://www.medworm.com/index.php?rid=2021565&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2F0cOtEK4BTa0%2F</link>
            <description>For the longest time, I&amp;#8217;ve hoped that the tech world would see the beauty of science, and the complexity of modern biology. A lot of the material that makes it into tech blogs is futuristic, alarmist, or plain old misinformed, or at the very least misrepresented. Well, all that changes today. Matt Wood, someone whose thoughts on big data have influenced my own quite a bit, is now blogging for the O&amp;#8217;Reilly Radar. You have someone from the trenches of modern high-throughput biology writing for one of the better tech blogs. This can only be good.
Matt&amp;#8217;s first post is about the Challenges for the new genomics. His talk on The New Genomics is worth a listen as well. (Source: business|bytes|genes|molecules)</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=2021565</comments>
            <pubDate>Tue, 09 Dec 2008 05:57:58 +0100</pubDate>
            <guid isPermaLink="false">2021565</guid>        </item>
        <item>
            <title>Science Big, Science Connected</title>
            <link>http://www.medworm.com/index.php?rid=1960825&amp;cid=t_200769_132_f&amp;fid=35011&amp;url=http%3A%2F%2Ffeedproxy.google.com%2F%7Er%2Fmndoci%2F%7E3%2FsDz-EwWfF9Q%2F</link>
            <description>The first attempt at distilling some of my thoughts on Big Data and the Networked Future of Science. Thanks to Chris Lasher for the invite to speak at VA Tech. I had fun, although in my jetlagged, uber-caffeinated state I spoke at 200 mph.

(Source: business|bytes|genes|molecules)</description>
            <author>business|bytes|genes|molecules</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=1960825</comments>
            <pubDate>Fri, 14 Nov 2008 22:17:31 +0100</pubDate>
            <guid isPermaLink="false">1960825</guid>        </item>
        <item>
            <title>Two Journal Special Issues: Big Data, and Semantic Mashups for Bioinformatics</title>
            <link>http://www.medworm.com/index.php?rid=1859507&amp;cid=t_200769_132_f&amp;fid=35028&amp;url=http%3A%2F%2Flurena.vox.com%2Flibrary%2Fpost%2Ftwo-journal-special-issues-big-data-and-semantic-mashups-for-bioinformatics.html%3F_c%3Dfeed-rss</link>
            <description>Both of these special issues are worth a look, as some of the papers look pretty interesting. I'll spend a little time in a later post on any articles I find particularly relevant.  Semantic Mashup of Biomedical Data Special Issue of the Journal...   
(Source: Systems Biology &amp; Bioinformatics)</description>
            <author>Systems Biology &amp; Bioinformatics</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=1859507</comments>
            <pubDate>Tue, 07 Oct 2008 12:55:42 +0100</pubDate>
            <guid isPermaLink="false">1859507</guid>        </item>
        <item>
            <title>Science in the petabyte era</title>
            <link>http://www.medworm.com/index.php?rid=1759821&amp;cid=t_200769_132_f&amp;fid=35006&amp;url=http%3A%2F%2Fnsaunders.wordpress.com%2F2008%2F09%2F04%2Fscience-in-the-petabyte-era%2F</link>
            <description>Just a brief note: the title of this post is taken from the cover of today&amp;#8217;s Nature. It contains several very good feature articles on the challenges of dealing with datasets of petabyte size and larger, grouped under the heading &amp;#8220;Big data&amp;#8221;.
Nature contents Sep 4 2008.
Nature News Big Data special.
By far the best of the articles is The future of biocuration: it offers practical recommendations, as opposed to the &amp;#8220;gee whizz, what a lot of data&amp;#8221; approach. Not least of which: &amp;#8220;curators, researchers, academic institutions and funding agencies should, in the next ten years, increase the visibility and support of scientific curation as a professional career.&amp;#8221;
Almost as good are Wikiomics, which tackles the lack-of-participation issue, and Welcome to the p...</description>
            <author>What You're Doing Is Rather Desperate</author>
            <type>blogs</type>
        <comments>http://www.medworm.com/rss/comments.php?id=1759821</comments>
            <pubDate>Thu, 04 Sep 2008 04:36:38 +0100</pubDate>
            <guid isPermaLink="false">1759821</guid>        </item>
    </channel>
</rss>

