Jul 30, 2010

The Apache Software Foundation, best known for its namesake open source web server, offers a variety of other technologies that can help enterprises with databases, search, development, and more. Here’s an overview of the best.

Now more than 10 years old, the Apache Software Foundation has become a repository of some of the most important open source technologies.

The foundation was created to formalize the organizational structure surrounding its flagship Apache HTTP web server and to apply the same philosophy to shepherding other projects – now, nearly 150 of them.

Open source gives enterprises access to technologies that power innovation without software licensing costs, which is especially valuable in an economic downturn. But to use open source effectively, enterprises need to understand the tradeoffs, such as the need for developers skilled at working with open source projects rather than polished commercial products.

Justin Lundy, a systems integration and information security consultant currently working for a major bank, says every Fortune 500 organization he has worked for took advantage of a couple of core Apache components: the web server and Apache Tomcat, Apache’s Java servlet engine. “It’s the Fortune 500 companies that are truly interested in innovating that we can see adopting some of the newer projects,” he says, pointing to distributed computing and data analysis technologies like Hadoop, HBase, Cassandra, and CouchDB.

“I think we will begin to see more and more migrations away from expensive commercial data warehousing products, database products, and a growing number of enterprises that embrace open source frameworks for distributed databases and distributed data warehousing and analytics,” Lundy says.

Not that this style of computing is for everyone, given the creative chaos of the open source movement. One CIO surveyed for this story wrote that Apache technologies “are not even on our radar at the moment” because he is focused on simplifying and standardizing the technologies in his environment, rather than experimenting with new ones.

But for those feeling a little more adventurous, here are some things to look at.

Hadoop

The Hadoop project attracts buzz because it mimics the style of distributed computing used internally at Google, using technologies cloned on the basis of academic papers published by Google engineers. Using the MapReduce style of programming, developers write data processing routines that fan out across clusters of computers and feed back their results. This turns out to be an efficient way of performing data-intensive tasks on commodity hardware.
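
To make the MapReduce idea concrete, here is a minimal single-machine sketch in Python. It is not Hadoop code and does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration. On a real cluster, the map and reduce steps would run in parallel on many machines, and the framework would handle the grouping step in between.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling `word_count(["the web server", "the web"])` returns `{"the": 2, "web": 2, "server": 1}`. Because each document is mapped independently and each key is reduced independently, both steps can be spread across as many machines as are available.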

Yahoo, a major backer of the Hadoop project, has incorporated the technology into its own web data analysis processes. Cloudera, a company formed to provide commercial distribution and support for Hadoop and related technologies, also claims corporations like Bank of America and Samsung as customers. Proving that Hadoop is not just for web data, Samsung says it is applying the technology to its bioinformatics business — albeit delivered as a cloud service.

HBase

A subproject of Hadoop, HBase is another clone of a Google technology, this one known as BigTable, that is used to manage very large database tables — up to billions of rows, millions of columns — with the data stored on clusters of commodity servers. This is structured data storage and analysis, just not according to relational database rules. HBase provides a BigTable-like solution that runs on top of Hadoop.
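
As a rough illustration of how a BigTable-style table differs from a relational one, the sketch below models a table as a sparse map from row key to whatever columns that row happens to have. This is not the HBase client API; the class `SparseTable`, its methods, and the `family:qualifier` column names are hypothetical and exist only to show the shape of the data.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of a BigTable-style table: rows hold only the columns they use."""

    def __init__(self):
        # row key -> {column name -> value}; absent columns take no space
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Store one cell. Columns need not be declared in advance."""
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        """Read one cell, returning `default` if the row or column is absent."""
        return self._rows.get(row_key, {}).get(column, default)

    def scan(self, start_key, stop_key):
        """Yield rows in sorted key order for start_key <= key < stop_key."""
        for key in sorted(self._rows):
            if start_key <= key < stop_key:
                yield key, dict(self._rows[key])
```

Two rows in the same table can carry entirely different columns, for example `info:name` on one and `info:email` on another, which is why such a table can have millions of possible columns without every row paying for them.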

Cassandra

Cassandra is Facebook’s contribution to the field of Big Data management and analysis. Originally built to power Facebook’s inbox search, the code was released as open source in 2008, entered the Apache Incubator in 2009, and is now maintained and refined by participants from many companies. Cassandra combines concepts from Google’s BigTable with the distributed design described in Amazon.com’s published Dynamo paper.

CouchDB

Another non-relational database, designed for easy replication across many nodes and data access via a REST API, meaning that documents and records are posted to and retrieved from the database over HTTP, the protocol of the web. The BBC is one organization that has talked about using CouchDB in combination with Apache Tomcat to build a cost-effective content management system that can be replicated across data centers.
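
The following Python sketch shows what "data access via a REST API" looks like in practice: storing a document is an HTTP PUT of a JSON body to a URL naming the database and document, and reading it back is an HTTP GET of the same URL. The requests are only constructed here, not sent. The server address in `BASE_URL` (a local CouchDB on its default port, 5984), the database name, and the helper names `build_put` and `build_get` are assumptions for illustration.

```python
import json
import urllib.request

# Assumption: a CouchDB server running locally on its default port.
BASE_URL = "http://localhost:5984"


def build_put(db, doc_id, document):
    """Build (but do not send) the HTTP PUT that stores a JSON document."""
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/{db}/{doc_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def build_get(db, doc_id):
    """Build (but do not send) the HTTP GET that retrieves a document."""
    return urllib.request.Request(f"{BASE_URL}/{db}/{doc_id}", method="GET")
```

Passing either request to `urllib.request.urlopen` would send it to a running server. Because the interface is plain HTTP, any language or tool that can make web requests can act as a client.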

Lucene and Solr

Lucene is an umbrella project for developing open source search software, including the Lucene Java library and the Lucene.NET port to C#. Solr is a high-performance search server built on Lucene Java that has been adopted by organizations such as MTV Networks for search applications on their websites.
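
Search libraries such as Lucene are built around an inverted index, a structure that maps each term to the documents containing it. The Python sketch below is a toy version of that idea, not Lucene's API or file format; the function names `build_index` and `search` are hypothetical, and real engines add text analysis, relevance ranking, and on-disk storage.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the first term's matches, then narrow by each further term.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Looking up a term is a single dictionary access regardless of how many documents were indexed, which is what makes this layout suited to search.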

Nutch

Lundy points to Nutch as an affordable alternative to purchasing a Google Search Appliance. Originally a sub-project of Lucene (as Solr is now), Nutch was reclassified as an Apache top-level project this year in recognition of its growing maturity. Building on top of Lucene, Nutch adds facilities for crawling, parsing, and indexing web documents.

It can also take advantage of Hadoop clusters. Hadoop actually originated as part of the Nutch project.

Tomcat

A free alternative to Java application servers, particularly for situations where basic Java Servlet and JavaServer Pages technologies are required and heavy-duty Java Enterprise Edition technologies would be overkill. Tomcat can also be used as the front end to more sophisticated back-end Java technologies.

Struts

A web application framework that extends the Java Servlet API to support a model-view-controller (MVC) programming model. In other words, it provides a mechanism for enforcing a clean separation between the presentation of an application (the user interface) and the logic behind the application, with the goal of simplifying maintenance of the code.
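
The separation that MVC aims for can be shown in a few lines. The sketch below is in Python rather than Java and has nothing to do with the Struts API; the names `TaskModel`, `render_task_list`, and `TaskController` are invented for illustration. The point is only that the model knows nothing about display, the view knows nothing about storage, and the controller connects the two.

```python
class TaskModel:
    """Model: holds application data and rules, with no knowledge of display."""

    def __init__(self):
        self._tasks = []

    def add(self, title):
        title = title.strip()
        if not title:
            raise ValueError("title must not be empty")
        self._tasks.append(title)

    def all(self):
        return list(self._tasks)


def render_task_list(tasks):
    """View: turns data into presentation, with no knowledge of storage."""
    if not tasks:
        return "No tasks."
    return "\n".join(f"{i}. {title}" for i, title in enumerate(tasks, start=1))


class TaskController:
    """Controller: receives a request, updates the model, and picks the view."""

    def __init__(self, model, view):
        self.model = model
        self.view = view

    def handle_add(self, title):
        self.model.add(title)
        return self.view(self.model.all())
```

With this split, replacing `render_task_list` with a view that produces HTML would require no change to `TaskModel`, which is the maintenance benefit the pattern is after.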

Several competing open source technologies fill a similar role, such as the Spring Framework.

Geronimo

Apache Geronimo pulls together many open source Java alternatives to produce a fully certified Java Enterprise Edition 5 application server.

Axis2

Part of the Apache Web Services project, Axis2 is a web services engine that supports SOAP and WSDL for distributed invocation of services via XML messaging, as well as REST. The primary implementation is in Java, although a port for C is also available.

Apache HTTP

While perhaps less glamorous than some of its distributed computing brethren, the Apache web server remains the workhorse of the web, powering some 70% of all websites and enabling many applications through its extensions for Perl and PHP programming.

Deepak Agarwal, CIO for the Palm Beach County School District, says that’s still the technology from the Apache family he finds most useful, because “It supports a variety of features, many implemented as compiled modules which extend the core functionality.” He also calls the Apache web server “easy to configure and very stable” – valuable qualities for any enterprise environment.

COMMENTS

  • Aug 26, 2010 | Sarath Reddy says:

    Just curious, how about the Apache Jackrabbit and Apache Sling projects? 12 & 13 in the list?

  • Aug 26, 2010 | rich scadding says:

    Nice article.

  • Aug 26, 2010 | Cem GULER says:

    Nice article.
    However, Subversion, SpamAssassin, and Wicket are Apache projects too.

  • Aug 27, 2010 | Josh says:

    And also Apache Qpid, a replacement for IBM's MQSeries / WebSphere MQ. It's even better in some respects.

IT Expert Voice is a partnership between Dell and Federated Media.