
        IMS Open Corpus Workbench (CWB)
        Release 3.5 (PRE-RELEASE)

        Installation Guide


This file describes how to build and install the CWB from source code.  While 
binary packages for popular platforms are available from the CWB homepage

    https://cwb.sourceforge.io/

together with detailed installation instructions, compilation from source code
is strongly recommended for serious use because

 - the resulting binaries will be more compact and efficient than those
   in the pre-compiled packages (due to use of shared libraries),
 - the compiled code can be fully optimised for the platform and CPU it
   is running on, often with noticeable performance gains, and
 - there is less risk of compatibility issues. 

If you encounter a problem that you cannot solve with the information provided
in this installation guide or on the website, you can join the CWBdev mailing 
list

    http://devel.sslmit.unibo.it/mailman/listinfo/cwb

and ask your question there.

There are two companion files to this one with additional information related
to building/installing on specific platforms:

 - INSTALL-WIN for Windows, and
 - INSTALL-MACOS for MacOS (and its predecessor OS X) 
 
These files can be ignored if you are building on a generic Unix platform
(in particular Linux).


        QUICK INSTALLATION

On MacOS, the easiest way to install CWB is via the HomeBrew package manager.
Install HomeBrew from https://brew.sh/, then type

        brew install cwb3

to install the latest stable CWB release.  You can also install the cutting-edge
code from the sf.net repository with

        brew install cwb3 --HEAD

Optionally, run a few basic tests to verify your CWB installation:

        brew test cwb3

See INSTALL-MACOS for further important information and other installation options.


On Linux, the easiest way to install CWB is with the included auto-install script,
which will identify your Linux distribution, install necessary prerequisites via
its package manager, then compile CWB and install it in the /usr/local tree.  In the
root directory of the source code distribution (where this INSTALL file is located),
type

        sudo ./install-scripts/install-linux

then enter your password (you may need to log in from an admin account first).  For a
personal installation without admin permissions and other options, read on below.


        PREREQUISITES

 - any modern Unix flavour (must be POSIX compatible)
 - GCC 3.3 or newer recommended (other ANSI C compilers might also work)
 - the ar and ranlib code archive utilities
 - GNU make or a compatible program
 - GNU bison & flex for updating automatically generated parsers
 - the pkg-config program
 - the ncurses library (or a similar terminal control library)
 - the PCRE library (see notes below)
 - the Glib library (see notes below)
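
As a quick sanity check before building, the command-line tools in the list
above can be looked up on the search path.  The following is only a sketch:
the function name "missing_tools" and the variable CWB_BUILD_TOOLS are
hypothetical helpers, not part of the CWB distribution, and the libraries
(ncurses, PCRE, Glib) are not covered by this check.

```shell
# Print the name of every given command that is not found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Build tools named in the prerequisites list (assumed command names).
CWB_BUILD_TOOLS="gcc ar ranlib make bison flex pkg-config"

# Example:
#   missing_tools $CWB_BUILD_TOOLS
```

An empty result means that all listed tools were found; any name printed
should be installed before running make.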


        RECOMMENDED SOFTWARE (not essential)

 - the GNU Readline library for command-line editing (strongly recommended)
 - GNU install or a compatible program
 - Perl with pod2man script for rebuilding manual pages
 - GNU less pager for interactive display of query results in CQP
   (a run-time dependency, strongly recommended)
 - gzip, bzip2 and other file compression tools in order to enable CWB to
   read and write compressed files (run-time dependency, strongly recommended)


        PREREQUISITE LIBRARIES ON LINUX: A WARNING

If you have installed the prerequisite libraries listed above from source,
then you will have all the files you need. However, if you are on Linux and
have installed the libraries via your distro's package manager, you may have to
install *more than one package per library* to be able to build CWB 
properly. For instance, on Fedora, installing the package "ncurses" is not
sufficient to have the library available for building extra software - 
you also need the "ncurses-devel" package. The details of precisely what 
packages you need vary from distro to distro, but in many cases packages
for building software end in "-devel" or "-dev".
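
To illustrate, here is a sketch that lists typical development packages for
two package managers.  The package names are assumptions based on common
Debian/Ubuntu and Fedora naming and may differ on your release (in particular,
newer releases may no longer ship PCRE v8); the function "dev_packages" is a
hypothetical helper, not part of CWB.

```shell
# Print the development packages for a given package manager.
# Package names are assumptions and may differ between releases.
dev_packages() {
    case "$1" in
        apt) echo "libncurses-dev libpcre3-dev libglib2.0-dev libreadline-dev" ;;
        dnf) echo "ncurses-devel pcre-devel glib2-devel readline-devel" ;;
        *)   echo "unknown package manager: $1" >&2; return 1 ;;
    esac
}

# Example (Debian/Ubuntu): sudo apt-get install $(dev_packages apt)
# Example (Fedora):        sudo dnf install $(dev_packages dnf)
```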


        BASIC INSTALLATION PROCEDURE

[There are one-step setup scripts for some operating systems: see 
"AUTO-INSTALL SCRIPTS", below. You will also find specific instructions
for Windows and MacOS in files INSTALL-WIN and INSTALL-MACOS.
Otherwise, follow the instructions here.]

Edit "config.mk", selecting suitable platform and site configuration files
for your system (available options are documented in "config.mk").  You
can also override individual settings manually there.  If you cannot find an
appropriate configuration, see WRITING YOUR OWN CONFIGURATION FILES below.

Alternatively, platform and site settings can be overridden by adding
appropriate flags "PLATFORM=### SITE=###" to the end of each of the
make commands below.  This approach is particularly recommended if you
are working from an SVN sandbox rather than a source code distribution.
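
Since the same flags have to be repeated on every make command, a small
wrapper can help.  This is a minimal sketch: the function "cwb_make" and the
variables CWB_PLATFORM and CWB_SITE are hypothetical, and the defaults "linux"
and "standard" are assumed configuration names; check "config.mk" for the
names that actually exist in your copy of the source.

```shell
# Run one make target with explicit platform and site overrides.
# Set MAKE=gmake where the default make program is not GNU make.
cwb_make() {
    ${MAKE:-make} "$@" PLATFORM="${CWB_PLATFORM:-linux}" SITE="${CWB_SITE:-standard}"
}

# Example:
#   cwb_make clean && cwb_make depend && cwb_make all
```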

Now, to compile the Corpus Library, CQP, CQPcl, CQPserver, and the
command-line utilities, type

        make clean
        make depend
        make all

If your default make program is not GNU make, you may have to type "gmake"
instead of "make" (the current Makefile system only works with GNU make and
fully compatible programs).  To install the corpus library, all programs, and
the man pages, type

        make install

Note that you must have write permission in the installation directories in
order to do so (usually the "/usr/local" tree, but site configuration files may
specify a different location with the PREFIX configuration variable).
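
Whether you have the necessary write permission can be checked before running
"make install".  The sketch below uses a hypothetical helper "can_install_to";
it tests the nearest existing parent directory if the target does not exist
yet.

```shell
# Succeed if files can be created under the given installation prefix,
# checking the nearest existing parent when the directory does not exist yet.
can_install_to() {
    dir=$1
    while [ ! -e "$dir" ]; do
        dir=$(dirname "$dir")
    done
    [ -d "$dir" ] && [ -w "$dir" ]
}

# Example:
#   can_install_to /usr/local || echo "use sudo or choose another PREFIX"
```

If the check fails, either run "make install" with sudo or select a PREFIX
inside your home directory.  Passing PREFIX=... on the make command line is
assumed to work here in the same way as the PLATFORM and SITE overrides; in
that case, use the same value for "make all" and "make install".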

You are now set to go.  If you are new to the CWB, you should read the 
"Corpus Encoding Manual" and "CQP Query Language Manual" available from
the CWB homepage.  You may also want to install pre-encoded sample corpora
for your first experiments.

If you want to make sure that all automatically generated files are up to
date, you should type

        make realclean

before starting the build process.  This will update makefile dependencies,
the generated bison/flex parsers and all man pages.  Note that this will only
work if the recommended software is installed (bison, flex and pod2man).


        AUTO-INSTALL SCRIPTS

There are now configuration/installation scripts for Unix and MacOS systems;
note that these are single-step ALTERNATIVES to following the instructions
above.

From the main CWB directory (the one containing this INSTALL file), run

        sudo ./install-scripts/install-linux

for a Linux install (or other Unix: SunOS, Cygwin; if no particular system
is detected, then the script will try a generic Unix install).

The install-linux script must be run as root (e.g. with sudo as shown above). 
Here's what it does for you:

 - downloads and installs all prerequisite software packages,
   if your Linux distribution uses a package manager
 - sets the right configuration platform for compiling
 - compiles CWB from the source code
 - installs the CWB programs to the "usual" place on your system. 
 
After running these scripts, you are ready to start using CWB.

For recent MacOS releases, the install script is semi-automatic.
It requires either the HomeBrew or the MacPorts package manager to be
available. Use the package manager to install the prerequisite external 
libraries, then run

         sudo ./install-scripts/install-mac-osx

This script will check whether all prerequisites are satisfied and prompt
you to install missing libraries; then it will compile and install CWB.

However, it is recommended to either install CWB via the HomeBrew package
manager or follow the step-by-step instructions in the separate file
INSTALL-MACOS for a trouble-free install process.

These scripts both depend on the autoconfigure script to detect the platform.
You can alternatively use this script yourself to detect the correct default 
settings to use. Running the autoconfigure script as follows will print
out variables to append to "make" commands when you run them:

        ./install-scripts/config-basic

This removes the need to manually edit "config.mk", and allows you to take a
shortcut: you can go straight to compiling.
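
One way to use this shortcut is to capture the output of the script and hand
it to make.  The following is a sketch under the assumption that the script
prints only "VAR=value" words on standard output; the function
"make_autoconfigured" and the CONFIG_SCRIPT variable are hypothetical.

```shell
# Run a make target with the variables printed by the autoconfigure script.
# CONFIG_SCRIPT and MAKE can be overridden; the default path is the one
# given in this guide.
make_autoconfigured() {
    flags=$("${CONFIG_SCRIPT:-./install-scripts/config-basic}") || return 1
    # Unquoted on purpose: each VAR=value word becomes a separate argument.
    ${MAKE:-make} "$@" $flags
}

# Example:
#   make_autoconfigured clean && make_autoconfigured depend && make_autoconfigured all
```

If the script prints additional diagnostic text, copy the variables into the
make commands by hand instead.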

Note that the autoconfigure/auto-install scripts may not work if you are
using Linux on an Opteron system. The autoconfigure script, and the 
auto-install script for MacOS, are likewise unable to distinguish 
most variants of Darwin; they will select a generic natively tuned 
configuration and focus on detecting whether prerequisite libraries are
provided by the HomeBrew or MacPorts package manager.
In some cases, manually editing "config.mk" may be better to select the
most suitable platform and site configuration.


        WRITING YOUR OWN CONFIGURATION FILES

If you cannot find suitable platform and site configuration files, or if 
you need to override some settings and expect to install future CWB releases
on the same system, you can write your own configuration files.

All configuration files can be found in the "config/platform/" and
"config/site/" subdirectories.  A listing of configuration variables with
short usage explanations can be found in the template files (aptly named
"template") in these directories, which provide good starting points for
your own configuration files.  In many cases, the easiest solution is to 
make a copy of a sufficiently similar configuration file and add your own
settings, or to inherit from this configuration with an appropriate 
"include" statement.  The "linux-*" and "darwin-*" configuration files in
the standard distribution are good examples of this strategy.
It is recommended that you store your personal configuration files in a
separate directory outside the CWB tree, so you can easily re-use them with
future versions of the software.  You just have to modify the "include"
statements in "config.mk" to use absolute paths to your configuration files.
If your configuration files inherit from standard configurations, use include
paths of the form "$(TOP)/config/...".
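
Starting a personal configuration file from a template might look as follows.
This is a sketch only: the function "new_cwb_config" and the target path in
the example are hypothetical; the location of the template files is taken from
the description above.

```shell
# Copy a template from the CWB tree to a personal configuration file.
# Arguments: CWB source directory, kind ("platform" or "site"), new file path.
new_cwb_config() {
    src="$1/config/$2/template"
    [ -f "$src" ] || { echo "no template at $src" >&2; return 1; }
    mkdir -p "$(dirname "$3")" && cp "$src" "$3"
}

# Example (run in the root directory of the source tree):
#   new_cwb_config . site "$HOME/cwb-config/site-mine"
```

After editing the copy, point the corresponding "include" statement in
"config.mk" to its absolute path.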


        BUILDING BINARY RELEASES
        
If you want to create a binary package for your platform, select a suitable
"*-release" PLATFORM and SITE configuration, then type

        make release

This will install the CWB locally in a subdirectory of "build/" and wrap it
in a ".tar.gz" archive for distribution.  The filename of this archive (which
is the same as the installation directory) indicates the CPU architecture and
operating system which the binary package has been compiled for.

The release configuration files ensure that binaries are statically linked
and will work on other machines without the development environment and DLLs.
You will also need to make sure that prerequisite libraries are available as
static archives (*.a), not just as DLLs for dynamic linking (*.dylib, *.so, *.dll).
The relevant static archives will be included in the binary release for users
wishing to link their own C code against libcl.
For Windows and MacOS it is strongly recommended to follow the specific 
instructions and recommendations in the respective companion INSTALL file.

Note that individual settings for installation directories (except for the
general PREFIX) and access permissions will be ignored when building a binary
release.


        BUILDING SOURCE RELEASES

In order to "clean up" the source code tree for a standard source distribution
and create a ".tar.gz" archive, type

        make tarball

This will remove all automatically generated files, and then recreate the 
makefile dependencies, bison/flex parsers and manual pages, so that the CWB can
be compiled from source with a minimal set of prerequisites.


        BUILDING RPM PACKAGES FOR LINUX

In order to create a binary Linux distribution in RPM format, edit the file
"rpm-linux.spec" as necessary, then copy the source code archive (whose precise
name must be listed in the "Source:" field of the RPM specification) into
"/usr/src/packages/SOURCES", and run

        rpmbuild -bb --clean --rmsource rpm-linux.spec

The ready-made binary RPM package will then be available in the appropriate
subdirectory of "/usr/src/packages/RPMS/".  It may be necessary to select the
appropriate Linux configuration (e.g. to build a 64-bit version of the CWB) in
"config.mk" and rewrap the source archive before building the RPM package.
Otherwise, the build process will automatically select the generic Linux 
configuration for standard i386-compatible processors.


        INSTALLING PREREQUISITE EXTERNAL LIBRARIES

The Glib and PCRE libraries are needed to compile CWB. If you are using a 
Linux flavour such as Debian or Fedora, the easiest way to install them is
via your package repository, which almost certainly includes both. 
The auto-install scripts use the package-management tool to check that you
have got the packages in question, so if you are using those scripts, you
don't need to worry about it. 

Otherwise, PCRE and Glib are available to download from these addresses:

        http://www.pcre.org/
        http://www.gtk.org/download.html

Use the instructions included in the source code downloads to build the 
libraries.

For Glib, at least version 2 is required.

For PCRE, at least version 8.10 is required. You must use a copy of PCRE
which has been compiled with Unicode (UTF-8) support and Unicode
properties support. (You can find out whether this is the case using the
pcretest utility with the -C option.) Note that the new-and-improved PCRE2
library (PCRE v10.0.0 and up) is NOT a suitable substitute. 

We strongly recommend using a recent version of PCRE v8 (>= 8.32) with JIT
compilation enabled (check with "pcretest -C") in order to ensure good
performance. For complex regular expressions, this will make the lexicon
search A LOT faster (see http://sljit.sourceforge.net/regex_perf.html).
Note that standard packages shipped with various Linux distributions may
not include JIT support and you may have to compile PCRE yourself for
optimal performance. The following Linux distributions appear to include 
a PCRE version high enough to include JIT:

        Ubuntu 14.10 ("Utopic") and newer
        Debian 7 ("Wheezy") and newer
        Fedora 20 ("Heisenbug") and newer
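
The version thresholds mentioned above (8.10 required, 8.32 recommended) can
be tested against the output of "pcre-config --version".  The sketch below
uses a hypothetical helper "version_at_least" that compares major.minor
version numbers numerically; it does not check for UTF-8, Unicode properties
or JIT support, for which "pcretest -C" remains the reference.

```shell
# Succeed if version $1 (major.minor) is at least version $2.
version_at_least() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        split(have, h, "."); split(want, w, ".")
        if (h[1] + 0 != w[1] + 0) exit ((h[1] + 0 > w[1] + 0) ? 0 : 1)
        exit ((h[2] + 0 >= w[2] + 0) ? 0 : 1)
    }'
}

# Example:
#   version_at_least "$(pcre-config --version)" 8.32 ||
#       echo "PCRE is older than 8.32; JIT may be unavailable"
```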

For info on PCRE and Glib on MacOS, see the companion file INSTALL-MACOS.
Note that PCRE JIT does not support Apple Silicon (M1 etc.), 
so regular expression performance may be unsatisfactory on this platform.

The CWB build process makes the following assumptions:

 - that the location of header/library files for Glib can be discovered
   using the pkg-config utility (should be the case if you have used the
   standard installation procedure);

 - that the location of header/library files for PCRE can be discovered
   using the pcre-config utility (which should also be the case if you
   have used the standard installation procedure).

If you want to use your own copy of Glib or PCRE instead of the version
provided by a standard package, set up your search path so that the 
appropriate pkg-config or pcre-config binary is found first.
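
Both assumptions can be verified from the command line before building.  The
following sketch uses a hypothetical helper "report_config"; "glib-2.0" is
the pkg-config module name for Glib version 2, and the directory in the last
example is a placeholder for wherever your own copy of PCRE is installed.

```shell
# Report whether a configuration query succeeds,
# e.g. a pkg-config or pcre-config call.
report_config() {
    label=$1; shift
    if "$@" >/dev/null 2>&1; then
        echo "$label: found"
    else
        echo "$label: MISSING"
        return 1
    fi
}

# Examples:
#   report_config Glib pkg-config --exists glib-2.0
#   report_config PCRE pcre-config --version
# To prefer a private copy (placeholder path), put it first on the PATH:
#   PATH="/opt/pcre/bin:$PATH" report_config PCRE pcre-config --version
```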

If you cannot provide installation details through pkg-config and
pcre-config, quite a bit of tinkering with the Makefile includes will be
needed to make things work.


        PACKAGE CONTENTS

Makefile                top-level makefile
config.mk               makefile configuration
definitions.mk          standard settings and definitions for make system
rpm-linux.spec          configuration file for building binary RPM packages
install.sh              a GNU-compatible install program (shell script)
README                  the usual open source "boilerplate"
INSTALL                 this file
INSTALL-WIN             supplement to this file on Windows-specific matters
INSTALL-MACOS           supplement to this file on MacOS-specific matters
COPYING                 licence info (a copy of the GPL v2)
CHANGES                 change log
AUTHORS                 info on who wrote CWB, copyright, etc. 

doc/                    some technical documentation

config/                 platform and site configuration files 
  config/platform/        compiler flags and settings for various platforms
  config/site/            site-specific settings (installation paths etc.)

instutils/              utilities for installation and for building binary packages

install-scripts/        shell scripts that automate building/installing for
                        common systems

cl/                     corpus library (CL) source code
cqp/                    corpus query processor (CQP) source code
CQi/                    cqpserver source code (inc client-server interface CQi)
utils/                  source code of command-line utilities

man/                    manpages for CQP and the command-line utilities

editline/               local copy of the CSTR Editline library (no longer used)
