Where should I place my .Rprofile file?

Sometimes configuration can be a pain – especially on Windows, where many things happen in the background. According to the R documentation, the profile files can be placed in three different places and will be read during startup.

  • Rprofile.site – this usually lives at $R_HOME/etc/Rprofile.site and does not exist on factory installations aka fresh installations
  • .Rprofile in your home directory – on Windows the home directory is what R reports as R_USER (see below)
  • .Rprofile in your current working directory
The .Rprofile file can hold any settings you want injected into your R session every single time to provide default values: default options(), a changed prompt, output printing options and many more.
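Here is a minimal sketch of what such a file could look like – the concrete values (prompt, digits, CRAN mirror) are just illustrative, not recommendations:

## Example .Rprofile – adjust the values to your taste
options(
  prompt = "R> ",                                   # personal prompt
  digits = 4,                                       # shorter numeric output
  repos  = c(CRAN = "https://cloud.r-project.org")  # default CRAN mirror
)

.First <- function() {
  # runs at the start of every session
  message(".Rprofile loaded from ", normalizePath("~"))
}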

Find the location

Finding the location for the user-level .Rprofile can be challenging. It’s pretty easy on Linux, as .Rprofile usually lives happily in your $HOME directory.

On Windows things may be a bit more complicated. The R documentation FAQ mentions a method for finding the correct location. Start R, type the following into your R console, and R will print out the path to your R_USER directory – that’s where the .Rprofile file belongs.

> Sys.getenv("R_USER")
[1] "//fshome/home$/rhaen/My Documents"

As you can see in this example from my Windows laptop, the local IT department prefers to have the home directory on a file server. Therefore the full path to the .Rprofile will be:

//fshome/home$/rhaen/My Documents/.Rprofile

Some thoughts about the location

You have to keep in mind that there is always a certain load order for the profile files. The general load order is the following:

  1. Rprofile.site
  2. .Rprofile in the current directory
  3. .Rprofile in your home directory

Note that only one of the two user-level files is read: if a .Rprofile exists in the current directory, the one in your home directory is skipped.

The load order matters, and we can make use of it for specific tasks. I tend to store password/OAuth tokens in the .Rprofile file inside my projects. When I work on a project I switch to the project directory and start R from there; that causes R to load the project-oriented .Rprofile with the necessary tokens in it.
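A sketch of such a project-level .Rprofile – the environment variable name and the token are purely hypothetical placeholders:

## Project .Rprofile – only loaded when R is started in this directory
## The environment variable name and the token are placeholders.
Sys.setenv(MYSERVICE_OAUTH_TOKEN = "0000-not-a-real-token")

## project-specific defaults can live here as well,
## e.g. a project-local library path
.libPaths(c("./lib", .libPaths()))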

Everything that is not project specific lives in the .Rprofile in my home directory. Usually I store commonly used settings such as R_LIBS, PAGER and my personal R prompt in there. Commonly used packages can be loaded from there, too.

Can I use the SHELL/OS environment for settings?

Being a maintainer of software in all different kinds of environments, I don’t think using .bash_profile or .bashrc for settings of the R interpreter is the best way. The reason for this is pretty simple: those shell-specific files are not always processed, and you need a good Unix background to understand when they are read and when they are not. Especially for “production grade” deployments with R I would highly recommend a dedicated configuration file that you load explicitly – or a .Rprofile file inside your R application/deployment.
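As an illustration only – the file name and settings are made up – a deployment could keep its settings in a plain R file and source it explicitly at startup, independent of any shell profile:

## config.R – dedicated, version-controlled configuration
config <- list(
  db_host   = "db.example.org",
  db_port   = 5432,
  log_level = "INFO"
)

## main.R – entry point of the application
source("config.R")   # loaded explicitly, no .bashrc involved
message("Connecting to ", config$db_host, ":", config$db_port)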

While we are at it: use a version control system for this. It doesn’t matter which one as long as you stick to it!

Generating openssl key and cert via shell script

Sometimes certain programs require a PKI (public key infrastructure) to secure data transfers or to encrypt files. The open source backup software Bacula is such an animal: it uses a private key and a self-signed certificate to encrypt the backup data on the client. It adds a so-called master key as well, which you can use to decrypt your backup data in case of disaster.

The idea is great and it’s very simple. However, what’s the best way to automate the installation on the clients? Puppet comes to the rescue, of course – all we need is a snippet which generates the key and the certificate on the client without any interaction.

Here is a small shell snippet (called bacula-keygen.sh) which can handle this.

#!/bin/sh
#
# This program creates a RSA key and a certificate
# on the command line without asking questions
#
# Useful for Bacula client-fd installation.
#
# You can override the default values in the ENV
# aka:
# $ DAYS=5 ./bacula-keygen.sh
#
# Which will create a certificate which is valid
# for 5 days.
#
# Ulrich Habel <rhaen@pkgbox.de>

set -eu

: ${CERT="`hostname`.pem"}
: ${KEYLENGTH=4096}
: ${DAYS=3650}
: ${COUNTRY="DE"}
: ${STATE="."}
: ${LOCATION="Starnberg"}
: ${COMPANY="ACME Ltd"}
: ${DEP="ACME Support Coordination"}
: ${CN="`hostname`"}
: ${MAIL="support@acme.org"}
/usr/bin/openssl req -new -x509 -newkey rsa:${KEYLENGTH} \
-nodes -days ${DAYS} -out ${CERT} -keyout ${CERT} << EOF
${COUNTRY}
${STATE}
${LOCATION}
${COMPANY}
${DEP}
${CN}
${MAIL}
EOF

You can download this script here: RAW on GitHub, as GIST on GitHub
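For example, one could run the script with a few overrides and then inspect the generated file with openssl – the values here are purely illustrative:

$ DAYS=730 LOCATION="Munich" ./bacula-keygen.sh
$ openssl x509 -in "$(hostname).pem" -noout -subject -dates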

If you are using this kind of encryption your backup server will store encrypted data and will never see unencrypted files – which is a great thing in terms of PCI compliance.


Using git with custom CA certificates (https)

Well, well – there seem to be some misunderstandings in the world of git and its usage with SSL certificates that don’t belong to standard CAs. Here is a quick way to tell git where to look for certificates. Please note – this might vary from operating system to operating system; I’ll explain it for CentOS / RHEL based systems, but it should work everywhere with a little adaptation. Leave your solution in the comments.

Almost every git distribution relies on libcurl for the HTTP transactions, which is a good thing. On CentOS, libcurl checks the certificates against certain locations. Usually these are:

  • Initializing NSS with certpath: sql:/etc/pki/nssdb
  • CAfile: /etc/pki/tls/certs/ca-bundle.crt

The way this works is to copy your CA file (the exported certificate from the CA which signed your certificates) to /etc/pki/tls/certs.

The only thing we have to do now is to tell git to add a CApath to its configuration by running:

$ sudo git config --system http.sslCAPath /etc/pki/tls/certs

There you go. If you run your git command now, libcurl will check the path /etc/pki/tls/certs for additional certificates, find yours and be able to verify the custom certificate. Of course, this works with the cacert.org based certificates, too – just place them into your certs directory.
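If you prefer to point git at a single CA bundle file instead of a whole directory, git also understands the http.sslCAInfo setting – the file name below is just an example:

$ sudo git config --system http.sslCAInfo /etc/pki/tls/certs/my-company-ca.crt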

You can turn on debugging by prefixing your git command with the environment variable GIT_CURL_VERBOSE. So the example for a clone would be:

GIT_CURL_VERBOSE=1 git clone <url>

This way you can see exactly what git is doing during the verification. As this is a configuration setting it will survive the next updates – nice, eh?

Oh, and please stop writing “turn off SSL verification” if you just lack the understanding… (same with SELinux btw – check this article for help)

Getting a grip on PHP

Sometimes things in your life change rapidly. For several reasons I want to look into the world of PHP. Having been a Perl guy for a long, long time, I’m wearing my Perl glasses as I start to look into the different aspects of the language.

When I started to look at PHP I wasn’t interested in the old versions – 5.4.11 is the version to look at, everything else is history. I wanted to start with OO from the beginning and I wanted to start with a testing approach. TDD was the way to go: how do I define my tests, how do I write a class, what is the best way to ship it, how do I structure my code/applications/modules?

And finally – how do I write code? Is it the emacs/vi approach, or should I go for NetBeans, Eclipse or other IDEs?

This starts a short series of PHP-related articles about my attempt to learn a new language. I will use the tag PHP for these articles; hopefully I can manage the separation so that the Perl RSS aggregators don’t pick up the PHP content – I will notify them.

If you want to follow the blog with all the contents use this feed: All articles

 

If you want to see the Perl stuff use this feed: Perl related

 

Want to see the PHP stuff? Take this route: PHP related

Testing the scraper – with Perl (of course)

There has been quite some feedback about the last blog article on web scraping using Web::Scraper. I promised to put up a GitHub repository but haven’t published the link so far. So here it is – the full scraper application for the Tagesschau::Video::Asset module.

The distribution is CPAN-ready – it has all the documentation that is needed, a Makefile.PL, tests and an example. All you need to get started is to point your browser at the repository linked above and you are ready to go.

Here is something which was rather interesting to me. I am scraping a webpage which is produced by a content management system; if the layout or the CSS classes change, there is a chance that my module will break. I therefore decided to add a small check inside the module. Normally I return a data structure with the correct data after the scraping was successful; now the module fails when it gets back something that differs.

Usually the structure looks like this:

$VAR1 = {
          'formats' => [
                         {
                           'link' => '...',
                           'format' => '...'
                         },
                         {
                           'link' => '...',
                           'format' => '...'
                         }
                       ],
          'timestamp' => '...',
          'story' => '...',
          'headline' => '...'
};

So you know before the call what you’ll get back if the call and the parsing are successful. I decided to implement a quick check for the keys of the hash inside the code. If one of the keys is missing, the whole thing fails. I am not sure if this is a great idea; however, it will certainly help you to detect changes in the web page.

for (qw/formats timestamp story headline/) {
  croak "Missing key from scrape ($_)"
    if !exists $res->{$_};
}

Please provide some feedback about the idea. Do you think it makes sense to validate the data structure before returning it from the subroutine?

I’ve also had some great fun writing the tests. I had never used Test::Deep before; it provides a great way to test multilevel data structures. I ship a very small part of the original webpage with the distribution to run the tests against, so there is no need to fetch the actual web page when you are installing the module – you can use the example asset file. The test runs the calls against a file:/// resource, parses aka scrapes the HTML document and returns a data structure. Here is the test from the basic.t test file.

cmp_deeply(
  $res,
  {
    timestamp => ignore(),
    formats   => ignore(),
    headline  => ignore(),
    story     => ignore(),
  },
  "Testing data structure"
);

This is redundant with the check we’ve seen earlier. However, I just wanted to have a start – now I can use this as a personal reference. Yep, I’ve looked into Test::Deep – at least I’ve read the module’s pod page.
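If you want the test to check more than the mere presence of the keys, Test::Deep’s matchers can describe the nested structure as well. This is just a sketch based on the dump shown above, not part of the actual test suite:

use Test::Deep;

cmp_deeply(
  $res,
  {
    timestamp => re(qr/\S/),            # any non-empty string
    headline  => re(qr/\S/),
    story     => re(qr/\S/),
    formats   => array_each(            # every element of the formats array...
      {
        link   => re(qr/^https?:/),     # ...is assumed to stringify to a URL
        format => re(qr/\S/),
      }
    ),
  },
  "Testing data structure in more detail"
);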

Scraping the web (Web::Scraper) and CSS

There is more than one way to extract content out of a webpage. The CPAN module Web::Scraper provides a great way to scrape (hence the name) content out of webpages. It makes use of CSS selectors or XPath queries to extract the data out of the HTML content. I prefer to use CSS selectors as I am already familiar with them. The module provides decent documentation; however, examples are always easier to understand. CSS selectors are common in the world of HTML: usually you use them to describe certain elements in an HTML document and assign CSS styles to them. With Web::Scraper, however, you can use selectors to navigate through the document.

The challenge

There is a webpage of the German national broadcaster which hosts video streams from reporters all over the world in video blogs. Unfortunately there is no RSS feed pointing at the video streams themselves – the existing RSS feed only leads you to a webpage containing the new stream. So I decided to pull their RSS feed, follow the links inside it and check each page for video streams. I wrote a small wrapper around this, a small Web::Simple based application that generates an RSS feed with the video links included – which I can subscribe to. Voilà, there is my RSS feed with videos for my tablet.
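The wrapper itself is out of scope for this article, but a rough Web::Simple sketch of the idea could look like this – the package name, the route and the static RSS string are placeholders:

#!/usr/bin/env perl
use Web::Simple 'VideoFeed';

{
  package VideoFeed;

  sub dispatch_request {
    sub (GET + /feed) {
      # in the real application the RSS XML is assembled from the
      # scraped video links; here it is just a static placeholder
      my $rss = '<rss version="2.0"><channel><title>Videoblog</title></channel></rss>';
      [ 200, [ 'Content-Type' => 'application/rss+xml' ], [ $rss ] ];
    },
  }
}

VideoFeed->run_if_script;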

 

 

The solution

Here are some elements which I wanted to extract, and I’ll show you the CSS selectors I used to navigate through the content.

Here is the first snippet of the webpage that I want to extract:

 

<html>
  ...
  <span>Videoblog ...</span>
  <h1>...</h1>
  <p>...</p>

There is only one h1 headline inside the document. I wanted the content of the headline and the content of the paragraph which follows the h1 tag. Here is the selector to get them, in the language of Web::Scraper:

my $scrap = scraper {
  process 'h1',   headline => 'TEXT';
  process 'h1+p', story    => 'TEXT';
};

The h1 tag is pretty self-explanatory. The notation h1+p describes the first p tag directly following the h1 tag. The 'TEXT' says that I want to extract the content which is enclosed by the tags.

The webpage has some download links for video streams on it. I wanted to grab the URLs of the streams and the title of the corresponding stream. Here is the HTML snippet.


<ul id="...">
  <li>
    <a href="..." title="...">
      <span>...</span>
    </a>
  </li>
  <li>... multiple times, one per stream ...</li>
</ul>


Well, things are getting a little bit complicated here. I am interested in the href attribute and in the content enclosed by the span tag. As there are many li elements – one for each stream – we need to store the return values in an array. So, let’s look at the CSS selector.

process 'a.downloadLink', "formats[]" => scraper {
  process 'a.downloadLink[href]', link   => '@href';
  process 'span.title',           format => 'TEXT';
};


Huh, OK – so let’s proceed step by step. The selector a.downloadLink selects every a tag that has the class downloadLink. I am using the array formats[] to store the return values of the nested scraper. That scraper selects the href attribute ('@href') of the link and the text ('TEXT') of the span element with the class title.
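For completeness, here is how such a scraper is invoked – the URL is only a placeholder, the real one lives in the repository mentioned above:

use URI;
use Web::Scraper;
use Data::Dumper;

# $scrap is the scraper defined above
my $res = $scrap->scrape( URI->new('http://www.example.org/videoblog.html') );
print Dumper($res);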


The result is a data structure which can be processed. Here is an extract of it:

$VAR1 = {
  'formats' => [
                 {
                   'link'   => '...',
                   'format' => '...',
                 },
                 {
                   'link'   => '...',
                   'format' => '...',
                 },
               ],
  'story'    => '...',
  'headline' => '...'
};

 

The display of the source code and the data structure is nowhere near perfect in this blog engine so I’ve provided a small gist for it on my GitHub account. You can find it here.

 

After you have walked through the source code of the example you’ll know how to extract portions of a webpage using CSS selectors. You can use Mojolicious for this purpose, too! The Mojo::DOM::CSS module provides everything you need. Which one you choose is largely a matter of taste. For long HTML documents I would go for Web::Scraper, which might be faster since it can use LibXML if properly installed (plus you’ll have validation support if needed).
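Just to illustrate the Mojolicious route: a rough sketch of the same extraction with Mojo::DOM and the CSS selectors from above. $html is assumed to already contain the fetched page; this is not a drop-in replacement for the module:

use Mojo::DOM;

my $dom = Mojo::DOM->new($html);

my %res = (
  headline => $dom->at('h1')->text,
  story    => $dom->at('h1 + p')->text,
  formats  => [
    map { +{ link => $_->attr('href'), format => $_->at('span.title')->text } }
      $dom->find('a.downloadLink')->each
  ],
);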

 

Conclusion

CSS selectors provide a powerful way to parse HTML/XML content and to extract data out of it. Web::Scraper and Mojo::DOM::CSS are two modules on CPAN which are well maintained, well documented and have an active community around them. I think Web::Scraper deserves to be included in the toolchain of everyone working with Perl.

 


What are the key elements of a Perl workshop?

There is a Linux workshop in Augsburg in March 2013. I used to give presentations there, mostly about Open Source projects and thinking, some about Perl aspects, too. This year I want to run a Perl beginners’ workshop to get more people into the modern way of thinking and coding in Perl. I am not sure if I can run practical sessions with laptops; probably it will be a mixture of practical and theoretical parts. The workshop time slot is 2 hours.

Imagine that you are taking part in the workshop as someone who is starting to learn Perl. What would you expect to learn? What do you want to hear about?

Please leave your feedback in the comments – I have to submit the workshop proposal by 13 January 2013.

Old Perl books – still useful?

Here is an oldie. I found it when I was cleaning my bookshelf. Is a short reference about Perl 5.6 still usable in 2012, after Perl 5.16.2 has been released? It depends on your usage of such references: do you like to have something on your desk which looks nice, something which keeps the wind from messing with your sheets – a so-called paperweight? Or do you prefer current references? I am pretty indifferent about such references. The code inside it will still work, even in 2012. If you try to compile the examples from the original K&R C book, current C compilers will probably remove your user account and lock down root access to the server. In Perl, however, nearly everything still works – from the geologically stable Perl versions ((c) mst) up to the current Perl version. This leads to the question of deprecation of features. brian d foy wrote a nice article about this (Perl 5.12). Make sure to look into the perldelta pod pages; usually a few things are listed as deprecated from version to version (you already did this, right?).

So I am still undecided about the book. Keep it or trash it? Maybe it's worth keeping – to stay in touch with the ancients. You decide, what's your opinion about this?

Perl 5 - Kurz und gut
Perl 5 Reference