
Wednesday, December 26, 2012

Run a script with lowest priority

When doing low-priority tasks like backups or fixing permissions in a cronjob, it is a good idea to adjust the niceness of the script. By using ionice and renice, you can ensure it won't get in the way of your important programs.
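A minimal sketch of what that looks like, assuming a Linux box where ionice is available (the backup script path is just a placeholder):

# run a backup with the lowest CPU and I/O priority
ionice -c3 nice -n 19 /usr/local/bin/backup.sh

# or lower an already-running process (PID 1234) after the fact
renice -n 19 -p 1234
ionice -c3 -p 1234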

Friday, December 21, 2012

Choosing the best wireless router for a small office

At my office, we needed a new router. With more than 50 devices using the network quite heavily, we were crushing our basic consumer router.

After some unsuccessful searches, I asked the question on Reddit and I worked from there.

Our setup:
  • 20~25 wired computers
  • 20~25 wireless computers
  • ~25 smartphones
Our needs:
  • ~500 GB download/month
  • ~300 GB upload/month
  • Google Apps (Docs, Spreadsheet, Gmail, etc.) – A lot of AJAX requests
  • SSH – Needs very low latency
  • Skype – High bandwidth usage
  • Dropbox – Moderate bandwidth usage
  • Sending and receiving >5GB files
  • Syncing thousands of files over FTP
Our routers were:
  • D-Link DIR-655 acting as the gateway: http://goo.gl/TXXs0
  • WRT54 acting as a router and a wifi AP, configured with DD-WRT: http://goo.gl/Wu89u
Unfortunately, the network always seemed to slow down for no reason and we needed to restart the routers a couple of times per day. We tried flashing the routers, switching them around and applying updates; they simply could not handle the load.

I knew I wanted some sort of QoS to prioritize usage (e.g. Skype > SSH > browsers > FTP). Also, good quality connections at a decent price are pretty hard to find at our location, so I was looking for a dual-WAN setup.

Upload speed

When you are trying to improve an Internet connection shared by a lot of people, the upload speed will most likely be the bottleneck. For 30 heavy users, try to have at least 5 Mbps of upload bandwidth.

Network link speed negotiation

Routers often use a non-standard algorithm to detect the speed of the link between the router and the modem (10, 100 or 1000 Mbps) and sometimes they get it wrong. If the detection fails, you may end up talking at 10 Mbps to a modem trying to work at 60 Mbps. More often than not, you want this forced to 100 Mbps.
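On a Linux-based router or server, you can check and force the negotiated speed with ethtool; a sketch, assuming eth0 is the interface facing the modem:

# see what speed/duplex was negotiated
sudo ethtool eth0 | grep -iE 'speed|duplex|auto'

# force 100 Mbps full duplex if autonegotiation keeps getting it wrong
sudo ethtool -s eth0 speed 100 duplex full autoneg off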

AP

The router I chose has wireless antennas but, if 30 people share the same antennas, you will experience some slowdown. Consider installing a few access points around the place. You will have to wire them together, but that is much simpler than wiring everyone.

Router

I had heard good comments about Draytek and Reddit confirmed them. They are pretty solid, easy to configure and a much cheaper alternative to their Cisco counterparts. I finally went for a Draytek 2920n. It costs about $250, supports load balancing and failover with a second Internet connection, and is managed in much the same fashion as a common router. The big difference is that it is much more powerful and can handle the traffic of a hundred devices without struggling. It was a godsend; we did not even need to upgrade the Internet connection. Check it out.

Other options

As @devopstom suggested to me, other great alternatives are Peplink, Netgear, HP and Meraki.

Meraki is especially cool with its remote management, very easy administration and automatic updates, but it was a bit too expensive for our needs.

Monday, December 17, 2012

Varnish 3 configuration

Shared caches are pretty strict about whether or not to cache a request. This is good; otherwise they would cache requests that are not meant to be cached.

However, there are some things that can be done to improve cacheability:

  1. Ignore Google Analytics cookies
  2. Remove empty Cookie line
  3. Normalize Accept-Encoding header
  4. Allow replying with a stale response if the backend is slow
Bonus:
  1. Remove some headers to reduce header size and hide some details about the server (security).
  2. Add a debug header to help understand why a request is cached or not.


How to properly create and destroy a PHP session

When working with shared caches like Varnish or Nginx, cookies will kill everything you are trying to do.

The idea is that since a cookie can be used by the backend to modify the reply, for example when you are logged in as a user, the shared cache does not take the chance and refuses to cache the response. This behaviour can be modified, especially in the case of Google Analytics cookies, but for the PHP session cookie, you will typically want to keep it.

However, for your shared cache to be useful, it is important to only start a session when you really need it and to destroy it when it is not needed anymore. PHP's session_destroy documentation specifically mentions that it does not unset the associated cookie.

So the idea is to start a session when you need it, for example on a login page. This sends a header to the client setting a cookie, and the cookie will be sent back on every request. Therefore, in some global file, you can detect this cookie and reload the session. Finally, on logout, clear everything.


Wednesday, December 5, 2012

Generate Lorem Ipsum using loripsum.net API

loripsum.net describes itself as “The ‘lorem ipsum’ generator that doesn't suck”. It is the same text you are used to seeing over and over, except that it is decorated with various HTML tags like <ul>, <b>, etc.

Seeing this, I thought it would be nice to use it to generate data fixtures meant to be used as blog posts, pages, comments, etc. I started a prototype for Wordpress and I should come up with a post about it soon enough, but in the meantime, I thought it would be valuable to share this little interface.

It is pretty straightforward and the options are documented in the code itself. Have fun!
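If I remember the API correctly, it takes the number of paragraphs, a length and a list of options directly in the URL, so a quick test looks something like this:

# fetch 4 medium paragraphs decorated with headers, lists and code blocks
curl -s http://loripsum.net/api/4/medium/headers/ul/code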

Wednesday, November 21, 2012

Automatic LESS compilation using mod_rewrite and lessphp

I have been using LESS (and SCSS before that) and I enjoy it, but I have to admit I was spoiled by Symfony, Twig and Assetic, which compile assets automatically, and I really do not look forward to compiling all my assets by hand. I could use an application that watches a directory and compiles them on the fly, but none works on every OS and you still have to fire it up and configure it each time you start a project or reboot your computer.

Using lessphp, I made a simple script that will compile your LESS files, do some caching, gzip, etc.

It works by checking whether a .css file exists. If not, it checks whether the corresponding .less exists. If so, it fires up lessphp.

It is not perfect, but is worth using on simple projects.

Tuesday, November 20, 2012

The little quirks of using Twitter Bootstrap

Twitter Bootstrap is awesome to get started quickly on a custom layout. It has a lot of features that are very good best practices, like intuitive colours for buttons, error highlights and some good normalizations. However, like any CSS framework, it has its quirks. Here are some rules that are applied to elements without any classes and the issues they may cause. Keep in mind that this is Bootstrap v2.1.1 and that this is in no way an exhaustive list nor a bug list. It is simply a few things to pay some extra attention to.

input[type="file"] {
  height: 30px;
  line-height: 30px;
}
On Chrome and Firefox, I had problems where the Browse button was not correctly vertically aligned.

h1,h2,h3,h4,h5,h6 { text-rendering: initial; }
May cause some troubles with custom fonts.
I noticed this problem with Google Webfont Advent Pro.

img { max-width: 100%; }
May cause some troubles if your wrapper box is smaller.
You may also see flicker and reflows during loading while the size is readjusting.
To correct this problem, you should always specify height and width attributes, but sometimes it is not possible.

button, input { line-height: normal; }
Will cause button and input not to inherit the line-height specified in the parent.

h1 { font-size: 36px; line-height: 40px; }
Occurs with all headers. Since the line-height is not in em, you will need to update it manually if you change the font-size.

input, select, textarea { 
    &[readonly], &[disabled] {
      cursor: not-allowed;
    }
}
This is not a problem in itself, but you may not want this cursor.
I stumbled upon it when I was trying to style an input as a seamless inline element.


And of course there are all the rules for text-shadow, border-radius, box-shadow, etc.
All in all, I found myself using rules like these quite often.
The exact rules will depend on your usage, but you get the point.
They are not specific to Twitter Bootstrap.
.reset-font {
    font-size: inherit;
    font-family: inherit;
    line-height: inherit;
}
.reset-size {
    height: auto;
    width: auto;
    max-height: none;
    max-width: none;
}
.reset-effects {
    text-shadow: none;
    box-shadow: none;
    border-radius: 0;
}

Monday, November 19, 2012

Generate a multi-resolution favicon using ImageMagick

The PNG format for favicons is supported by most browsers but, as you are all aware, the current state of the Web means we must not develop only for "most browsers".

ICO favicons are very well supported and offer a bonus feature: multiple resolutions in a single file. This way, the browser decides which resolution it prefers. This is notably useful in the era of iPads and MacBooks with high-resolution (Retina) displays.

Here is a simple script that resizes an image multiple times and combines the results using ImageMagick. It should work with any format ImageMagick supports.
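The core of it is a single convert call; a minimal sketch, with logo.png standing in for your source image:

# resize the source to the common favicon sizes and bundle them into one .ico
convert logo.png \
  \( -clone 0 -resize 16x16 \) \
  \( -clone 0 -resize 32x32 \) \
  \( -clone 0 -resize 48x48 \) \
  \( -clone 0 -resize 64x64 \) \
  -delete 0 favicon.ico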

Thursday, November 1, 2012

Simple templating system using Bash

I often have to deploy several config files that are very similar, things like Apache VirtualHosts and PHP-FPM pools. The usual solution to this kind of problem is something like Puppet or Chef, which applies a real template engine and does much more, like creating folders. However, that kind of solution is often lengthy to implement and prevents you from doing quick edits on the fly.

Hence, for very simple needs, I started using simple scripts that would only replace variables and give me a basic template to start with. That is, however, not very flexible and needs to be adapted for each case. So I wrote a templater that replaces variables with values from the environment. It also supports default values and variable interpolation.

Example with Apache + FPM

{{LOG_DIR=/var/log/apache2}}
{{RUN_DIR=/var/run/php-fpm}}
{{FCGI=$RUN_DIR/$DOMAIN.fcgi}}
{{SOCKET=$RUN_DIR/$DOMAIN.sock}}
{{EMAIL=$USER@$DOMAIN}}
{{DOC_ROOT=/home/$USER/sites/$DOMAIN/htdocs}}
<VirtualHost *:80>
  ServerAdmin {{EMAIL}}
  ServerName {{DOMAIN}}
  ServerAlias www.{{DOMAIN}}

  DocumentRoot "{{DOC_ROOT}}"

  <Directory "{{DOC_ROOT}}">
    AllowOverride All
    Order allow,deny
    Allow From All
  </Directory>

  AddHandler php-script .php
  Action php-script /php5.fastcgi virtual
  Alias /php5.fastcgi {{FCGI}}
  FastCGIExternalServer {{FCGI}} -socket {{SOCKET}}

  LogLevel warn
  CustomLog {{LOG_DIR}}/{{DOMAIN}}.access.log combined
  ErrorLog {{LOG_DIR}}/{{DOMAIN}}.error.log
</VirtualHost>

Invocation

DOMAIN=cslavoie.com ./templater.sh examples/vhost-php.conf

Help

If you add the -h switch to the invocation, it will print all the variables and their current values

And the code is available on GitHub.

Friday, October 26, 2012

Using a public CDN versus hosting custom libraries

The plus side of using a public CDN

It is becoming common practice to use public CDNs to host assets like jQuery and Bootstrap. The advantages are huge: less bandwidth usage for you, a fully configured CDN with everything like gzip, cache headers, etc., and a chance that even on the first load of your website, the user will already have the asset in their cache.

The down side

However, those libraries often come fully bundled. The whole JavaScript for jQuery UI is a whopping 230 KB. It is of course cached, gzipped and everything, but you still have to parse all that JavaScript. Twitter Bootstrap also runs a lot of onLoad functions to bind things automatically. Almost all libraries like these have builders that let you choose which parts you need, reducing the load quite a bit. Moreover, you greatly increase the number of HTTP requests when you could be bundling all JS/CSS libraries together.

Testing the difference

So I decided to do a comparison of three options: using a public CDN, hosting everything yourself bundled into one JS and one CSS file (Hosted), and hosting only a subset of the features (Custom). The demo was done using Assetic to minify and concatenate. It is available here and its source code here, but the subset looks like this:
  • jQuery
  • jQuery UI (including core and dependencies)
    • Dialog
    • Datepicker
  • Twitter Bootstrap
    • No responsive
    • Alert
    • Button
    • Modal
    • Tooltip
    • Tab
On the demo, you may notice JavaScript errors; I may have made some mistakes in the dependency order, but it does not change the idea.

The tests were done fully cached (Cached), with 304 requests (when forcing a refresh) and without any cache (No cache). Exec time is measured using Google Speed Tracer and includes parsing of JS/CSS and some JS execution. Keep in mind that the DOM was almost empty, so results could scale up a lot on crowded pages.

Results breakdown

              Custom     Hosted     Public CDN   Savings (Custom vs CDN)
Gzipped size  82.91 KB   110.13 KB  128.93 KB    36 %
Exec time     15 ms      22 ms      22 ms        32 %
Cached        110 ms     130 ms     150 ms       27 %
304           125 ms     155 ms     195 ms       36 %
No cache      220 ms     240 ms     250 ms       12 %
* The measured time is the onLoad event, so all files are downloaded, scripts are executed and the browser is ready.

Developer tools screenshots



Conclusion

Public CDNs are really effective but, if you configure your own CDN well, the two are pretty much on par. However, the difference is there and visible. As the page gets bigger and you add your own assets, this quickly becomes important. Also, do not be too quick to call this micro-optimization: in the order of 200 ms, it is clearly noticeable.

CDN resources

Thursday, October 25, 2012

Mind blowing jQuery plugin: Scroll Path

Joel Besada (@JoelBesada) did a jQuery plugin that lets you define custom scroll paths, including rotations and custom scrollbars.

The technique is not very new; a lot of video presentations use this kind of effect. However, he made a dead-simple implementation, available on GitHub, which lets us do it directly in the browser.

The result is stunning; it is smooth, mostly compatible and pretty flexible. This is a wonderful alternative to building a PowerPoint to show your ideas.

I have not used it yet, but it seems pretty easy. Check out the live demo!

Wednesday, October 24, 2012

Automatic cache busting using Git commit in Symfony2

Cache busting

Cache busting means rewriting the URL of an asset (JS, CSS, images, etc.) so that it is unique. This way, you can cache assets for as long as you want: if the URL changes, the new file will be used.

Symfony

Symfony includes a cache busting mechanism which appends a GET parameter to your asset, but you must remember to increment it and it could quickly become a pain.

Incrementing using Git version

The idea is to use the current Git commit as the version number, obtained with git rev-parse --short HEAD. We can call this on the command line from PHP. Don’t worry: in production mode, Symfony caches the compiled configuration, so the command won’t run again until you clear the cache.

Caveats

  • All assets will be reset with every commit. Not a big deal since when you deploy, you usually touch a lot of assets.
  • By default, Symfony uses a GET parameter, which is not cached by all CDNs.
    However, Amazon CloudFront now supports them.
    Otherwise, it is possible to rewrite using a folder prefix, but it can get tricky.
  • You must be using Git (!), including on your production server. Otherwise, you could achieve something similar by adding a Git hook on commits that writes the version to a file, and loading that file instead; a sketch follows.
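A rough sketch of such a hook, assuming the application reads the version from a file like app/config/assets_version.txt (a name invented for the example):

#!/bin/sh
# .git/hooks/post-commit: store the short commit hash for the app to use as its assets version
git rev-parse --short HEAD > app/config/assets_version.txt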

Friday, October 12, 2012

bcrypt with ircmaxell and how to use it


It is common knowledge that md5 is not secure for password hashing. It is almost worse than plaintext because it may falsely give an impression of security. People aware of this usually also consider sha1 insecure and go straight to sha256. Some techniques add further security, like salting or hashing multiple times, but ultimately the flaw remains: those methods are too fast. If you can hash a password in a fraction of a second, a standard brute force can too. That is why strong password security involves slow algorithms.

I was aware of all those principles, but @ircmaxell, a PHP contributor, made a video on password hashing with a nice comparison of different hashing functions, and it struck me how quickly even a sha512 is computed. I used to consider bcrypt a nice feature to add to a backend; I now realize it is a must.



And be sure to check his blog post with the slides and some other discussion (yes, we have the same blog template).

Now, this is all very cute, but the easy API Anthony talks about is only coming in PHP 5.5, so it will not be usable anytime soon.

Here are plugins/ways to integrate bcrypt into several platforms:

Two good libraries:
But really, it boils down to this:

Wednesday, October 10, 2012

Git tutorial and global configs

I was searching for inspiration for a good Git tutorial and I stumbled across a wonderful resource.

Sure, you could take a week off and read the official Git Book, but that was a bit lengthy for the crash course I wanted to give at work. Don’t get me wrong, the Git Book is wonderfully written and anyone serious about Git should at least skim it. However, if you don’t feel like it at the moment, head over to http://www.vogella.com/articles/Git/article.html

Also, here is my setup for gitconfig and gitignore. Just copy gitignore to /etc/gitignore or ~/.gitignore and change the corresponding path.
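The setting that points Git at a global ignore file is core.excludesfile; for example:

# tell Git where the global ignore file lives
git config --global core.excludesfile ~/.gitignore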

Batch update modules or themes of Drupal 6/7 in command line

To update Drupal modules, you need to download every module manually, which can quickly become tedious, especially if you have multiple modules and Drupal installations.

I wrote a simple script that parses the module's page and installs or updates to the most recent version.

This works for themes as well. To use it, simply go into a modules or themes folder and run one of the tools.

Valid folders are:
  • /modules
  • /themes
  • /sites/all/modules
  • /sites/all/themes
  • /sites/*/modules
  • /sites/*/themes
However, note that you should not try to update the root folders; they are reserved for core modules and are updated with Drupal itself.

drupal-install-module.sh will install or update one module, drupal-update-modules.sh will batch update all modules in current folder.

Tuesday, October 9, 2012

Verifying DNS propagation

When changing DNS settings, propagation can take anywhere from 15 minutes to two days. Clients and bosses are usually not very fond of this, so it is often a good idea to be ready with a better answer.

Finding your nameserver (NS)

Start by finding your nameservers; you should probably already know them. If not, registrars usually make them easy to find, and a simple Google search should get you started. You will have 2-5 nameservers, usually in the form ns1.registrar.com.

It is important to get the real information because NS propagation is part of the process.

Query your NS directly

To verify your settings, fire up a terminal and use dig. You can add MX to verify MX records. Basic dig syntax is like this:

dig [+trace] [MX] example.com [@ns1.registrar.com]

In our case, we query the NS directly so we use 

dig example.com @ns1.registrar.com

You should get an answer section giving you an A record, which is your IP address. If you get an error, your server is not configured properly; you can wait as long as you want, it will never work.

Verifying NS propagation

When a domain name is bought, the associated NS records are sent to the root servers. This is usually fairly quick (~20 minutes). Passing the +trace option to dig bypasses any local cache and queries the root servers directly. You will see 3-5 iterations until you get your answer.

dig +trace example.com

If you get an error, it usually means your registrar has not sent the new information to the root servers yet, or the root servers have not updated their cache. Verify your NS settings with your registrar and wait a bit. More than 30 minutes is very unusual and you should contact your registrar.

Verifying world propagation

Online tools exist to test the propagation against several nameservers around the world. I personally like http://www.whatsmydns.net/. Verify that the information is correct; once 80% of the servers agree, you can be fairly confident that everyone near you sees the same thing as you.

Clearing local and domain cache

Most enterprises and routers have a DNS cache to speed up resolution; you can restart your router to clear it. Fancier networks will have a way to do this cleanly.

To clear local cache, it depends on your system.
  • Windows: ipconfig /flushdns
  • Mac: dscacheutil -flushcache or lookupd -flushcache
  • Linux: Restart nscd and/or dnsmasq or equivalent
It may be tricky to get your client to do it though…

Contacting your ISP or bypassing them

If most servers around the world have been answering correctly for a couple of hours, you may want to contact your ISP and ask them what’s up. It is not uncommon for ISPs to have caches of a couple of hours to lower the stress on their servers. If they are not very collaborative, you can manually set new DNS servers for your network. Two fast and safe choices:

Google:

  • 8.8.8.8
  • 8.8.4.4

OpenDNS:

  • 208.67.222.222
  • 208.67.220.220

Testing your Regex skills

Regular expressions are powerful, yes. They should not be used for everything, especially not HTML: HTML is not a regular language. As soon as you need some logic, like checking dates, they are out as well. However, they are tremendously helpful for many things. I suggest regular-expressions.info for a very good round-up and a nice collection.

Now that you know you must not use them everywhere, nothing stops you from trying. For example, in a text editor, you might want to write a simple regex to match some HTML just to find where it is in your code; it does not have to be perfect, just good enough. @callumacrae has made a nice website where he posts a challenge every Tuesday: http://callumacrae.github.com/regex-tuesday/

Here are my answers so far. They are not perfect and I have taken quite a few shortcuts, but they should get you started.

  1. Repeated words
  2. Grayscale colours
  3. Dates
  4. Italic MarkDown
  5. Numbers
  6. IPv4 address

WebPlatform.org – When giants get together

Starting to build websites can be very hard: keeping up with the technology is a must and building cross-browser is a daily struggle. A lot of great resources are available to learn, share and experiment but, in the end, you must often spend hours to find a way to make exactly what you want.

The correct way of doing things is always changing. One of the worst cases is embedding a video. Do we use a Flash player, a video tag, or both? Perhaps a JavaScript fallback on top of it? What about codecs and fullscreen? These things keep evolving, adjusting to new technology and the new hacks that are found.

Now, the giants have decided to do something about it and help us poor souls. Featuring big companies and big names, here is the launch video of a new wiki called WebPlatform.org. It is really at an alpha stage but, with all those names on it, they might as well have called it thedocumentation.org.


Also, as a bonus, check out a TED talk from Tim Berners-Lee, who is merely credited with inventing the World Wide Web: http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html

Wednesday, October 3, 2012

Going further

So, four and a half years after starting my bachelor's degree in Computer Science at the University of Montreal, I am done. It has been a long and painful process during which I struggled with myself over whether or not to complete it. Had I known what was awaiting me, which is a lot of math and not a lot of programming, I would have gone elsewhere. Don’t get me wrong, this program is very good, but it is aimed at people who love optimization, math, theoretical computer science and quantum computing, which are neither my strengths nor particularly my fields of interest. I guess it is my fault anyway; I should have checked beforehand.

Anyway, now that it is over, it is time to put it into practice. I worked for four years in a mid-size Web company and I have now decided to join a rather new startup and jump in head first. I worked with them for a while as a consultant and they do business the way I like it: great company spirit, nice ideas, a willingness to use new and open-source technologies, and decent business hours. Come and see us!

I always told myself that a nice work environment where you feel comfortable and continuously challenged is the most important thing. Hence, I will continue on the same path, sharing the interesting stuff I find while exploring the different Web technologies and trying to make some sense of all this HTML/CSS/JS/PHP/SQL/WTV.

Also, I finally decided to create a LinkedIn profile and a professional Twitter account.

See you on the Interwebs!

Tuesday, September 18, 2012

Protect Webserver against DOS attacks using UFW

Ubuntu comes bundled with UFW, which is an interface to iptables. iptables is basically a very lightweight router/firewall inside the Linux kernel that runs way before any other application.

A typical UFW setup allows HTTP(S), rate-limits SSH and shuts everything else. This is not a UFW or iptables tutorial; you will find plenty of online help for all your needs. However, I personally had a lot of difficulty finding good documentation on how to protect a server against HTTP attacks.

A lot of HTTP requests is normal

The problem is that HTTP can get very noisy. A typical Web page can easily have up to a hundred assets but usually, if you receive 100 requests in a single second from the same client, it means you are under siege. If you really need 100 assets on a single Web page, you need a CDN, not a better server.

Rate limiting

These rules were mostly arrived at through trial and error and some searching around the Web; tweak them to fit your needs. A rate limit of x connections per y seconds means that if x connections have been initiated in the last y seconds by a given profile, further packets will be dropped. Dropping is actually a nice protection against flooding because the sender won't know the packets were dropped. He might think the packet was lost, that the port is closed or, even better, that the server is overloaded. Imagine how nice: your attacker thinks he succeeded while you are in fact up and running and he is blocked.

Connections per IP
A connection is an open channel. A typical browser will open around 5 connections per page load and they should each last under 5 seconds. Firefox, for example, has a default maximum of 15 connections per server and 256 in total.

I decided to go for 20 connections / 10 seconds / IP. 

Connections per Class C
Same as above, but this time we apply the rule to the whole class C of the IP, because it is quite common for an attacker to have a bunch of IPs available. This means, for example, all IPs of the form 11.12.13.*.

I decided to go for 50 simultaneous connections.

Packets per IP
This is the challenging part. Due to a limitation that is not easy to circumvent, it is only possible to keep track of the last 20 packets. At the same time, it could add considerable overhead to track 100 packets for each IP. A big website may eventually need more than this but, as I said, you should then look into a proper CDN.

I decided to go for 20 packets / second / IP

Configuring UFW

The following instructions are targeted at UFW, but it is really just a wrapper so it should be easy to adapt them for a generic system.

Edit /etc/ufw/before.rules, putting each part where it belongs
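The actual rules live in the linked gist; as a rough illustration of the limits described above, here they are expressed as plain iptables commands rather than before.rules syntax (tweak the numbers to your needs):

# 20 new connections / 10 seconds / IP on port 80
iptables -A INPUT -p tcp --dport 80 -m state --state NEW -m recent --set --name HTTP_CONN
iptables -A INPUT -p tcp --dport 80 -m state --state NEW -m recent --update --seconds 10 --hitcount 20 --name HTTP_CONN -j DROP

# 50 simultaneous connections per class C (/24)
iptables -A INPUT -p tcp --syn --dport 80 -m connlimit --connlimit-above 50 --connlimit-mask 24 -j DROP

# ~20 packets / second / IP (the recent module tracks at most 20 packets by default)
iptables -A INPUT -p tcp --dport 80 -m recent --set --name HTTP_PKT
iptables -A INPUT -p tcp --dport 80 -m recent --update --seconds 1 --hitcount 20 --name HTTP_PKT -j DROP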

Make sure ufw runs and reload everything using ufw reload.

Testing the results

Make sure everything runs smoothly by refreshing your browser like a madman. You should start getting timeouts after ~15 refreshes and things should come back in less than 30 seconds. This is good.

But if you want to get serious about your tests, some tools can help you bring your server to its knees. It is highly discouraged to use them on a production server, but it is still better to do it yourself than to wait for someone else to try.

Try those with UFW enabled and disabled to see the difference, but be careful: some machines may downright crash on you or fill all available space with logs.
  • http://ha.ckers.org/slowloris/
    Written in Perl, features a lot of common attacks, including HTTPS
  • http://www.sectorix.com/2012/05/17/hulk-web-server-dos-tool/
    Written in Python, basic multi-threaded attack, very easy to use.
  • http://www.joedog.org/siege-home/
    Compiled, available in Ubuntu repositories, very good to benchmark
  • http://blitz.io/
    Online service where you can test freely with up to 250 concurrent users
To confirm that everything works perfectly, SSH into your machine and start tail -f /var/log/ufw.log to see the packets being dropped, and htop to watch the CPU have fun.

SSH into another machine and start one of the tools above. You should see the CPU skyrocket for a few seconds and then go back to normal. Logs will start to appear and your stress tool will start having problems. While all this is going on, you should still be able to browse your website normally from your own computer.

Great success.

Friday, September 14, 2012

Generate missing Nginx mime types using /usr/share/mime/globs

Nginx comes with a rather small set of mime types compared to a default Linux system

Linux uses a glob pattern to match a filename while Nginx matches only extensions, but we can still use every glob of the form *.ext.

So here is a small PHP script converting/sorting/filtering/formatting everything into a nice output.
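The script itself is PHP; a rough awk approximation of the same idea, keeping only simple *.ext globs and printing them in the format of Nginx's types block:

# turn "mime/type:*.ext" lines into "    mime/type ext;" entries
awk -F: '!/^#/ && $2 ~ /^\*\.[A-Za-z0-9]+$/ { printf "    %s %s;\n", $1, substr($2, 3) }' /usr/share/mime/globs | sort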


Wednesday, September 12, 2012

Configure ElasticSearch on a single shared host and reduce memory usage

ElasticSearch is a powerful, yet easy to use, search engine based on Lucene. Compared to others, it features a JSON API and wonderful scaling capabilities via a distributed scheme, and its defaults are aimed towards such scalability.

However, you may want to use ElasticSearch on a single host, mixed with your Web server, database and everything else. The problem is that ES is quite a CPU and memory hog by default. Here’s what I found through trial and error and some heavy searching.

The idea is to give ES some power but leave some for the rest of the services. At the same time, if you tell ES it can grab half of your memory and the OS ends up needing some, ES will get killed, which isn’t nice.

My host was configured this way:
  • ElasticSearch 0.19.9, official .deb package
  • Ubuntu 12.04
  • 1.5GB of RAM
  • Dual-Core 2.6ghz
  • LEMP stack
After installing the official package:
  1. Allow user elasticsearch to lock memory
    1. Edit /etc/security/limits.conf and add:
      elasticsearch hard memlock 100000
  2. Edit the init script: /etc/init.d/elasticsearch
    1. Change ES_HEAP_SIZE to 10-20% of your machine, I used 128m
    2. Change MAX_OPEN_FILES to something sensible.
      Default is 65536, I used 15000
      Update: I asked the question on ElasticSearch group and it may be a bad idea, without giving any advantage.
    3. Change MAX_LOCKED_MEMORY to 100000  (~100MB)
      Be sure to set it at the same value as 1.1
    4. Change JAVA_OPTS to "-server"
      I don’t exactly know why, but if you check in the logs, you will see Java telling you to do so.
  3. Edit the config file: /etc/elasticsearch/elasticsearch.yml
    1. Disable replication capabilities
      1. index.number_of_shards: 1
      2. index.number_of_replicas: 0
    2. Reduce memory usage
      1. index.term_index_interval: 256
      2. index.term_index_divisor: 5
    3. Ensure ES is bound to localhost
      network.host: 127.0.0.1
    4. Enable blocking TCP because you are always on localhost
      network.tcp.block: true
  4. Flush and restart the server
    1. curl localhost:9200/_flush
    2. /etc/init.d/elasticsearch restart

Monday, September 10, 2012

Make your dev machine public using a VPN and a proxy

There are many reasons to need a publicly available machine:
  • Developing a Facebook App
  • Developing an OAuth authentication
  • Make a quick showcase for your client
The problem is, you probably want it to be your own machine because you have all your development tools and IDEs, it is fast and you know it by heart. You could mount a folder using sshfs, but that is only part of the job and it can get very slow with some file editors.

The solution I came up with is to tunnel through a VPN to a public machine and let it proxy the requests back.

You will need

  • A public Linux box with root access
  • A domain name where you can setup a wildcard

Instructions, tested on Ubuntu 12.04

  1. Install apache or nginx and pptpd (you can follow this tutorial for the VPN or this one if you are using ufw)
  2. In your /etc/ppp/chap-secrets file, be sure to specify a fixed address for yourself (4th column)
    • It must fit the IP range specified in /etc/pptpd.conf
  3. Create a DNS wildcard pointing to your server 
    • Ex: CNAME *.dev.lavoie.sl => server.lavoie.sl
  4. Create an Apache or Nginx proxy to match a server wildcard and redirect it to the VPN IP decided before
  5. Create the same wildcard on your machine to answer correctly.

Security considerations

If you have unprotected data like phpMyAdmin or other websites you are developing, they could be at risk; consider protecting them with a password or an IP restriction.

Configuration example for Apache and Nginx

Monday, August 27, 2012

MySQL becoming more and more closed source, thanks to Oracle

As highlighted on Slashdot and Hacker News, the test suites and the commit log of MySQL are no longer bundled with the open-source distribution.

Oracle has a history of sabotaging open-source projects; they do not fit well with its business model.

VirtualBox is still alive, but for how long?

MySQL is not the only RDBMS, but it is the most popular free one. MariaDB claims to be an enhanced, drop-in replacement for MySQL but, if I were to switch, I would strongly consider PostgreSQL. It is robust, highly scalable, mature and feature-rich.

PHP frameworks and CMS with support for PostgreSQL

And probably much more.

Harvard cracks DNA storage, crams 700 terabytes of data into a single gram

Two scientists at Harvard’s Wyss Institute have successfully stored 700 TB of data in DNA.

To give you an idea of how big this is, it would mean more than 35000 years of HD YouTube video (720p at 5kbps) in a single gram. HD video is indeed a difficult thing to store; it is even mentioned in the video. This means it is almost 60000 times denser than a 3 TB hard drive, and it should fail a lot less often.

While they were able to generate that amount of data, most of it was duplicated. DNA replication is really fast; encoding the data, however, is much slower. They do not elaborate on read time either. DNA USB keys are not coming tomorrow but, for long-lasting and space-hungry data, it is very promising.

A rather long time ago, tape devices filled a similar role. They are now pretty much obsolete, but they made the first backups possible.


Monday, August 20, 2012

Convert the UdeM schedule to iCal format

Friends at UdeM,
The interface for viewing student schedules is truly enough to make you cry.
Here is a solution to convert it all to an .ics file compatible with Outlook, Gmail, iCal, etc.
  1. Use Chrome
  2. Go to the guichet étudiant and open your schedule.
  3. Once on the schedule page, open the Chrome developer tools: Ctrl-Shift-I
    On Mac, it is Cmd-Alt-I.
  4. Click on the last tab, called “Console”
  5. Where the cursor is blinking, paste this:

Everything should run automatically and download a file named download.ics

If it does not work, let me know!

Monday, August 13, 2012

PHP Coding standards

For more than a year, some influential PHP programmers from the most active projects in the community have been working on coding standards.

The group is named the PHP Framework Interoperability Group and is composed of, but not limited to, authors from these projects:
  • phpBB
  • PEAR
  • Doctrine
  • Composer / Packagist
  • Joomla
  • Drupal
  • CakePHP
  • Amazon Web Services SDK
  • Symfony
  • Zend Framework

Why coding standards?

You may be a fan of, for example, naming your functions with underscores or using tab indentation, but really, that is not the point. The goal is to be able to use code from other authors and projects without having to "fix" the code style to be consistent with your own project.

In the future, the group also aims to provide some interfaces so that implementations from different projects can work together.

Accepted standards

  • PSR-0 Code structure, Class and function naming
  • PSR-1 Basic coding standards
  • PSR-2 Coding style standards (mostly whitespace)
See full repository: https://github.com/php-fig/fig-standards

I don’t want to rewrite all my code!

Well, you probably don’t need to.

Chances are that you are already pretty close to PSR-0 if you have organized your classes to be autoloaded. If you haven’t, you should really look into it: autoloading eliminates the need to require classes, greatly simplifying class dependencies.

Once you comply with PSR-0, there is a tool that will do almost all of the hard work for you by fixing all the whitespace. It is called PHP-CS-Fixer and comes from Fabien Potencier, a member of the group.

You can try PHP_CodeSniffer, but I personally find it a pain to use because it only 'validates' and makes some mistakes. It probably needs a rewrite.

Some editors have plugins:
You can also add a Git commit hook to patch files on the fly.
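A minimal sketch of such a hook, assuming php-cs-fixer is on your PATH:

#!/bin/sh
# .git/hooks/pre-commit: fix the coding style of staged PHP files before the commit
for file in $(git diff --cached --name-only --diff-filter=ACM | grep '\.php$'); do
    php-cs-fixer fix "$file" && git add "$file"
done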

Further reading

Great blog post from Paul M. Jones.

Reducing the load of all those social plugins

Some JavaScript libraries are big and put a strain on the browser when they are all loaded at the same time.

Most of these libraries suggest loading them asynchronously.

Library examples

  • Google Analytics
  • Google Maps
  • Google Plus
  • Facebook Like
  • Twitter button

Problems with existing solution

  • A lot of ugly script tags with semi-minified code.
  • Most of the libraries suggest the same trick, so we see code duplication.
  • You have to copy-paste the code each time you start a new project.
  • If you have multiple libraries, they will all fire at the same time, possibly stressing the browser.
  • While the libraries are loading (usually with a lot of dependencies), the browser is sluggish, and users probably want to read that article before clicking 'Like'.

Solution

  • Load each library with a function taking an id, a url and a delay.
  • Allow each library to have a different delay.
    • You may want Google Maps to load only 200ms after the page, but the Facebook Like can wait a bit more.
  • Code on Github
  • Working example on jsfiddle

Wednesday, August 1, 2012

Is the online ad industry a fraud?

As highlighted by Maclean's, the Advertising Research Foundation did an experiment showing that:
A blank banner ad received more clicks than the average Facebook ad, twice as many as your average “branded” display ad (a static ad which promotes a brand rather than a specific offer or call to action), and only one click in ten thousand less than the average banner ad.
While we all hate advertising when browsing, it provides nice revenues and is often how a website is able to live.

What the ARF shows us is that the companies paying for advertising may be paying for air or, what is more likely to be true, inflated data.

They also asked the users if it was a mistake or if they were curious about a “blank”:
The average click-through rate across half a million ads served was 0.08%, which would be good for a brand campaign, and so-so for a direct response campaign. We detected no click fraud in the data we counted. Half the clickers told us they were curious, the other half admitted to a mistaken click.
The conclusions are yours to make.

Solving complex network problems using… fungus

Here is a part of a documentary on decay that aired on BBC.

The part I am talking about is at 1:03:45 and demonstrates a slime mold searching for its food, optimizing the network path between food sources and even creating some backup routes.

This is quite interesting considering that network resolution is a hard problem that requires quite a bit of processing.


P.s. The rest of the show is quite fascinating as well

Generate a random string of non-ambiguous numbers and letters.

I needed to generate a unique code, suitable for URLs and somewhat short so it doesn’t scare the user.

Doing md5(microtime()) is nice but it is 32 chars long. I could use base64, but it contains weird characters like "/" and "=".

So how do I restrict the encoding to alphanumeric characters? I found inspiration in Ross Duggan's snippet and added a random function. He also removes a couple of ambiguous characters, which I think is a nice touch.

The choice of random could be discussed but the idea is there.

UPDATE: As suggested by a friend, it is simpler and a better use of the range of available entropy to generate each character separately. You can see the old version in the Gist.

Tests included
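The gist is in PHP, but the same idea can be illustrated in one line of shell, picking each character independently from a non-ambiguous alphabet:

# 8 random characters, skipping ambiguous ones like 0/O and 1/l/I
tr -dc '2-9a-km-zA-HJ-NP-Z' < /dev/urandom | head -c 8; echo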

Friday, July 13, 2012

Introducing Dotfiles Builder

Managing bashrc sucks

We all have our nice little bashrc that we are proud of. It tests for files, programs and terminal features, detects your OS version, builds a PATH, etc. Various solutions exist to handle all of our OSes and different setups.

Keeping several versions

Pros:

  • Ultimate fine-tuning
  • Easy to understand
  • Usually optimized for every setup

Cons:

  • Very time consuming to manage
  • Hard to “backport” new ideas

Keep a single unusable file with everything and edit accordingly

Pros:

  • Easy to backport, you just need to remember to do it
  • Good performance
  • Since you edit at each deployment, nice fine-tuning capability

Cons:

  • The single file can become unbearably cluttered.
  • You eventually end up managing several version.
  • Tedious to edit at each deployment

Include several subfiles

Pros:

  • Still has a lot of fine-tuning capability
  • If well constructed, can be easy to understand
  • Easy to deploy new features

Cons:

  • Hard to detect which file to include
  • Multiplies the number of files to manage
  • Slow performance
Until recently, this was my preferred method.

Wanted features

So, what does a good bashrc have?

Should have:

  • Good performance. On a busy server, you really don't want to wait 5 seconds for your new terminal because your I/O is skyrocketing.
    • Reduce number of included files
    • Reduce tests for environment features
    • Reduce tests for program and files
  • High flexibility
    • Cross-OS compatible
    • A lot of feature detection
    • Ideally, configuration files
  • Ease and speed of configuration
    • It should not take more than a minute to setup a new bashrc
    • If you need to specify your developer email, it would be nice to do it only once.

Yes, you read that right: reduce tests AND do a lot of feature detection. You don't want to do Java-specific configuration or set an empty variable if Java is not even installed, but you do want Java to be detected automatically.

Generating a bashrc

Let's face it, you will install or remove Java way less often than you will start a new shell. Why then test for Java at each new shell?

This is where I introduce the Dotfiles Builder. The script runs in Bash and outputs the wanted bashrc.

This way, instead of doing:

if [ -d "$HOME/bin" ]; then
  PATH="$HOME/bin:$PATH"
fi

You would do:

if [ -d "$HOME/bin" ]; then
  echo "PATH=\"$HOME/bin:$PATH\""
fi

And the result would simply be

PATH="$HOME/bin:$PATH"

But constructing PATH is a rather common task and you want to make sure the folder is not already in your PATH. Why not wrap it up?
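For example, the builder could emit a small helper along these lines (a sketch, not necessarily what the project generates):

# prepend a directory to PATH only if it exists and is not already there
path_prepend() {
  if [ -d "$1" ]; then
    case ":$PATH:" in
      *":$1:"*) ;;
      *) PATH="$1:$PATH" ;;
    esac
  fi
}

path_prepend "$HOME/bin"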

Take a look at the alpha version: https://github.com/lavoiesl/dotfiles-builder/
As well as the example output.

This is a very alpha version of the intended program, but I still want to share what I have and maybe get some feedback and collaborators along the way. Currently, it only generates a bashrc, but expect more to come.

Thursday, July 12, 2012

Use credentials in /etc/mysql/debian.cnf to export MySQL database

A common task is to dump a database for backups. You may even want to do this in a cronjob to take regular snapshots.

A very bad solution

A very bad solution is to hardcode the root password in the cronjob or in your backup script; doing so has a very high chance of exposing your password.

  • It may appear in the cron.log
  • It may be sent by email if you have an error
  • It may appear in your history
  • It is a bad idea to do your backups using the root account

A better solution

You could create an account with read-only access to all your databases and use it to do your backups. This is indeed better, but it can lead to the same issues mentioned above.

Putting the password in a file

The safest way to use passwords on the command line is to store them in a file and have a script load them when needed. You then just need to make sure those files have the correct permissions.

An “already done for me” solution

As it turns out, the MySQL packages on Debian/Ubuntu create a user called debian-sys-maint. It is used for MySQL management, mainly through the package manager. This user has all the privileges needed to back up your databases and you can be sure it will always work, unless of course you manually change the password without updating the file.

This script uses sudo, so it will ask for your password even if you forgot to prepend sudo.
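Under the hood, the trick is essentially to point mysqldump at that credentials file; a minimal sketch:

# dump a database using the credentials stored in /etc/mysql/debian.cnf
sudo mysqldump --defaults-file=/etc/mysql/debian.cnf my_database | gzip > /tmp/my_database.sql.gz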

Typical usage

$ export-database.sh my_database [mysqldump options] | gzip > /tmp/my_database.sql.gz

PHP script to replace site url in Wordpress database dump, even with WPML

Wordpress has the nice habit of storing every URL with its full path in the database. You can of course hardcode the HOME_URL and SITE_URL in wp-config.php, but that won't change the references to your media, serialized strings, encoded HTML, etc.

The only solution is really just to edit the database. At least, I haven't found a better solution.

Usage

wordpress-change-url.php http://old-domain.com https://new-domain.com < database.orig.sql > database.new.sql

Or

wordpress-change-url.php database.orig.sql https://new-domain.com > database.new.sql

Will output all remaining mentions of http://old-domain.com to stderr

Apache VirtualHost Template with variable replacement

This is in no way a robust solution for deploying a real web hosting infrastructure, but sometimes, you just need basic templates. I use this simple template on my dev server.

Remove temporary and junk files from Windows and OS X

One of the most annoying things with being able to see all files in the terminal is that… you see all the files. That includes backups, swap and temp files.

Well, it’s rather a good thing: it reminds you to remove them once in a while.
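The script boils down to a find one-liner; something along these lines (a sketch, delete at your own risk):

# remove editor backups, swap files and OS junk from the current tree
find . -type f \( -name '*~' -o -name '*.swp' -o -name '.DS_Store' -o -name 'Thumbs.db' \) -print -delete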

Monday, July 9, 2012

Document Root fix in .htaccess when using VirtualDocumentRoot

CMS often come with a .htaccess that has a RewriteRule like this:

# RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.*)$ app.php [QSA,L]

The RewriteBase has to be enabled if you are using a VirtualDocumentRoot but when you are sharing code, developers may all have different setups.

By checking the content of DOCUMENT_ROOT, we can guess which setup we are using and prepend a / when necessary.

Note however that this method relies on string comparisons, which are slow, so it should not be used in production.

Writing a PHPUnit test for Symfony2 to test email sending

The idea is to test that an email is sent by a contact form by inspecting the SwiftMailer collector.

We generate fake unique content and loop through all sent emails to verify that it was sent.

Note that this does not actually test whether the email reaches a server; it merely tells you whether Symfony is trying to send it.


Inspiration: http://docs.behat.org/cookbook/using_the_profiler_with_minkbundle.html

Sunday, July 8, 2012

Validating emails using SMTP queries

Email validation is traditionally done in three flavours. Each has its pros and cons, but all are widespread; let us go through them.

Regex validation

This is perhaps the simplest validation. Whether you do it via <input type="email">, JavaScript or a server-side language, it is basically the same technique: checking the format of the email. The problem is that almost any format is valid [Wikipedia examples]. Most regexes only validate a subset of the RFC, even though I doubt anyone uses ""@example.org as their email.


Upside: Very early validation, which can be done client-side to prevent common errors.
Downside: It only checks the format; even if you support the full RFC, it remains a rather "light" validation.

MX record check

An MX record basically says at which IP address email for a domain will end up. For example, gmail.com points to 173.194.77.27. Doing an MX record check means checking whether the domain of the email address is valid and points somewhere; if it does, assume the server there knows how to handle the email. This is done server-side.

Upside: You do a real test on the domain name; it will prevent errors like bob@gmai.lcom.
Downside: You do not test whether there is actually a mail server accepting emails for this domain, and the full address is not validated.

Real validation using a confirmation email.

The most common and secure way to validate an email is to actually send one and wait for the user to interact with it: a confirmation code, a link to click, etc. You really are sure the email exists, but it is not very error-friendly: if the email is invalid, how will you or the user know? Sure, you can set up a daemon that receives and parses bounce-backs, but how will you let the user know, and do you seriously want to do that? It also won't tell you when you hit a catch-all address that forwards all unknown emails to another account.

Furthermore, it usually annoys the user.

Introducing SMTP query

I wrapped up some code I found that asks the SMTP server whether the email account exists. Here it is on GitHub: https://github.com/lavoiesl/smtp-email-validator

It starts with an MX record check and then, using the SMTP protocol, opens a socket to the server and starts writing an email: HELO, MAIL FROM and RCPT TO. The last one contains the address you want to test. Now three things may happen:
  1. The email is known and the server replies 250 => the email is valid.
  2. The email is greylisted or some minor error occurs (450 or 451) => the email should be valid.
  3. Any other error => the email is invalid.
This is a rather quick test to do and you are now pretty sure the email is valid.
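You can reproduce the dialogue by hand to see what the library does; dig and nc are assumed to be installed, and mail.example.org and the addresses are placeholders:

# find the mail server for the domain
dig +short MX example.org

# then talk to it without ever sending DATA; watch the reply code after RCPT TO
{ sleep 2; echo "HELO validator.example.com";
  sleep 2; echo "MAIL FROM:<check@validator.example.com>";
  sleep 2; echo "RCPT TO:<someone@example.org>";
  sleep 2; echo "QUIT"; } | nc mail.example.org 25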

Conclusion

The complete way to handle email validation would be:
  1. Handle common format issues using client-side and/or server-side regex validation.
  2. MX record check
  3. SMTP query
  4. Send a confirmation email, not asking for any validation but explaining to the user that they should contact you if they do not receive an email.