Tag Archives: programming

Mysql Client Switch for Vertical Output

Did you ever need to examine one line from a MySQL table and were unable to line up the column headers with the data because the line was too long?

Well no fears, it turns out the MySQL client has a switch (\G) you can use to arrange output vertically for easier reading. This is very handy when you are looking at only one line. Here’s an example:

mysql> SELECT * FROM emp_vw WHERE id = 10110 LIMIT 1 \G
*************************** 1. row ***************************
id: 10110
company_id: 21100
last_name: Stall
first_name: John
gender: MALE
email: test@example.com
address: 277 142nd Ave
city: New York
state: NY
phone: (888)555-1212
birthday: 1963-05-12
active: 1
timestamp: 2011-11-30 18:58:50
1 row in set (0.01 sec)

This is a great way to examine the content of the fields of any table for the purposes of investigation. It works very well for views and tables with many columns that form long lines when displayed horizontally. This can also be applied to your my.cnf file, but I prefer to leave it off so it can be enable with the MySQL client \G switch only when needed.

Optimizing Mysql Data Types

Optimizing MySQL datatypes is an important task for anyone, whether you are running a single website or a cluster of geographically distributed MySQL machines. Efficiency in your database driven application all begins with selecting the right datatypes for your columns.

In the process of a recent project I had to get data from a third party source and store it in a local MySQL database, for which we chose to build a dedicated MySQL server machine with a RAID 10 disk array. I had no idea what the data we retrieve would look like. All we had was a list of columns and nothing else. In order to get started we had to guess at the length of the data based on the column names assuming they were descriptive of the data they held. An initial table contained column names followed by simple VARCHAR(255) declarations. Totally unacceptable, but it was a start.

What you got there?

Once the data came back I had to examine each column in each table to see what they held. There are many ways of doing this including using MAX() and MIN() funtions like so:

mysql> SELECT MAX(LENGTH(first_name)) FROM emp;
| MAX(LENGTH(first_name)) |
| 14 |

This tells me that the maximum length of an employees first name is only 16 characters long. So why a VARCHAR(255) or a CHAR(100) or anything else? Let’s look at an example of the longest names in the table:

mysql> SELECT DISTINCT(first_name) FROM emp WHERE LENGTH(first_name) = 14;
| first_name |
| Thomas Michael |

That query gives us only the distinct uses of first_name that are 14 characters long. An examination of the real data is important when selecting datatypes, so you may want to do this again using > 10 in place of = 14, and you can even include a count of how many employees have large first_names, and any other queries that may help you understand the data you are working with.

The Average Joe

Another useful query is to determine the average length of the first name column:

mysql> SELECT AVG(LENGTH(first_name)) FROM emp;
| AVG(LENGTH(first_name)) |
| 5.9771 |

With the average length of a column being only 6 characters, does it make sense to define a fixed field of CHAR(14) to accomodate the longest field when only one employee has a first name that long? Would it be better to use a VARCHAR column anyway? A quick examination of how many names have each character count is required:

SELECT LENGTH(first_name) AS size,
FROM emp
GROUP BY LENGTH(first_name);

| size | COUNT(*) |
| 1 | 0 |
| 2 | 21 |
| 3 | 85 |
| 4 | 626 |
| 5 | 1259 |
| 6 | 1626 |
| 7 | 958 |
| 8 | 345 |
| 9 | 183 |
| 10 | 38 |
| 11 | 37 |
| 12 | 9 |
| 13 | 9 |
| 14 | 1 |


This tells me that most people in our system have first names with an average character length between 4 and 7. Posessing and understanding this information will allow me to make decisions on the best column type. This information is also valuable when creating partial indexes on VARCHAR columns.

Procedure Analyse

One function available with MySQL is PROCEDURE ANALYSE. The manual for it is here:


Replace 5.0 in the URL with your version.

The description from the manual for ANALYSE is:

ANALYSE() examines the result from a query and returns an analysis of the results that suggests optimal data types for each column that may help reduce table sizes.

That just took the guesswork out of your column declaration optimizations! Knowing your data is still critical because ANALYSE is still capable of guessing wrong, but it is another tool to assist you. To view the output of ANALYSE for the first name column, issue a query as follows:

SELECT first_name
FROM emp

| Field_name | Min_value | Max_value | Min_length | Max_length | Empties_or_zeros | Nulls | Avg_value_or_avg_length | Std | Optimal_fieldtype |
| schema.emp.first_name | A | Zyrida | 1 | 16 | 0 | 0 | 5.9771 | NULL | VARCHAR(16) NOT NULL |
1 row in set (0.01 sec)

As you can see in the output, PROCEDURE ANALYSE() returns a single row with:

  • the column name
  • the MIN and MAX values (not important for strings)
  • the MIN and MAX lengths (queries we ran manually above)
  • how many columns are empty or NULL
  • the AVG value (for numbers) or AVG length (for strings)
  • the population standard deviation (STD)
  • a recommended column definition called Optimal_fieldtype

Now, you have to be careful with this procedure when considering accepting the Optimal_fieldtype when passing the default arguments to PROCEDURE ANALYSE(). In a lot of cases, even in the case of a BOOLEAN type, it will return an ENUM, such as ENUM(‘0’, ‘1’). Close attention to detail is important in these cases. In the following example the column only holds one possible values at this time. It is an unfortunate case, but the data that is being returned to me uses this as a foreign key and as part of a primary key on many tables.

FROM company
| Field_name | Min_value | Max_value | Min_length | Max_length | Empties_or_zeros | Nulls | Avg_value_or_avg_length | Std | Optimal_fieldtype |
| schema.company.id | 20000 | 20000 | 5 | 5 | 0 | 0 | 20000.0000 | 0.0000 | ENUM('20000') NOT NULL |
1 row in set (0.00 sec)

In actuality, this ID is a numeric value, and further numeric ID values may be added over time, so a numeric data type would be more approriate than an ENUM. In order for PROCEDURE ANALYSE to return a correct number to us we’ll have to pass some arguments to it. According to the manual, the arguments are as follows:

max_elements (default 256) is the maximum number of distinct values that ANALYSE() notices per column. This is used by ANALYSE() to check whether the optimal data type should be of type ENUM; if there are more than max_elements distinct values, then ENUM is not a suggested type.

max_memory (default 8192) is the maximum amount of memory that ANALYSE() should allocate per column while trying to find all distinct values.

Retrying that query with arguments (1,24) we get:

FROM company

| Field_name | Min_value | Max_value | Min_length | Max_length | Empties_or_zeros | Nulls | Avg_value_or_avg_length | Std | Optimal_fieldtype |
| schema.company.id | 20000 | 2000 | 5 | 5 | 0 | 0 | 20000.0000 | 0.0000 | SMALLINT(5) UNSIGNED NOT NULL |
1 row in set (0.00 sec)

As you can see, optimize is now suggesting an Optimal fieldtype of SMALLINT(5) UNSIGNED NOT NULL which has a numeric value range of 0 to 65535 which is adequate for the purposes of our current and future ID allocation in this column.

Planning for the future

One thing to remember is that the data you are seeing today isn’t necessarily representative of the data you’ll get in the future, unless you have a really large set defined by someone else. You’ll have to be careful with the data types you define. Don’t make them unnecessarily large, but you can plan some wiggle room from the PROCEDURE ANALYSE optimal value. Use your mind here.

If you are creating your own data from scratch such as with a form entered on a website, then this technique will be most usefully applied by you after you’ve been up and running for some time and have some real data to analyze. If you are working with a small schema you’ll be able to ALTER your columns to the new data types after the fact.

Bad PHP Code – Blogs and CMS

Lately people have been blaming PHP for being insecure. While most of them seem to blame little tiny issues that, when used incorrectly by the programmer, make their scripts insecure. Does that make sense? The programmer uses it insecurely, but the problem is in the language. Absurd. Guess what, you can do that with any language. You can make your house insecure too by failing to fasten the door locks. But more to the point, after examining some of the Blog and CMS software for PHP I’ve come to this conclusion – if that kind of programming is general across the board for these packages then we have a bigger problem.

Blame the Messenger

The first thing we need to know is that we are not all knowing.

Often times we get people learning programming from PHP because it is so easy. That’s what I did back in the days of PHP 2. PERL was just a little too hard for me, and in 1995 I had a real hard time scouring information on the web to figure out what CGI was. But I went from getting my feet wet in PHP to learning C to going to college, then going into full-time work. I was lucky.

When I think back to my early days of PHP I can see how we can be dangerous. We typically focus on getting something done, not really how to do it or the best way or even the most secure way. I was so proud when I built the unthinkable using PHP the first time.

The problem when we are that young in a skill is that we think, “OK, I’ve learned this skill, let’s move on.” What we are forgetting is that we know a language but we don’t know how to use it. We’ve figured out where to put the pieces to form sentences – that is, how to write code to not get errors. The problem is, we still don’t know how to have a conversation.

What I’m getting at here is that sometimes we forget that we need to consider why we write code. We need to understand security. We need to know how the web works. We need to know how a browser renders HTML or CSS or XML or whatever you are working with. We need to know the consequences of using a database system, the overhead caused by connects. We need to know how to bug track, profile and optimize for best efficiency. There is much more work to do once you your code “just works.”

Extending the Problem

That is what happens when we are young. Then we explore, then we work, then we forget. That’s me 🙂

As we move along and take on big projects we need to keep on moving upward. From what I’ve experienced, it is common for people to get interested in one area and focus on that. We know programmers who don’t know a thing about hardware. They don’t know what a stack or register is even. On the other side, we have the hardware pros who think they can program because they know the basics of a language. They don’t know what a stack or register is even 🙂

CMS and Blogs

How does this apply to Blog software? I really don’t know. I can’t tell what those individuals and development teams were thinking when they spec’ed out their project. IF they did that. I don’t know what they were thinking when they decided to put PURE PHP code inside template files made for HTML coders, graphics designers and common citizens. Seeing “DON’T EDIT THIS CODE” inside a template file is not acceptable in a project. I especially don’t know what they were thinking when they decided to make “index.php” a 10,000 line piece of code that did everything except the admin.

Obviously there are several CMS and Blog packages for PHP out there that are less than ideal. Some are very big and popular. I’ve seen absolutely useless database connects and queries inside a page that slow it down to a crawl. Major packages with 15+ queries on a blank template page that only returns navigation. You don’t want to install this kind of stuff on a shared host, and if you even get the kind of traffic your blog deserves you’ll be looking for a new package at a really inopportune time….and then figuring out how to migrate your data.

The End

I’m hoping soon we see the end of this kind of thing. The trouble is that PHP is such an easy thing to learn that we won’t. What we need is a crop of fantastic programmers to step up and build something we can be proud of. I have a feeling that won’t happen though. Most of those programmers are employed customizing the blog software that already exists out there for people who found the limitations 🙂